Of course the LLM will reinforcement-learn its way towards cheating your test suite somehow, so you'll need to stay vigilant of this. Maybe something like what we do in automated student assessment - e.g., a combination of seen and unseen test cases.
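A hedged sketch of what that seen/unseen split could look like, in Python. All names here (`slugify`, the file layout) are invented for illustration: the agent can read and run the "seen" tests, while a holdout file lives outside its working tree and only runs in CI.

```python
# Hypothetical setup: tests/seen/ is visible to the coding agent,
# holdout/ is kept out of the repo it works in and only runs in CI.

def slugify(text: str) -> str:
    """Implementation the agent produced (and might have gamed)."""
    return "-".join(text.lower().split())

# tests/seen/test_slugify.py -- the agent optimizes against these
def test_seen_basic():
    assert slugify("Hello World") == "hello-world"

# holdout/test_slugify_hidden.py -- never shown to the agent
def test_unseen_edge_cases():
    assert slugify("  leading and trailing  ") == "leading-and-trailing"
    assert slugify("MiXeD CaSe") == "mixed-case"
```

If the agent starts special-casing the seen inputs, the holdout tests catch it, much like unseen test cases catch hard-coded answers in student submissions.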
Posts by Philipp Leitner
(Some form of) TDD seems to be a fairly natural fit for vibe coding. You and the LLM both need a way to validate what the AI has implemented. A good, complete, and ideally simple test suite written upfront seems to be a natural way to provide that.
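A minimal sketch of "tests as spec" in this style, assuming Python; `parse_duration` and its behavior are invented for illustration. The tests are written first and handed to the model as the acceptance criterion; an implementation like the one below is only accepted once they pass.

```python
# Written *before* any implementation; prompt the LLM until these pass.
def test_parse_duration():
    assert parse_duration("90s") == 90
    assert parse_duration("1h30m") == 5400

# One implementation the model might produce to satisfy the spec.
def parse_duration(s: str) -> int:
    """Convert a string like '1h30m' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    total, num = 0, ""
    for ch in s:
        if ch.isdigit():
            num += ch
        else:
            total += int(num) * units[ch]
            num = ""
    return total
```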
I got interested and asked Claude to do a topic map of ICSE 2016 vs 2026:
The older I get, the more convinced I become that our entire way of organizing capitalism (through stock markets and large international corporations) is fundamentally flawed and will, soon, flop on its belly. I just hope we are not standing beneath it when it does (but we likely will be).
“AI job losses” is just a code word for “job losses from a global recession, driven by the USA's criminally unregulated financial market”.
Some holiday news - paper accepted in Future Generation Computing Systems (FGCS):
doi.org/10.1016/j.fu...
Don't get me wrong, it's hard not to root for Expedition 33 and their super humble and nice team, but some of these awards are a bit sus. Is it even an indie game? Best art direction in a year with freaking Silksong?
www.pcgamer.com/games/clair-...
Still, I found it fun to observe how PC Gamer dealt with their reviewing snafu over time. In the weeks after release, when it became clear that E33 would be a *big deal*, they took it with humor, but as time went on they adopted more of a policy of "the review shall not be mentioned henceforth".
(Not actual criticism of PC Gamer. Reviewing is inherently subjective, and honestly Expedition 33 is kind of a weird and difficult-to-review game.)
Remember when @pcgamer.com gave #expedition33 70% in their review, calling the game "rarely as fun as it looks"?
Yeah, that didn't age great. #expedition33 is now officially the most decorated game at the Game Awards, ever.
www.pcgamer.com/games/rpg/cl...
Reminder - I still have an opening for a postdoc in my lab (closing date is in one week):
www.chalmers.se/en/about-cha...
It goes against common sense, everything we know about economics tells us it shouldn't work, there is no serious data that suggests it works, and yet it forms the basis of all economic decision making in the west.
Why? Because it would be awfully convenient for the people in power if it *did* work.
At some point in the future, people will read about trickle-down economics and have the same confused reaction that we have when learning how universally accepted catholic indulgences were in the Middle Ages.
I'm nowhere close to a financial expert, but how these things usually go is that nobody is *obviously* overleveraged, but everyone depends on everyone else, and once the first dominoes start to fall it triggers a chain reaction at the end of which the banks' "safe investments" suddenly appear hazardous.
That's a fairly common "playing both sides" argument. Productivity+++, but really nobody needs to worry about their job. These things are not likely to be true at the same time.
I have a new job ad for a postdoc out:
www.chalmers.se/en/about-cha...
Application deadline: Dec. 18th
Find out more about the work of my lab: icet-lab.eu
I heard the term "spec-based programming" from a colleague for the paradigm where you really only provide and refine requirements, and do not care at all about the code. I don't think the tools I am using are there yet.
IDK. My definition of vibe coding is "coding based almost exclusively on prompts, with minimal or no manual editing afterwards". Not sure if that is a standard definition, but it feels right.
(8) An interesting mind shift happens when you vibe code a lot. Code turns into a kind of transient artifact that you just aren't very attached to. Is the code messy? Who cares (as long as it works), you aren't looking very much at it anyway.
This has strong implications for security, safety, etc.
(7) Overall, the final system turns kind of messy, but realistically so did all other research prototypes I implemented by hand. But now, nobody, not even me, really understands the messy system.
(6) Planning mode is great. Claude is surprisingly good at creating, updating, and evaluating a plan of what to do. Complex changes became much more feasible once I started working more with planning mode upfront.
(5) For non-trivial code, you'll still need decent understanding of the solution space. I feel like some of the more hairy implementation issues I could only solve because I implemented similar systems in the past, and could prompt the AI with *very* fine-grained designs.
(4) Somewhat relatedly, AI loves to generate tests alongside changes (good) but they are often not very useful. They often stub out all business logic, turning them into classic "Python isn't broken" kind of tests. Getting it to write (and keep!) useful end-to-end tests seems surprisingly hard.
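To illustrate the difference (with invented names; `summarize` is not from the actual project): a "Python isn't broken" test mocks out the business logic, so it only verifies that the mock returns what we told it to, while a useful test exercises the real code path.

```python
from unittest import mock

def summarize(results):
    """The business logic under test (invented example)."""
    return {"mean": sum(results) / len(results)}

def test_summarize_stubbed():
    # "Python isn't broken": the real function is patched away, so this
    # passes no matter what summarize() actually does.
    with mock.patch(__name__ + ".summarize", return_value={"mean": 4.0}):
        assert summarize([1, 2, 3]) == {"mean": 4.0}

def test_summarize_real():
    # Exercises the actual implementation end to end.
    assert summarize([2.0, 4.0, 6.0]) == {"mean": 4.0}
```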
(3) Do. Not. Trust. the AI when it declares success. Whether something actually *works* you need to check yourself. I'll leave this example here - the AI broke 25 tests, and decided after fixing one of the failures that the rest probably wasn't its fault.
(2) Validation is king, but also very hard. Again, these tools produce a lot of code. I quickly realized that reviewing it line-by-line is unrealistic. It may be more realistic when doing small changes in an established system, but in greenfield dev you have to go with the flow.
Lesson Learned (1): you feel more productive than you truly are. These tools produce *a lot* of code in a short time, but if you take a step back after a few weeks you notice that a fair bit of it wasn't actually that useful. It still takes time to build something that actually works, and not just 75%.
I initially used Gemini (in the console), but eventually moved on to Claude Console. They seem similar, but results from Claude were subjectively better, the tooling seems more mature, and the rate limits allowed me to work without much interruption. I am using the Pro subscription for USD 25 / month.
For the last couple of weeks I have been trying to vibe-code a relatively complicated research system in the area of Java microbenchmarking in my spare time.
I am slowly reaching the point where the system does something useful, so here are some initial impressions:
People talk a lot about echo chambers on here, but I think it's important to remember that you are not entitled to anybody's attention, no matter how important you or your cause are.
People are saying that AI will transform the way we teach and learn. It has already transformed the way students cheat and, to my surprise, how they apologize for cheating.