Of course the LLM will reinforcement-learn its way towards cheating your test suite somehow, so you'll need to stay vigilant of this. Maybe something like what we do in automated student assessment - e.g., a combination of seen and unseen test cases.
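A hedged sketch of what that seen/unseen split could look like, in Python. All names here (`slugify`, the file layout) are invented for illustration: the agent can read and run the "seen" tests, while a holdout file lives outside its working tree and only runs in CI.

```python
# Hypothetical setup: tests/seen/ is visible to the coding agent,
# holdout/ is kept out of the repo it works in and only runs in CI.

def slugify(text: str) -> str:
    """Implementation the agent produced (and might have gamed)."""
    return "-".join(text.lower().split())

# tests/seen/test_slugify.py -- the agent optimizes against these
def test_seen_basic():
    assert slugify("Hello World") == "hello-world"

# holdout/test_slugify_hidden.py -- never shown to the agent
def test_unseen_edge_cases():
    assert slugify("  leading and trailing  ") == "leading-and-trailing"
    assert slugify("MiXeD CaSe") == "mixed-case"
```

If the agent starts special-casing the seen inputs, the holdout tests catch it, much like unseen test cases catch hard-coded answers in student submissions.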
Posts by Philipp Leitner
(Some form of) TDD seems to be a fairly natural fit for vibe coding. You and the LLM both need a way to validate what the AI has implemented. A good, complete, and ideally simple test suite written upfront seems to be a natural way to provide that.
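A minimal sketch of "tests as spec" in this style, assuming Python; `parse_duration` and its behavior are invented for illustration. The tests are written first and handed to the model as the acceptance criterion; an implementation like the one below is only accepted once they pass.

```python
# Written *before* any implementation; prompt the LLM until these pass.
def test_parse_duration():
    assert parse_duration("90s") == 90
    assert parse_duration("1h30m") == 5400

# One implementation the model might produce to satisfy the spec.
def parse_duration(s: str) -> int:
    """Convert a string like '1h30m' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    total, num = 0, ""
    for ch in s:
        if ch.isdigit():
            num += ch
        else:
            total += int(num) * units[ch]
            num = ""
    return total
```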
I got interested and asked Claude to do a topic map of ICSE 2016 vs 2026:
The older I get, the more convinced I become that our entire way of organizing capitalism (through stock markets and large international corporations) is fundamentally flawed and will, soon, flop on its belly. I just hope we are not standing beneath it when it does (but we likely will be).
“AI job losses” is just a code word for “job losses from a global recession, driven by the USA's criminally unregulated financial market”.
Some holiday news - paper accepted in Future Generation Computing Systems (FGCS):
doi.org/10.1016/j.fu...
Don't get me wrong, it's hard not to root for Expedition 33 and their super humble and nice team, but some of these awards are a bit sus. Is it even an indie game? Best art direction in a year with freaking Silksong?
www.pcgamer.com/games/clair-...
Still, I found it fun to observe how PC Gamer dealt with their reviewing snafu over time. In the weeks after release, when it became clear that E33 would be a *big deal*, they took it with humor, but as time went on they adopted more of a policy of "the review shall not be mentioned henceforth".
(Not actual criticism of PC Gamer. Reviewing is inherently subjective, and honestly Expedition 33 is kind of a weird and difficult-to-review game.)
Remember when @pcgamer.com gave #expedition33 70% in their review, calling the game "rarely as fun as it looks"?
Yeah, that didn't age great. #expedition33 is now officially the most decorated game at the Game Awards, ever.
www.pcgamer.com/games/rpg/cl...
Reminder - I still have an opening for a postdoc in my lab (closing date is in one week):
www.chalmers.se/en/about-cha...
It goes against common sense, everything we know about economics tells us it shouldn't work, there is no serious data that suggests it works, and yet it forms the basis of all economic decision making in the west.
Why? Because it would be awfully convenient for the people in power if it *did* work.
At some point in the future, people will read about trickle-down economics and have the same confused reaction that we have when learning how universally accepted catholic indulgences were in the Middle Ages.
I'm nowhere close to a financial expert, but how these things usually go is that nobody is *obviously* overleveraged, but everyone depends on everyone else, and once the first dominoes start to fall it triggers a chain reaction at the end of which the banks' "safe investments" suddenly appear hazardous.
That's a fairly common "playing both sides" argument. Productivity+++, but really nobody needs to worry about their job. These things are not likely to be true at the same time.
I have a new job ad for a postdoc out:
www.chalmers.se/en/about-cha...
Application deadline: Dec. 18th
Find out more about the work of my lab: icet-lab.eu
I heard the term "spec-based programming" from a colleague for the paradigm where you really only provide and refine requirements, and do not care at all about the code. I don't think the tools I am using are there yet.
IDK. My definition of vibe coding is "coding based almost exclusively on prompts, with minimal or no manual editing afterwards". Not sure if that is a standard definition, but it feels right.
(8) An interesting mind shift happens when you vibe code a lot. Code turns into a kind of transient artifact that you just aren't very attached to. Is the code messy? Who cares (as long as it works), you aren't looking very much at it anyway.
This has strong implications for security, safety, etc.
(7) Overall, the final system turns kind of messy, but realistically so did all other research prototypes I implemented by hand. But now, nobody, not even me, really understands the messy system.
(6) Planning mode is great. Claude is surprisingly good at creating, updating, and evaluating a plan of what to do. Complex changes became much more feasible once I started working more with planning mode upfront.
(5) For non-trivial code, you'll still need decent understanding of the solution space. I feel like some of the more hairy implementation issues I could only solve because I implemented similar systems in the past, and could prompt the AI with *very* fine-grained designs.
(4) Somewhat relatedly, AI loves to generate tests alongside changes (good) but they are often not very useful. They often stub out all business logic, turning them into classic "Python isn't broken" kind of tests. Getting it to write (and keep!) useful end-to-end tests seems surprisingly hard.
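To illustrate the difference (with invented names; `summarize` is not from the actual project): a "Python isn't broken" test mocks out the business logic, so it only verifies that the mock returns what we told it to, while a useful test exercises the real code path.

```python
from unittest import mock

def summarize(results):
    """The business logic under test (invented example)."""
    return {"mean": sum(results) / len(results)}

def test_summarize_stubbed():
    # "Python isn't broken": the real function is patched away, so this
    # passes no matter what summarize() actually does.
    with mock.patch(__name__ + ".summarize", return_value={"mean": 4.0}):
        assert summarize([1, 2, 3]) == {"mean": 4.0}

def test_summarize_real():
    # Exercises the actual implementation end to end.
    assert summarize([2.0, 4.0, 6.0]) == {"mean": 4.0}
```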
(3) Do. Not. Trust. the AI when it declares success. Whether something actually *works* you need to check yourself. I'll leave this example here - the AI broke 25 tests, and decided after fixing one of the failures that the rest probably wasn't its fault.
(2) Validation is king, but also very hard. Again, these tools produce a lot of code. I quickly realized that reviewing it line-by-line is unrealistic. It may be more realistic when doing small changes in an established system, but in greenfield dev you have to go with the flow.
Lesson Learned (1): you feel more productive than you truly are. These tools produce *a lot* of code in a short time, but if you take a step back after a few weeks you notice that a fair bit of it wasn't actually that useful. It still takes time to build something that actually works, and not just 75%.
I initially used Gemini (in the console), but eventually moved on to Claude Console. They seem similar, but results from Claude were subjectively better, the tooling seems more mature, and the rate limits allowed me to work without much interruption. I am using the Pro subscription for USD 25 / month.
For the last couple of weeks I have been trying to vibe-code a relatively complicated research system in the area of Java microbenchmarking in my spare time.
I am slowly reaching the point where the system does something useful, so here are some initial impressions:
People talk a lot about echo chambers on here, but I think it's important to remember that you are not entitled to anybody's attention, no matter how important you or your cause are.
People are saying that AI will transform the way we teach and learn. It has already transformed the way students cheat and, to my surprise, how they apologize for cheating.