Great speakers at this symposium - 22 April 2026 at LSHTM. Join online or in person (email if in person) to hear great talks at the cutting edge of Inequalities in infectious disease dynamics. www.lshtm.ac.uk/newsevents/e...
Posts by Spencer J Fox
I need this to be real, so I can get around NAU's rules against spending funds on coffee machines/accessories
Based on my understanding of AI, I don't think having it "think" longer will address any of the issues I faced building the app, so I think there will need to be some major breakthroughs to get there (e.g. moving beyond LLMs or imbuing them with new capabilities)
I can absolutely see how AI will make coding, and even my research, more productive, but I don't see it replacing individuals for important tasks in the near term, and any use of it for research purposes should be heavily audited before it's trusted
Maybe I'll come back to vibe-coding the app again, but I'm not sure. In the end I produced something that looked like it worked, but couldn't even perform simple calculations accurately.
This week I ended the experiment and finally opened up Excel again... It was a breath of fresh air: I knew the numbers I entered would save properly, I could fully audit calculations, and I was ultimately able to get my budgets organized
I thought I was making progress; the AIs were constantly finding new bugs! But then I tried to use it again and immediately found new errors related to salary calculations and a number of things hidden within the app.
All of this made me lose all confidence in the app; I wasn't going to spend a bunch of time inputting my real information unless I thought it would actually work. So I spent about a month working with the AIs to simplify the app and debug everything.
The software was way outside my understanding (Rust, SQL databases, etc.), and I had no way to dig into what was going on under the hood. I would enter numbers, save, reload, and find that the numbers had changed
Then about a month ago (and after I confirmed with both AIs that the app was ready for primetime) I started really using it. Immediately I ran into issues: the app had lied to me about database structures, saving features, loading features, budget calculations, basically everything.
On the good side, the app was built pretty quickly and it was really cool to see the interface and play around with the features. It was incredible to produce an app just for myself, and I even started having hopes of getting it good enough that I could post it online for others in academia to use
I found myself spending this vibe-coding time "coding" while also answering emails, which wasn't a terrible use of time, but I would rather have been writing or thinking.
First, the experience is probably about as bad as it can get. Your agent works long enough that you have to try to work on other things while it's "thinking," but not long enough for you to get anything meaningful done in that time
I probably spent about 2 months working on the app on and off, adding and removing features, testing that it was working, etc. Overall it was probably about 40 hours of work across that time period, with both Codex and Claude editing the app and critiquing each other's work
This is something I currently handle with a fairly simple Excel spreadsheet, one I was hoping to move beyond since I thought an app would be more convenient, flexible, and ultimately accurate.
Starting from scratch, I worked with the AI to develop something that really should not be hard to build: an app that lets me enter information about various grants, personnel costs, and expenses, and then tracks the spend so I know how much money I have and where it is
To really test the AI, I worked with it to vibe-code a grant planning app, since budgeting grant spend-down and personnel salary costs within university environments is basically the bane of my existence (numbers and grants here are fake)
Double-checking for small and often really dumb errors can take longer than coding things up myself, so I actually found I was using the AI tools only for limited coding tasks (e.g. I code up the full pipeline and have the AI write code in specific areas to do specific tasks)
As you get further towards full agency, double-checking becomes really important. When the tools were developing their own models, I needed to look through the code closely to identify both common and edge-case errors that would have caused issues if I had used them for real or research purposes
For example, I have pipelines for producing infectious disease forecasts weekly. I can confidently use these tools to modify the code to use the data in different ways, with slight model variants, and even some simple, totally new model variants that I have the AI develop on its own
In my testing I found that these tools work really well when I have existing code that needs to be modified, e.g. aesthetic changes to tools, slight model modifications, new data inputs, data cleaning changes, etc. Even refactoring a whole code repository goes pretty smoothly
I have been playing with Codex and Claude extensively for about 6 months now, and while they are clearly impressive, I don't see them replacing software engineers in the near future and I don't understand how people can trust their work without knowing how to code
I'm here for the vibes! Would love to chat more :)
Yeah, I'll take a closer look at the basket index! Ensembling baselines is an interesting idea, but what would the point be? Capturing different reference opinions or improving performance?
I really like this idea of multiple baselines with different contexts. Some relationship to the basket-of-baselines idea I was chatting about a while ago:
community.epinowcast.org/t/a-basket-o...
Congratulations to Fox Lab PhD student @ehsansuez.bsky.social for his hard work on this project!
Our recommendation:
Transparency and the use of multiple baseline models:
"...include an optimal flatline model as a stringent benchmark, a robust flatline model, and a seasonal baseline to test whether information from the current season improves predictions."
Why this matters:
Forecast hubs rely heavily on baselines for:
• benchmarking models
• ranking performance
If the baseline shifts, the leaderboard can change too. Our work is complementary and supports the conclusions from previous work: www.medrxiv.org/content/10.1...
Interestingly, this flatline model outperformed seasonal models even with highly seasonal data
Our main finding: Performance differed substantially across variations AND a flatline model that uses the ten most recent transformed observations for training outcompeted all others across all diseases
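For readers curious what a flatline baseline along those lines might look like: below is a minimal sketch, not the paper's exact implementation. It assumes a log1p transform and builds uncertainty from symmetrized one-step differences of the ten most recent observations (the function name, defaults, and sampling scheme are my own illustrative choices).

```python
import numpy as np

def flatline_forecast(obs, window=10, horizon=1,
                      quantiles=(0.025, 0.5, 0.975),
                      n_draws=5000, seed=0):
    """Sketch of a flatline baseline forecast.

    Point forecast stays flat at the last observation; uncertainty
    comes from the empirical one-step differences of the `window`
    most recent log1p-transformed observations, symmetrized around
    zero so the median forecast remains flat.
    """
    y = np.log1p(np.asarray(obs, dtype=float)[-window:])  # transform
    diffs = np.diff(y)
    sym = np.concatenate([diffs, -diffs])  # symmetrize the step distribution
    rng = np.random.default_rng(seed)
    # accumulate horizon steps of sampled differences onto the last value
    steps = rng.choice(sym, size=(n_draws, horizon), replace=True)
    draws = y[-1] + steps.sum(axis=1)
    return np.expm1(np.quantile(draws, quantiles))  # back-transform
```

On a perfectly flat series the forecast collapses to the last observation with zero-width intervals; on noisier series the intervals widen with the observed step-to-step variability, which is what makes it a surprisingly strong benchmark.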