Anthropic shipped a new "web search" feature for their Claude consumer apps today; here are my notes. It's frustrating that they don't share details on whether the underlying index is their own or run by a partner simonwillison.net/2025/Mar/20/...
Posts by Madison May
In addition, many results humans classify as "errors" are the result of underspecification. If the LLM doesn't return what you wanted, perhaps you weren't specific enough with your request.
Some people frequently (rightly) point out that AIs make mistakes and are not fully reliable. Indeed, hallucinations may never be completely solved.
But I am not sure that matters much. Larger models already make far fewer errors, and many real-world processes are built with error-prone humans in mind.
Want to check out the source for the "AlexNet" paper? Google has made the code from Krizhevsky, Sutskever and Hinton's seminal "ImageNet Classification with Deep Convolutional Neural Networks" paper open source, in partnership with the Computer History Museum.
computerhistory.org/press-releas...
If the latest and greatest LLMs aren't effective on your codebase, it may not be the LLMs that are the problem
If you regularly work with math, I can't recommend trying out Corca enough. Corca is a beautiful collaborative math editor, dubbed 'Figma for math,' built by a team that deeply cares about math, science, and their product.
corca.io
We'll be hosting weekly office hours on our Discord server! Our developer relations engineer Cameron will be there to answer questions, talk about AI engineering, and generally chat about what you're building.
Come see us on Tuesday mornings at 8am PST!
DeepSeek does not "do for $6M what cost US AI companies billions". I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors). Sonnet's training was conducted 9-12 months ago, and DeepSeek's model was trained in November/December, while Sonnet remains notably ahead in many internal and external evals. Thus, I think a fair statement is "DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)".
Whoah.. sonnet was *not* distilled
"3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)."
—Dario Amodei
darioamodei.com/on-deepseek-...
These four points on DeepSeek seem very likely correct and important to understand about the economics of building AI models and what DeepSeek actually did, from the CEO of Anthropic. darioamodei.com/on-deepseek-...
About to submit some of the most bonkers papers I've ever been involved in to ICML. It has taken years to get here but I'm so excited...
Dario includes details about Claude 3.5 Sonnet that I've not seen shared anywhere before. Claude 3.5 Sonnet cost "a few $10M's to train". 3.5 Sonnet "was not trained in any way that involved a larger or more expensive model (contrary to some rumors)" - I've seen those rumors; they involved Sonnet being a distilled version of a larger, unreleased 3.5 Opus. Sonnet's training was conducted "9-12 months ago" - that would be roughly between January and April 2024. If you ask Sonnet about its training cut-off it tells you "April 2024" - that's surprising, because presumably the cut-off should be at the start of that training period?
Published some notes on Dario Amodei's new essay on DeepSeek, mainly to highlight some new-to-me details he included about Claude 3.5 Sonnet
simonwillison.net/2025/Jan/29/...
Great read from Dario Amodei on what aspects of DeepSeek's R1 release are most significant:
darioamodei.com/on-deepseek-...
reminder: claude has been thinking for a while. we may never see an explicit reasoning model from anthropic, their CEO has been open about this (2024-09-05) www.interconnects.ai/p/openai-str...
30 mins might almost be preferable -- it would be enough of a wait that I'd be inclined to take the time to carefully lay out my problem before hitting send.
At 2 mins I'm often lazy and treat it like a traditional chat experience, adding context as I go if I don't get the result I want
One downside of R1 / O1 -- they take just enough time I'm likely to context switch and come back later. At 30s or less I might as well wait around for the result, but ~2 mins is an awkward amount of time.
DeepSeek R1 appears to be a VERY strong model for coding - examples for both C and Python here: simonwillison.net/2025/Jan/27/...
Why reasoning models will generalize
DeepSeek R1 is just the tip of the iceberg of rapid progress.
People underestimate the long-term potential of “reasoning.”
OpenAI's Canvas feature got a big upgrade today, turning it into a direct competitor for Anthropic's excellent Claude Artifacts feature - my notes here: simonwillison.net/2025/Jan/25/...
If you don't notice the difference between GPT-4o and o1-pro, you're probably not asking specific enough questions
I am deeply worried by the withdrawal of the US from the World Health Organization. I worked for ~2 years at WHO's Global Programme on AIDS, a worldwide response to the HIV pandemic, where international cooperation was critical. The US should not withdraw from WHO's global health cooperation.
I’m thrilled to share that I’ve finished my Ph.D. at Mila and Polytechnique Montreal. For the last 4.5 years, I have worked on creating new faithfulness-centric paradigms for NLP Interpretability. Read my vision for the future of interpretability in our new position paper: arxiv.org/abs/2405.05386
Ridgeplot of route difficulty posteriors
This would’ve been useful when I wrote that rock climbing post github.com/tpvasconcelo...
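For readers unfamiliar with what a "route difficulty posterior" might look like: one common approach (an illustrative sketch, not the linked package's method) is a Beta-Binomial model of a climber's send probability per route, with a uniform prior. A ridgeplot then stacks one posterior density per route. The route names and send/attempt counts below are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x."""
    coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coeff * x ** (a - 1) * (1 - x) ** (b - 1)

def difficulty_posterior(sends, attempts, grid_size=101):
    """Posterior over send probability for one route, assuming a
    Binomial likelihood and a uniform Beta(1, 1) prior.
    Returns (grid, densities) suitable for one row of a ridgeplot."""
    a, b = 1 + sends, 1 + (attempts - sends)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    dens = [beta_pdf(p, a, b) if 0 < p < 1 else 0.0 for p in grid]
    return grid, dens

# One posterior per (hypothetical) route; a ridgeplot stacks these
# densities vertically, making differences in difficulty easy to scan.
routes = {"V3 slab": (8, 10), "V5 crimp": (3, 9), "V7 roof": (1, 12)}
posteriors = {name: difficulty_posterior(s, n) for name, (s, n) in routes.items()}
```

Harder routes (fewer sends per attempt) put posterior mass at lower send probabilities, so the ridges shift leftward as difficulty rises.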