Anthropic shipped a new "web search" feature for their Claude consumer apps today; here are my notes. It's frustrating that they don't share details on whether the underlying index is their own or run by a partner simonwillison.net/2025/Mar/20/...
Posts by Madison May
In addition, many results humans classify as "errors" are the result of underspecification. If the LLM doesn't return what you wanted, perhaps you weren't specific enough with your request.
Some people frequently (rightly) point out that AIs make mistakes and are not fully reliable. Indeed, hallucinations may never be completely solved.
But I am not sure that matters much. Larger models already make far fewer errors, and many real-world processes are built with error-prone humans in mind.
Want to check out the source for the "AlexNet" paper? Google has made the code from Krizhevsky, Sutskever and Hinton's seminal "ImageNet Classification with Deep Convolutional Neural Networks" paper open source, in partnership with the Computer History Museum.
computerhistory.org/press-releas...
If the latest and greatest LLMs aren't effective on your codebase, it may not be the LLMs that are the problem
If you regularly work with math, I can't recommend trying out Corca enough. Corca is a beautiful collaborative math editor, dubbed 'Figma for math,' built by a team that deeply cares about math, science, and their product.
corca.io
We'll be hosting weekly office hours on our Discord server! Our developer relations engineer Cameron will be there to answer questions, talk about AI engineering, and generally chat about what you're building.
Come see us on Tuesday mornings at 8am PST!
DeepSeek does not "do for $6M what cost US AI companies billions". I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors). Sonnet's training was conducted 9-12 months ago, and DeepSeek's model was trained in November/December, while Sonnet remains notably ahead in many internal and external evals. Thus, I think a fair statement is "DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)".
Whoah.. sonnet was *not* distilled
"3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)."
—Dario Amodei
darioamodei.com/on-deepseek-...
These four points on DeepSeek seem very likely correct and important to understand about the economics of building AI models and what DeepSeek actually did, from the CEO of Anthropic. darioamodei.com/on-deepseek-...
About to submit some of the most bonkers papers I've ever been involved in to ICML. It has taken years to get here but I'm so excited...
Dario includes details about Claude 3.5 Sonnet that I've not seen shared anywhere before. Claude 3.5 Sonnet cost "a few $10M's to train". 3.5 Sonnet "was not trained in any way that involved a larger or more expensive model (contrary to some rumors)" - I've seen those rumors; they involved Sonnet being a distilled version of a larger, unreleased 3.5 Opus. Sonnet's training was conducted "9-12 months ago" - that would be roughly between January and April 2024. If you ask Sonnet about its training cut-off it tells you "April 2024" - that's surprising, because presumably the cut-off should be at the start of that training period?
Published some notes on Dario Amodei's new essay on DeepSeek, mainly to highlight some new-to-me details he included about Claude 3.5 Sonnet
simonwillison.net/2025/Jan/29/...
Great read from Dario Amodei on what aspects of DeepSeek's R1 release are most significant:
darioamodei.com/on-deepseek-...
reminder: claude has been thinking for a while. we may never see an explicit reasoning model from anthropic, their CEO has been open about this (2024-09-05) www.interconnects.ai/p/openai-str...
30 mins might almost be preferable -- it would be enough of a wait that I'd be inclined to take the time to carefully lay out my problem before hitting send.
At 2 mins I'm often lazy and treat it like a traditional chat experience, adding context as I go if I don't get the result I want
One downside of R1 / O1 -- they take just enough time I'm likely to context switch and come back later. At 30s or less I might as well wait around for the result, but ~2 mins is an awkward amount of time.
DeepSeek R1 appears to be a VERY strong model for coding - examples for both C and Python here: simonwillison.net/2025/Jan/27/...
Why reasoning models will generalize
DeepSeek R1 is just the tip of the iceberg of rapid progress.
People underestimate the long-term potential of “reasoning.”
OpenAI's Canvas feature got a big upgrade today, turning it into a direct competitor for Anthropic's excellent Claude Artifacts feature - my notes here: simonwillison.net/2025/Jan/25/...
If you don't notice the difference between GPT-4o and o1-pro, you're probably not asking specific enough questions
I am deeply worried by the withdrawal of the US from the World Health Organization. I worked for ~2 years at WHO's Global Programme on AIDS, a worldwide response to the HIV pandemic, where international cooperation was critical. The US should not withdraw from WHO's global health cooperation.
I’m thrilled to share that I’ve finished my Ph.D. at Mila and Polytechnique Montreal. For the last 4.5 years, I have worked on creating new faithfulness-centric paradigms for NLP Interpretability. Read my vision for the future of interpretability in our new position paper: arxiv.org/abs/2405.05386
Ridgeplot of route difficulty posteriors
This would’ve been useful when I wrote that rock climbing post github.com/tpvasconcelo...
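For readers unfamiliar with what a "route difficulty posterior" might look like: one common approach (an illustrative sketch, not the linked package's method) is a Beta-Binomial model of a climber's send probability per route, with a uniform prior. A ridgeplot then stacks one posterior density per route. The route names and send/attempt counts below are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x."""
    coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coeff * x ** (a - 1) * (1 - x) ** (b - 1)

def difficulty_posterior(sends, attempts, grid_size=101):
    """Posterior over send probability for one route, assuming a
    Binomial likelihood and a uniform Beta(1, 1) prior.
    Returns (grid, densities) suitable for one row of a ridgeplot."""
    a, b = 1 + sends, 1 + (attempts - sends)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    dens = [beta_pdf(p, a, b) if 0 < p < 1 else 0.0 for p in grid]
    return grid, dens

# One posterior per (hypothetical) route; a ridgeplot stacks these
# densities vertically, making differences in difficulty easy to scan.
routes = {"V3 slab": (8, 10), "V5 crimp": (3, 9), "V7 roof": (1, 12)}
posteriors = {name: difficulty_posterior(s, n) for name, (s, n) in routes.items()}
```

Harder routes (fewer sends per attempt) put posterior mass at lower send probabilities, so the ridges shift leftward as difficulty rises.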