I've recently been getting invitations to talk about how to use AI tools to assist with TCS research. It's something I've been doing a lot, but I don't have structured thoughts about how to explain the process. But I'm going to try -- the first such talk is tomorrow: t.co/wlHPBzXzDm
AI Agents like Codex are very good at figuring out taxes, including obscure local ones that Intuit doesn't bother with (looking at you, Philadelphia local taxes). Businesses that provide financial/legal services that involve reasoning through dense but public documentation are in trouble.
Announcing the #ICML2026 tutorials!
All ten tutorials will be presented the first day of the conference, Monday July 6.
Read the blog post for more details on the selection process!
blog.icml.cc/2026/04/02/a...
So many interesting things here. (N.b. I get to think about interfaces for this all day long at work :))
One thing I find interesting here is how similar the real work of science is to that of the humanities, both of which are centered around human judgement about what is relevant and interesting.
Alpha_0 joke from Dogman
Very cool work. Empirical science has many researcher degrees of freedom, which makes specific studies hard to interpret --- each is only a single trajectory through the data-analysis multiverse. Human researchers are opaque. But with agents you can explore the whole space!
My favorite translation use case is from informal English to precise mathematics.
The paper is here: arxiv.org/abs/2602.23360 and is joint work with Eric Eaton, @surbhigoel.bsky.social, @marcelhussing.bsky.social, @mkearnsphilly.bsky.social, @sikatasengupta.bsky.social and @optimistsinc.bsky.social
Via this simple argument, we get out-of-the-box agreement theorems about neural networks (parameterized by size), trees (parameterized by depth), gradient boosting (parameterized by iterations), and stacking (parameterized by ensemble size). No distributional assumptions needed.
No matter what the Bayes error is, no matter how complicated the problem, the learning curve is monotonically decreasing and bounded above and below. So no matter what its shape is, it can't avoid "flatness" for a long stretch. Whenever you get flatness you get agreement.
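To make the flatness step explicit (my sketch of the pigeonhole argument, not notation from the paper): writing $\mathrm{OPT}_n$ for the best achievable error at complexity $n$, the total drop across $T$ doublings telescopes and is bounded by $\mathrm{OPT}_n$, so some doubling step must be nearly flat:

$$\min_{1 \le t \le T} \big(\mathrm{OPT}_{2^{t-1} n} - \mathrm{OPT}_{2^{t} n}\big) \;\le\; \frac{\mathrm{OPT}_n - \mathrm{OPT}_{2^T n}}{T} \;\le\; \frac{\mathrm{OPT}_n}{T}.$$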
This lets us control out-of-the-box disagreement by the "local learning curve" --- how much can you decrease error by doubling the complexity of your model class? If the answer is not much, then you get out-of-the-box agreement. The magic is that this is guaranteed to happen.
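In symbols (a sketch, assuming the squared-loss form of the ambiguity decomposition; the paper's statement may be more general): if $f_1$ and $f_2$ are each $\varepsilon$-optimal at complexity $n$ and their average $\bar f = (f_1 + f_2)/2$ has complexity at most $2n$, then $\mathrm{err}(\bar f) \ge \mathrm{OPT}_{2n}$, so

$$\tfrac{1}{4}\,\mathbb{E}\big[(f_1 - f_2)^2\big] \;=\; \tfrac{1}{2}\big(\mathrm{err}(f_1) + \mathrm{err}(f_2)\big) - \mathrm{err}(\bar f) \;\le\; \mathrm{OPT}_n + \varepsilon - \mathrm{OPT}_{2n}.$$

The right-hand side is exactly the drop in the local learning curve (plus the optimization slack): flat curve, small disagreement.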
The more interesting case: most model classes are not convex. But we can parameterize a family of model classes by a measure of complexity: size, depth, etc. Usually, the average of two models of complexity n is a model in the same class but with higher complexity, say 2n.
An easy case: If our shared model class is convex, then the average of our two models is also in our model class. So if both of us trained to approximate optimality within our model class, there just isn't room for improvement within the same class; we have to mostly agree.
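Concretely (a sketch, with squared loss and $\mathrm{OPT}$ the best error in the convex class): if $f_1$ and $f_2$ are both $\varepsilon$-optimal, their average $\bar f$ stays in the class, so $\mathrm{err}(\bar f) \ge \mathrm{OPT}$, and the ambiguity decomposition gives

$$\tfrac{1}{4}\,\mathbb{E}\big[(f_1 - f_2)^2\big] \;=\; \tfrac{1}{2}\big(\mathrm{err}(f_1) + \mathrm{err}(f_2)\big) - \mathrm{err}(\bar f) \;\le\; (\mathrm{OPT} + \varepsilon) - \mathrm{OPT} \;=\; \varepsilon.$$

So the two models can differ by at most $4\varepsilon$ in expected squared prediction distance.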
But we can think about it the other way around. What if I imagined the ensemble of your model and mine. If this imagined ensemble would not substantially decrease error, then it must be that your model and mine already mostly agree. When does averaging not reduce error?
The "Ambiguity Decomposition" is a fact that comes from ensembling theory. It says that if I average two models, the error of the ensemble is equal to the average of the model errors, less the "disagreement" of the two models. The goal is usually low error for the ensemble.
Neural networks are highly non-convex, so approximate error minimizers need not look anything like each other in parameter space. But we show that nevertheless (for many model sizes) approximate error minimizers must closely agree in function/prediction space despite this!
Agree it's a necessary component, just not a sufficient one. As AI progresses the problem of it producing incorrect proofs will diminish, but that doesn't solve the problem of overloading our publication venues beyond their capacity to serve as effective attention filters.
AI is bringing a sea change in scientific research methodology, training, and peer review. Amazon Scholars and Penn professors @mkearnsphilly.bsky.social and @aaroth.bsky.social on what agentic AI tools mean for the next generation of researchers.
Michael (@mkearnsphilly.bsky.social) and I wrote a blog post about our experiences using AI for research, and our thoughts on what these developments will mean for research, publication, and education: www.amazon.science/blog/how-ai-...
Which is the better model for math? GPT 5.2, or GPT 5.3 codex (either one on high reasoning)?
Sometimes you gotta split the difference. From Aaron Roth's (@aaroth.bsky.social) plenary talk at #ALT2026
Did you just miss punching your ticket to Rio or Salt Lake City? Wanna go to a conference where people will engage with you and your paper on foundations of responsible computing, and you won't get lost in the crowd?
Submit to #FORC2026, in Boston in June! Deadline in 2 weeks.
The other paper accepted to @iclr-conf.bsky.social 2026 🇧🇷. Our work on replicable RL sheds some light on how to consistently make decisions in RL.
@ericeaton.bsky.social @mkearnsphilly.bsky.social @aaroth.bsky.social @sikatasengupta.bsky.social @optimistsinc.bsky.social
I try to avoid posting about politics here, but I feel compelled to say some things that should be obvious: 🧵
The NSF has played a key role in American science, and risks being collateral damage in the war against science.
#econsky #academicsky #NSF #science
marketdesigner.blogspot.com/2026/01/hist...
The paper is here: arxiv.org/abs/2601.05245 It's joint work with @ncollina.bsky.social, Jiuyao Lu, and George Noarov. Natalie and George are on the job market --- check them out. www.seas.upenn.edu/~ncollina/ noarov.com
Not sure of the details, but I believe it's related to the experiment that STOC ran giving feedback with a version of Gemini Deep Think, which got generally positive reviews for critiquing math: research.google/blog/gemini-...
What's wrong with providing access to a fancy LLM to give feedback to authors about their own papers?
But we ended up showing that this is impossible in full generality. The results in the paper also lay out a slightly more nuanced landscape, and there remain some interesting open questions about the power of reductions from multicalibration to marginal calibration. Take a look!