There's rightly a lot of excitement around Karpathy's autoresearch. We've been studying this at ARG for a couple of years now: what happens when you put an AI agent in a loop and let it run experiments, evaluate results, and iterate without you.
We've built a bunch of benchmarks and tools to measure this. 🧵
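A minimal sketch of the loop described above, with a toy random-search "agent" standing in for an LLM. Every name and number here is an invented placeholder, not ARG's actual code:

```python
import math
import random

def propose_experiment(history):
    # A real autoresearch agent would prompt an LLM with the run history;
    # this toy stand-in just samples a new learning rate to try.
    return {"lr": 10 ** random.uniform(-5, -1)}

def run_experiment(config):
    # Stand-in for an actual training run: a noisy score peaking near lr=1e-3.
    return -(math.log10(config["lr"]) + 3) ** 2 + random.gauss(0, 0.1)

history = []
best = None
for _ in range(20):
    config = propose_experiment(history)   # agent proposes an experiment
    score = run_experiment(config)         # experiment runs
    history.append((config, score))        # result feeds the next iteration
    if best is None or score > best[1]:
        best = (config, score)

print(f"best config: {best[0]}, score: {best[1]:.3f}")
```

An actual autoresearch agent would replace the stub proposer with a model that reads the full run history; the propose/run/evaluate shape stays the same.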
Posts by Matthew Kenney
bsky.app/profile/algo...
Over the past ~2 years I’ve been working hard on agents, models, and datasets to understand what recursive self-improvement might look like, and what supporting infrastructure this line of research might need.
Very excited to open-source some of that work, starting today.
Very excited to launch this little tool that we’ve been building. ScoutML is an API built for AI researchers and agents that includes a ton of metadata on each paper. It’s been super helpful for us as we run our research agents internally.
x.com/algoresearch...
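For illustration, here is a hedged sketch of what querying a paper-metadata API like ScoutML might look like. The endpoint, parameters, and response fields below are all invented placeholders, not the real interface; check the actual docs:

```python
import requests

# Placeholder URL and field names; the real ScoutML interface may differ.
resp = requests.get(
    "https://api.scoutml.example/papers",
    params={"query": "LLM merging", "fields": "models,datasets,gpu_counts"},
    timeout=10,
)
resp.raise_for_status()
for paper in resp.json().get("papers", []):
    print(paper.get("title"), "-", paper.get("gpu_counts"))
```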
If you're working on ML and this resonates, I’d love to hear what you'd want it to do. We're opening up a limited beta. Link below: prospectml.com
It’s built on top of a foundation of parsed metadata from papers, code, and repos—models, metrics, datasets, SOTA claims, GPU counts (and types), ablation studies, citations, etc. It’s already become crucial to our internal research, and we hope it can be helpful to others, too.
It’s designed to support that murky, nonlinear part of the research process, where you're still figuring out what's interesting.
You give it a question like “How can we improve generalization in low-resource RL?” and it returns distilled insights, speculative ideas, and experimental code. Not final answers, just something to push the thinking forward.
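A hypothetical sketch of that interaction: a question in, structured leads out. Nothing here is ProspectML's real API; the types and stubs just show the shape described above:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchLeads:
    # The three kinds of output described above, as a simple container.
    insights: list[str] = field(default_factory=list)
    speculative_ideas: list[str] = field(default_factory=list)
    experiment_code: list[str] = field(default_factory=list)

def explore(question: str) -> ResearchLeads:
    # A real system would distill these from parsed paper metadata;
    # this scaffold only shows the shape of the output.
    return ResearchLeads(
        insights=[f"(stub) what the literature says about: {question}"],
        speculative_ideas=["(stub) an untested direction worth trying"],
        experiment_code=["(stub) starter script for a small experiment"],
    )

leads = explore("How can we improve generalization in low-resource RL?")
print(leads.insights[0])
```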
Most of the time, I end up manually digging through papers, chasing links, and piecing together ideas. It works, but it’s slow, and it doesn’t scale with curiosity. I’ve been trying to fix that with a platform we're building called ProspectML.
A lot of ML tools help you implement. Not many help you think.
When I’m exploring a new research direction, I don’t want another search engine or citation graph. I want something that’s actually read the literature, can suggest promising directions, and helps me reason through tradeoffs.
hello world!
ARG is on Bluesky! Please follow here: @algoresearch.bsky.social
Back in Pennsylvania, drinking Schuylkill County coal cracker (boilo) and making pierogies
That’s because it was from 2022
AI for science could be more impactful than chatbots. It is already helping win Nobel prizes and accelerating drug development and materials discovery.
Today we published an essay about it: why it matters, how it’s happening, and its implications. Here’s a summary from an econ/social-sci lens.
Important point: the open protocol makes extracting data from Bluesky easy. Can't have it both ways. I like the protocol and think this site is well designed, but that means anyone can and will analyze these posts (if there's value in them, which I'm honestly less convinced of than some).
A dataset of 1 million or 2 million Bluesky posts is completely irrelevant to training large language models.
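A back-of-envelope check on that claim; the token counts below are rough assumptions, not measurements:

```python
# Roughly how much text is 2M short social posts next to an LLM pretraining run?
posts = 2_000_000
tokens_per_post = 40               # generous guess for short Bluesky posts
corpus = posts * tokens_per_post
pretraining = 15_000_000_000_000   # ~15T tokens, the order of recent open models

print(f"{corpus:,} tokens = {corpus / pretraining:.6%} of a ~15T-token run")
# -> 80,000,000 tokens = 0.000533% of a ~15T-token run
```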
The primary use case for the datasets that people are losing their shit over isn't ChatGPT, it's social science research and developing systems that improve Bluesky.
Wait, what even is this platform? This is insane
What! If it works for umap-learn vs. umap, I'm in.
I have a community project in Eleuther and open-source all of my research:
bsky.app/profile/bayk...
Jk, the rest are great. Just a big Uncle Nearest fan
We welcome PRs, contributions, additional tasks, and task revisions. Excited to see how agents perform on this benchmark.
We develop a baseline agent with tools for coding, research (via Semantic Scholar), and model training, built on top of Sonnet 3.5 and GPT-4o. Our baseline agent performs well across tasks but generally fails to move beyond baseline implementations.
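A hedged sketch of what that tool layout might look like; the function names, stubs, and dispatch below are invented placeholders, not the released agent's code:

```python
def search_papers(query: str) -> str:
    # A real agent would call the Semantic Scholar API here.
    return f"(stub) top results for: {query}"

def write_code(spec: str) -> str:
    return f"(stub) generated code for: {spec}"

def train_model(config: str) -> str:
    return f"(stub) launched training run with: {config}"

# Registry mapping tool names to callables.
TOOLS = {"research": search_papers, "code": write_code, "train": train_model}

def dispatch(tool_name: str, arg: str) -> str:
    # The LLM (e.g. Sonnet 3.5 or GPT-4o) would choose tool_name and arg;
    # here we call the registry directly.
    return TOOLS[tool_name](arg)

print(dispatch("research", "LLM merging methods"))
```

In the real agent the model picks the tool and argument at each step; a registry plus dispatch is just one common way to wire that up.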
ML Research Benchmark adapts tasks from ML conference competitions like ‘NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day’ and ‘LLM Merging Competition’. We prompt agents to complete these challenging tasks, which go well beyond simple ML exercises.
(re-posting from X)
Can we get AI to accelerate AI research and development?
I’m excited to release ML Research Benchmark, an agentic benchmark of 7 ML conference competition tasks.
Paper: arxiv.org/abs/2410.22553
Tasks: github.com/AlgorithmicR...
Agent: github.com/AlgorithmicR...
Maxo Kream