With FlexOlmo, we showed modular pretraining could work—including in settings where data can’t easily be pooled in one place. BAR extends that to post-training.
🤗 Models: buff.ly/rkMWLrM
📝 Blog: buff.ly/i2aljLA
📄 Paper: buff.ly/kNDxvko
One big advantage is targeted upgrades. A better code recipe can improve the code expert; a better math recipe can improve the math expert.
That means improving a model after pretraining doesn’t always have to mean starting over.
At the 7B scale, BAR works better than other common ways of updating a model after pretraining.
It beats methods that train separate models and try to combine them later, & also gets close to the results of retraining from scratch.
We train separate “experts” for different skills, then combine them into one model that learns which expert to use.
That makes post-training more modular, so improving one area doesn’t require retraining the entire model.
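For intuition, here is a minimal sketch of that expert-routing idea. The module names and the top-1 gate are illustrative assumptions, not BAR's actual mechanism:

```python
# Illustrative sketch in PyTorch: a learned gate scores each post-trained
# expert and routes the input to the best one. Names and the top-1 routing
# rule are assumptions for illustration, not BAR's actual implementation.
import torch
import torch.nn as nn

class ExpertRouter(nn.Module):
    def __init__(self, hidden_dim: int, experts: nn.ModuleList):
        super().__init__()
        self.experts = experts                       # e.g. math, code, tool-use experts
        self.gate = nn.Linear(hidden_dim, len(experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden). Score experts from a pooled representation,
        # then run the highest-scoring expert for each example.
        scores = self.gate(x.mean(dim=1))            # (batch, num_experts)
        best = scores.argmax(dim=-1)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)
        # Upgrading one skill means swapping in a better expert module;
        # the rest of the model stays untouched.
        return outputs[torch.arange(x.size(0)), best]
```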
When you improve a model’s skills – like math, code, or tool use – those gains often come with extra cost, retraining, or lost capabilities elsewhere.
BAR takes a more modular + flexible approach.
Last year, we introduced FlexOlmo, a novel way to train parts of a model independently, then combine them later.
BAR builds on that idea for a harder problem: how to keep improving a model without having to retrain each time. 🧵
ScienceWorld + DiscoveryWorld are both open & freely available, because we believe building open evals is as important as building open models.
Read more about them in our latest blog: buff.ly/Qwu5DcR
The field is moving fast. The question isn't whether agents will eventually help treat diseases, discover new materials, & more—it's whether we're being clear-eyed about where they are right now.
That's how progress gets made.
DiscoveryWorld goes further. Agents have to design + run full scientific investigations from scratch: form hypotheses, collect data, & analyze results.
Average human scientists with advanced degrees can complete ~70% of its harder challenges. Very strong AI systems manage ~20%.
That gap between knowing and doing is what our benchmarks ScienceWorld & DiscoveryWorld measure.
ScienceWorld tests agents on elementary-school science experiments. When it launched, top models scored below 10%. As of early 2025, they're in the low 80s. Progress—but still not solved.
There's a pattern in AI: models ace the exam, then fail in the lab.
In 2022, models that got As on multiple-choice science tests failed to carry out more than 90% of those same experiments in a virtual environment.
Knowing what a boiling point is and measuring one aren't the same thing.
Everyone's building AI science agents.
The claims are extraordinary. But when we test whether these systems can actually do science, recent top ones still fail challenges that human scientists can solve the majority of the time. 🧵
🔧 The training code, eval harness, annotation tooling, & demo code are now live: buff.ly/3rm5lrV
📄 And our technical report is on arXiv: buff.ly/dstEnog
⚠️ Previously downloaded our @hf.co data? Please redownload—the datasets have been updated.
We're also releasing the client-side code for our MolmoWeb demo—so you can see how we built the interface that lets you give MolmoWeb a task and watch it navigate websites in real time.
Use it as a starting point for your own web agent UI ↓ buff.ly/zMYlbtc
▸ Our eval harness lets you evaluate agents like MolmoWeb on 4 popular navigation benchmarks including WebVoyager & Online-Mind2Web.
▸ The harness also doubles as a synth data gen pipeline—you can generate web browsing data using LLM-/VLM-powered agents w/ AxTree/screenshot input.
🔹 Our training code has everything you need to customize MolmoWeb for specific tasks.
🔹 The new annotation tool lets you record human task demonstrations, then use the training code to fine-tune MolmoWeb on that data.
MolmoWeb is our open web agent built on Molmo 2. It operates a browser by viewing screenshots and taking action – clicking, typing, and scrolling – the same way a person would.
We launched the model in March. Now we're releasing the rest of the components we used to build it.
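For a sense of how that screenshot-in, action-out loop fits together, here is a minimal sketch using Playwright. The action format and the `propose_action` stub are placeholders, not MolmoWeb's actual interface:

```python
# Minimal sketch of a screenshot-driven browser agent loop (Playwright).
# `propose_action` stands in for the model; MolmoWeb's real action space
# and prompting may differ.
from playwright.sync_api import sync_playwright

def propose_action(screenshot_png: bytes, task: str) -> dict:
    """Placeholder: ask a VLM for the next action given the current screenshot."""
    raise NotImplementedError  # e.g. {"type": "click", "x": 120, "y": 300}

def run(task: str, url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            # Observe the page as pixels, then act like a person would.
            action = propose_action(page.screenshot(), task)
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "scroll":
                page.mouse.wheel(0, action["dy"])
            elif action["type"] == "done":
                break
```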
You can now train, adapt, and eval web agents on your own tasks.
We're releasing the full MolmoWeb codebase—the training code, eval harness, annotation tooling, synthetic data pipeline, & client-side code for our demo. 🧵
As always, everything is openly available:
🤖 Models: huggingface.co/collections/...
📊 Code: github.com/allenai/Wild...
🗂️ Data: huggingface.co/datasets/all...
🎮 Demo: huggingface.co/spaces/allen...
📄 Tech report: allenai.org/papers/wildd...
📝 Blog: allenai.org/blog/wilddet3d
And we’re releasing a demo iOS app powered by WildDet3D.
Point your camera at a scene or upload a photo, select a category or draw a 2D box, & get 3D bounding boxes in real time.
Download it from the App Store: apps.apple.com/us/app/wildd...
www.youtube.com/watch?v=LJPN...
We're also releasing WildDet3D-Data, the largest open 3D detection dataset available w/:
◎ Over 1M images
◎ 3.7M verified 3D annotations
◎ 13K+ object categories
◎ 100K+ human-annotated images
Use it to train robust models that understand objects in 3D.
Most 3D detection evals cover only a handful of object types. The real world has thousands. On our newly proposed in-the-wild evaluation spanning 700+ categories, WildDet3D outperforms the strongest baseline by ~10x.
But does WildDet3D work on things it's never seen?
On out-of-distribution autonomous driving scenes & indoor environments that robots/AR systems need to navigate, it nearly doubles the best prior scores.
The biggest gains are on object categories WildDet3D was never trained on.
On widely-used evals, WildDet3D sets a new state of the art while training with a fraction of the compute used by prior methods.
When a depth sensor is available, it folds that data in automatically for even better results—no architecture changes needed.
Type "firetruck" and WildDet3D finds every one in the scene with a full 3D bounding box.
Tap an object and it does the same. Feed it a 2D detection from another model and it adds the missing 3D information.
That means any vision system can gain spatial awareness.
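Conceptually, all three prompt types map to the same call shape. The sketch below is hypothetical, not WildDet3D's actual API:

```python
# Hypothetical interface sketch, not WildDet3D's actual API: one call,
# three ways to prompt it, all returning 3D boxes in camera coordinates.
from dataclasses import dataclass

@dataclass
class Box3D:
    center: tuple[float, float, float]  # (x, y, z)
    size: tuple[float, float, float]    # (width, height, length)
    yaw: float                          # rotation about the vertical axis

def detect_3d(image, text=None, point=None, box_2d=None) -> list[Box3D]:
    """Prompt with exactly one of:
      text="firetruck"         -> find every matching object
      point=(u, v)             -> the object under a click/tap
      box_2d=(x1, y1, x2, y2)  -> lift an existing 2D detection to 3D
    """
    raise NotImplementedError  # placeholder; see the model release for real usage
```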
Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild.
It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵
Asta AutoDiscovery works by autonomously generating & testing hypotheses on your data, guided by Bayesian surprise to surface the unexpected.
Researchers have already run 35K+ hypotheses across social science, climate science, marine ecology, & more.
🧪 Try it: asta-autodiscovery.allen.ai
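A toy illustration of the Bayesian-surprise idea: score a result by how much it shifts your beliefs, i.e. KL(posterior || prior). This Beta-Bernoulli example is only for intuition; AutoDiscovery's actual scoring is richer:

```python
# Toy Beta-Bernoulli illustration of Bayesian surprise: how much an
# observation moves the posterior away from the prior, measured by
# KL(posterior || prior). For intuition only; not AutoDiscovery's scorer.
from scipy.special import betaln, digamma

def bayesian_surprise(a0: float, b0: float, heads: int, tails: int) -> float:
    """KL(Beta(a0+heads, b0+tails) || Beta(a0, b0)), in nats."""
    a, b = a0 + heads, b0 + tails
    return (betaln(a0, b0) - betaln(a, b)
            + (a - a0) * digamma(a)
            + (b - b0) * digamma(b)
            + (a0 + b0 - a - b) * digamma(a + b))

# Balanced data shifts a uniform prior less than a lopsided run does.
print(bayesian_surprise(1, 1, heads=5, tails=5))    # ~0.56 nats
print(bayesian_surprise(1, 1, heads=10, tails=0))   # ~1.49 nats
```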
"When you have an agent building this tree of surprising results, you can have [an] oncologist wake up in the morning and say, 'Hey, if that's true, that's actually pretty interesting.' The [...] things that potentially lead to changes in treatment for different types of cancer."
📺 buff.ly/px3xfgi
Thrilled to have Ai2’s VP of Engineering Jeremy Tryba on stage at @geekwire.com's Agents of Transformation event last week.
He painted a vivid picture of what agentic AI can do for science, and cancer research in particular. 🧵