With FlexOlmo, we showed modular pretraining could work—including in settings where data can’t easily be pooled in one place. BAR extends that to post-training.
🤗 Models: buff.ly/rkMWLrM
📝 Blog: buff.ly/i2aljLA
📄 Paper: buff.ly/kNDxvko
One big advantage is targeted upgrades. A better code recipe can improve the code expert; a better math recipe can improve the math expert.
That means improving a model after pretraining doesn’t always have to mean starting over.
At the 7B scale, BAR works better than other common ways of updating a model after pretraining.
It beats methods that train separate models and try to combine them later, & also gets close to the results of retraining from scratch.
We train separate “experts” for different skills, then combine them into one model that learns which expert to use.
That makes post-training more modular, so improving one area doesn’t require retraining the entire model.
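For intuition, here is a minimal sketch of that expert-routing idea. The module names and the top-1 gate are illustrative assumptions, not BAR's actual mechanism:

```python
# Illustrative sketch in PyTorch: a learned gate scores each post-trained
# expert and routes the input to the best one. Names and the top-1 routing
# rule are assumptions for illustration, not BAR's actual implementation.
import torch
import torch.nn as nn

class ExpertRouter(nn.Module):
    def __init__(self, hidden_dim: int, experts: nn.ModuleList):
        super().__init__()
        self.experts = experts                       # e.g. math, code, tool-use experts
        self.gate = nn.Linear(hidden_dim, len(experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden). Score experts from a pooled representation,
        # then run the highest-scoring expert for each example.
        scores = self.gate(x.mean(dim=1))            # (batch, num_experts)
        best = scores.argmax(dim=-1)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)
        # Upgrading one skill means swapping in a better expert module;
        # the rest of the model stays untouched.
        return outputs[torch.arange(x.size(0)), best]
```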
When you improve a model’s skills – like math, code, or tool use – those gains often come with extra cost, retraining, or lost capabilities elsewhere.
BAR takes a more modular + flexible approach.
Last year, we introduced FlexOlmo, a novel way to train parts of a model independently, then combine them later.
BAR builds on that idea for a harder problem: how to keep improving a model without having to retrain each time. 🧵
ScienceWorld + DiscoveryWorld are both open & freely available, because we believe building open evals is as important as building open models.
Read more about them in our latest blog: buff.ly/Qwu5DcR
The field is moving fast. The question isn't whether agents will eventually help treat diseases, discover new materials, & more—it's whether we're being clear-eyed about where they are right now.
That's how progress gets made.
DiscoveryWorld goes further. Agents have to design + run full scientific investigations from scratch: form hypotheses, collect data, & analyze results.
Average human scientists with advanced degrees can complete ~70% of its harder challenges. Very strong AI systems manage ~20%.
That gap between knowing and doing is what our benchmarks ScienceWorld & DiscoveryWorld measure.
ScienceWorld tests agents on elementary-school science experiments. When it launched, top models scored below 10%. As of early 2025, they're in the low 80s. Progress—but still not solved.
There's a pattern in AI: models ace the exam, then fail in the lab.
In 2022, models that got As on multiple-choice science tests failed to carry out more than 90% of those same experiments in a virtual environment.
Knowing what a boiling point is and measuring one aren't the same thing.
Everyone's building AI science agents.
The claims are extraordinary. But when we test whether these systems can actually do science, recent top ones still fail challenges that human scientists can solve the majority of the time. 🧵
🔧 The training code, eval harness, annotation tooling, & demo code are now live: buff.ly/3rm5lrV
📄 And our technical report is on arXiv: buff.ly/dstEnog
⚠️ Previously downloaded our @hf.co data? Please redownload—the datasets have been updated.
We're also releasing the client-side code for our MolmoWeb demo—so you can see how we built the interface that lets you give MolmoWeb a task and watch it navigate websites in real time.
Use it as a starting point for your own web agent UI ↓ buff.ly/zMYlbtc
▸ Our eval harness lets you evaluate agents like MolmoWeb on 4 popular navigation benchmarks including WebVoyager & Online-Mind2Web.
▸ The harness also doubles as a synth data gen pipeline—you can generate web browsing data using LLM-/VLM-powered agents w/ AxTree/screenshot input.
🔹 Our training code has everything you need to customize MolmoWeb for specific tasks.
🔹 The new annotation tool lets you record human task demonstrations, then use the training code to fine-tune MolmoWeb on that data.
MolmoWeb is our open web agent built on Molmo 2. It operates a browser by viewing screenshots and taking action – clicking, typing, and scrolling – the same way a person would.
We launched the model in March. Now we're releasing the rest of the components we used to build it.
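For a sense of how that screenshot-in, action-out loop fits together, here is a minimal sketch using Playwright. The action format and the `propose_action` stub are placeholders, not MolmoWeb's actual interface:

```python
# Minimal sketch of a screenshot-driven browser agent loop (Playwright).
# `propose_action` stands in for the model; MolmoWeb's real action space
# and prompting may differ.
from playwright.sync_api import sync_playwright

def propose_action(screenshot_png: bytes, task: str) -> dict:
    """Placeholder: ask a VLM for the next action given the current screenshot."""
    raise NotImplementedError  # e.g. {"type": "click", "x": 120, "y": 300}

def run(task: str, url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            # Observe the page as pixels, then act like a person would.
            action = propose_action(page.screenshot(), task)
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "scroll":
                page.mouse.wheel(0, action["dy"])
            elif action["type"] == "done":
                break
```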
You can now train, adapt, and eval web agents on your own tasks.
We're releasing the full MolmoWeb codebase—the training code, eval harness, annotation tooling, synthetic data pipeline, & client-side code for our demo. 🧵
As always, everything is openly available:
🤖 Models: huggingface.co/collections/...
📊 Code: github.com/allenai/Wild...
🗂️ Data: huggingface.co/datasets/all...
🎮 Demo: huggingface.co/spaces/allen...
📄 Tech report: allenai.org/papers/wildd...
📝 Blog: allenai.org/blog/wilddet3d
And we’re releasing a demo iOS app powered by WildDet3D.
Point your camera at a scene or upload a photo, select a category or draw a 2D box, & get 3D bounding boxes in real time.
Download it from the App Store: apps.apple.com/us/app/wildd...
www.youtube.com/watch?v=LJPN...
We're also releasing WildDet3D-Data, the largest open 3D detection dataset available w/:
◎ Over 1M images
◎ 3.7M verified 3D annotations
◎ 13K+ object categories
◎ 100K+ human-annotated images
Use it to train robust models that understand objects in 3D.
Most 3D detection evals cover only a handful of object types. The real world has thousands. On our newly proposed in-the-wild evaluation spanning 700+ categories, WildDet3D outperforms the strongest baseline by ~10x.
But does WildDet3D work on things it's never seen?
On out-of-distribution autonomous driving scenes & indoor environments that robots/AR systems need to navigate, it nearly doubles the best prior scores.
The biggest gains are on object categories WildDet3D was never trained on.
On widely-used evals, WildDet3D sets a new state of the art while training with a fraction of the compute used by prior methods.
When a depth sensor is available, it folds that data in automatically for even better results—no architecture changes needed.
Type "firetruck" and WildDet3D finds every one in the scene with a full 3D bounding box.
Tap an object and it does the same. Feed it a 2D detection from another model and it adds the missing 3D information.
That means any vision system can gain spatial awareness.
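Conceptually, all three prompt types map to the same call shape. The sketch below is hypothetical, not WildDet3D's actual API:

```python
# Hypothetical interface sketch, not WildDet3D's actual API: one call,
# three ways to prompt it, all returning 3D boxes in camera coordinates.
from dataclasses import dataclass

@dataclass
class Box3D:
    center: tuple[float, float, float]  # (x, y, z)
    size: tuple[float, float, float]    # (width, height, length)
    yaw: float                          # rotation about the vertical axis

def detect_3d(image, text=None, point=None, box_2d=None) -> list[Box3D]:
    """Prompt with exactly one of:
      text="firetruck"         -> find every matching object
      point=(u, v)             -> the object under a click/tap
      box_2d=(x1, y1, x2, y2)  -> lift an existing 2D detection to 3D
    """
    raise NotImplementedError  # placeholder; see the model release for real usage
```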
Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild.
It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵
Asta AutoDiscovery works by autonomously generating & testing hypotheses on your data, guided by Bayesian surprise to surface the unexpected.
Researchers have already run 35K+ hypotheses across social science, climate science, marine ecology, & more.
🧪 Try it: asta-autodiscovery.allen.ai
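A toy illustration of the Bayesian-surprise idea: score a result by how much it shifts your beliefs, i.e. KL(posterior || prior). This Beta-Bernoulli example is only for intuition; AutoDiscovery's actual scoring is richer:

```python
# Toy Beta-Bernoulli illustration of Bayesian surprise: how much an
# observation moves the posterior away from the prior, measured by
# KL(posterior || prior). For intuition only; not AutoDiscovery's scorer.
from scipy.special import betaln, digamma

def bayesian_surprise(a0: float, b0: float, heads: int, tails: int) -> float:
    """KL(Beta(a0+heads, b0+tails) || Beta(a0, b0)), in nats."""
    a, b = a0 + heads, b0 + tails
    return (betaln(a0, b0) - betaln(a, b)
            + (a - a0) * digamma(a)
            + (b - b0) * digamma(b)
            + (a0 + b0 - a - b) * digamma(a + b))

# Balanced data shifts a uniform prior less than a lopsided run does.
print(bayesian_surprise(1, 1, heads=5, tails=5))    # ~0.56 nats
print(bayesian_surprise(1, 1, heads=10, tails=0))   # ~1.49 nats
```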
"When you have an agent building this tree of surprising results, you can have [an] oncologist wake up in the morning and say, 'Hey, if that's true, that's actually pretty interesting.' The [...] things that potentially lead to changes in treatment for different types of cancer."
📺 buff.ly/px3xfgi
Thrilled to have Ai2’s VP of Engineering Jeremy Tryba on stage at @geekwire.com's Agents of Transformation event last week.
He painted a vivid picture of what agentic AI can do for science, and cancer research in particular. 🧵