We’re excited to share the full benchmark soon, with more target programs, results from other models, and larger inference budgets. For more detail on these preliminary results, see our joint blog post with Epoch: epoch.ai/blog/mirrorc...
We initially thought MirrorCode could be a scalable way to generate difficult, long-horizon tasks. Given these preliminary results, however, we’re more pessimistic about creating difficult but fair benchmarks for SWE capabilities on these types of well-specified tasks.
Notably, these tasks are much longer than the time horizons we’ve measured on Time Horizon 1.1 (our standard agentic task suite). This could be because these tasks provide a precise, checkable specification, and/or because AI companies are already training on similar tasks.
We co-developed MirrorCode with @epochai.bsky.social to test AI on extremely long-horizon, black-box software reimplementation tasks. We found that recent public models can fully reimplement at least some programs that we estimate would take humans weeks or months.
You can find details about our measurement methodology and time horizon estimates for other models on our website: metr.org/time-horizons/
We have observed similar discrepancies in previous measurements as well. All the measurements we published over the past year would have been higher had we not penalized reward-hacking attempts. But the discrepancy was especially pronounced for GPT-5.4.
For this reason, we are also reporting our estimate of the model’s time horizon prior to rescoring the reward-hacking attempts. Counting reward hacks as successes yields a point estimate of 13hrs (95% CI of 5hrs to 74hrs).
However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.
In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4’s time horizon.
We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
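To make the rescoring concrete, here’s a minimal sketch of the kind of calculation involved. The runs, the bare-bones logistic fit, and the `fit_horizon` helper are all made up for illustration - this is not our actual pipeline or data. The idea: mark reward-hacked successes as failures before fitting the task-length→success curve, then read off where the fitted curve crosses 50%.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical per-run records: (human task length in minutes, run succeeded, success was a reward hack).
# Entirely made-up data, purely for illustration.
runs = [
    (15, True, False), (15, True, False), (60, True, False), (60, False, False),
    (240, True, True), (240, False, False), (960, True, True), (960, False, False),
]

def fit_horizon(runs, count_hacks_as_success):
    x = np.log2([length for length, _, _ in runs])
    y = np.array([ok and (count_hacks_as_success or not hack)
                  for _, ok, hack in runs], dtype=float)

    def nll(theta):
        # Negative log-likelihood of a logistic success curve in log2(task length).
        p = expit(theta[0] + theta[1] * x)
        return -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    a, b = minimize(nll, x0=[0.0, -1.0]).x
    return 2 ** (-a / b)  # task length (minutes) where the fitted curve crosses 50%

print(f"hacks rescored as failures: {fit_horizon(runs, False):.0f} min")
print(f"hacks counted as successes: {fit_horizon(runs, True):.0f} min")
```

With data like this, where the longest-task successes are all reward hacks, rescoring them as failures pulls the estimated 50% horizon down substantially - the same direction as the 13hrs vs 5.7hrs gap above.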
As our task suite saturates, our results become more sensitive to analysis choices. We’re doing a deep dive into different methodological choices and sensitivity analyses, and expect to share more soon. Below is a sneak peek of what we’re working on. We’re open to requests for analyses to run.
Huge thanks to Brendan Halstead at AI Futures Project and Coz at AI Security Institute, who both independently spotted this and let us know!
We don’t think regularization is never appropriate, just that the particular form we previously applied (penalizing any steepness in the task-length→success probability curves) doesn’t reflect our beliefs, and foreseeably biases the results, so it is better to remove it.
However, we didn’t recheck whether this regularization was impacting our results for new models. For newer models with higher time horizons, the fits are less constrained by the data and therefore more sensitive to regularization. This was a foreseeable mistake, and we apologize.
Regularizing the slope towards zero wasn’t theoretically motivated - it was mainly left in because it sped up model fitting. We had planned to lower its strength further, but given its perceived lack of impact, we decided to publish instead of going through another round of review.
How did this happen?
In our time horizon paper, the procedure we used to fit the task-length→success probability curve regularizes the slope towards zero by default. We decreased the regularization strength 10x from the default, checked that it wasn’t notably impacting the fit, and then used that setup.
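Schematically, the issue looks like the sketch below. The synthetic data, penalty form, and penalty strengths are made up - this is not our actual fitting code. The point is that an L2 penalty on the slope flattens the fitted curve, and when the data only loosely constrain the slope, flattening can noticeably move the task length at which the fit crosses 50%.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Synthetic runs: 8 attempts at each task length, successes drawn from a known logistic curve.
# Made-up data and penalty strengths, purely for illustration.
rng = np.random.default_rng(0)
x = np.log2(np.repeat([4, 16, 64, 256, 1024], 8))  # log2(task length in minutes)
y = (rng.random(x.size) < expit(4.5 - 0.6 * x)).astype(float)

def fit_horizon(penalty):
    def loss(theta):
        p = expit(theta[0] + theta[1] * x)
        nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        return nll + penalty * theta[1] ** 2  # L2 penalty pulling the slope towards zero
    a, b = minimize(loss, x0=[0.0, -1.0]).x
    return 2 ** (-a / b)  # implied 50% time horizon, in minutes

for lam in [0.0, 100.0, 1000.0]:
    print(f"slope penalty {lam:>6}: 50% horizon ~ {fit_horizon(lam):.0f} min")
```

On data like this, where the overall success rate sits above 50%, shrinking the slope towards zero tends to push the implied 50% horizon upward - the direction of the inflation we’re correcting.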
Correcting this decreases Opus 4.6’s 50% TH to 12 hours, and increases the 80% TH to 1.2 hours. This leaves us well inside the original CIs (6hrs-98hrs for the 50% TH, 0.4hrs-2.6hrs for the 80%) - the most important source of uncertainty is still probably the task distribution rather than analysis issues.
We’re correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.
Check out our website for the full announcement and for access to the underlying data: metr.org/blog/2026-02...
We are adjusting the design of our experiment going forward to address these concerns and to better estimate uplift. We expect that our measurement techniques will need to continually evolve as AI capabilities improve.
We believe this selection causes our new data to understate the true speedup. Selection effects are not the only issue we noticed with our experimental design: we also think it has trouble tracking work when participants use agents to parallelize over multiple tasks.
We started a continuation of this study in August 2025. However, we noticed developers were opting not to participate or submit work. Participants said this was mostly due to expected productivity loss on “AI-disallowed” tasks. Lower pay was also a factor ($50/hr, down from $150).
Last year we published findings that AI tools caused a 20% slowdown among experienced open source developers, using data collected from February to June 2025. We still believe that estimate was accurate for the specific tools and population at the time.
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
You can find details about our measurement methodology and time horizon estimates for other models on our website: metr.org/time-horizons/
For this measurement, we used our Triframe scaffold as usual (not Codex). We did a partial measurement with a Codex scaffold, and our results did not seem very different. This is in line with similar comparisons we’ve run in the past.
Initially, scaffolding/format issues hurt performance. After addressing those, we did not observe any signs of the model being confused or directly hampered by the scaffold. However, we were left with the impression that this model’s performance may be more sensitive to the choice of scaffold.
We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.
You can find details about our measurement methodology and time-horizon estimates for other models on our website: metr.org/time-horizons
We are working on updated methods to better track state-of-the-art AI capabilities. However, these are still in development, so they don’t yet address our immediate measurement gap. In the meantime, we advise caution in interpreting and comparing our recent time-horizon measurements.