We’re excited to share the full benchmark soon, with more target programs, results from other models, and larger inference budgets. For more detail on these preliminary results, see our joint blog post with Epoch: epoch.ai/blog/mirrorc...
We initially thought MirrorCode could be a scalable way to generate difficult, long-horizon tasks. Given these preliminary results, however, we’re more pessimistic about creating difficult but fair benchmarks for SWE capabilities on these types of well-specified tasks.
Notably, these tasks are much longer than the time horizons we’ve measured on Time Horizon 1.1 (our standard agentic task suite). This could be because these tasks provide a precise, checkable specification, and/or because AI companies are already training on similar tasks.
We co-developed MirrorCode with @epochai.bsky.social to test AI on extremely long-horizon, black-box software reimplementation tasks. We found that recent public models can fully reimplement at least some programs that we estimate would take humans weeks or months.
You can find details about our measurement methodology and time horizon estimates for other models on our website: metr.org/time-horizons/
We have observed similar discrepancies in previous measurements as well. All the measurements we published over the past year would have been higher had we not penalized reward-hacking attempts. But the discrepancy was especially pronounced for GPT-5.4.
For this reason, we are also reporting our estimate of the model’s time horizon prior to rescoring the reward-hacking attempts. Counting reward hacks as successes yields a point estimate of 13hrs (95% CI of 5hrs to 74hrs).
However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.
In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4’s time horizon.
We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
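To make the rescoring concrete, here’s a minimal sketch of the kind of calculation involved. The runs, the bare-bones logistic fit, and the `fit_horizon` helper are all made up for illustration - this is not our actual pipeline or data. The idea: mark reward-hacked successes as failures before fitting the task-length→success curve, then read off where the fitted curve crosses 50%.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical per-run records: (human task length in minutes, run succeeded, success was a reward hack).
# Entirely made-up data, purely for illustration.
runs = [
    (15, True, False), (15, True, False), (60, True, False), (60, False, False),
    (240, True, True), (240, False, False), (960, True, True), (960, False, False),
]

def fit_horizon(runs, count_hacks_as_success):
    x = np.log2([length for length, _, _ in runs])
    y = np.array([ok and (count_hacks_as_success or not hack)
                  for _, ok, hack in runs], dtype=float)

    def nll(theta):
        # Negative log-likelihood of a logistic success curve in log2(task length).
        p = expit(theta[0] + theta[1] * x)
        return -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    a, b = minimize(nll, x0=[0.0, -1.0]).x
    return 2 ** (-a / b)  # task length (minutes) where the fitted curve crosses 50%

print(f"hacks rescored as failures: {fit_horizon(runs, False):.0f} min")
print(f"hacks counted as successes: {fit_horizon(runs, True):.0f} min")
```

With data like this, where the longest-task successes are all reward hacks, rescoring them as failures pulls the estimated 50% horizon down substantially - the same direction as the 13hrs vs 5.7hrs gap above.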
As our task suite saturates, our results become more sensitive to analysis choices. We’re doing a deep dive into different methodological choices and sensitivity analyses, and expect to share more soon. Below is a sneak peek of what we’re working on. We’re open to requests for analyses to run.
Huge thanks to Brendan Halstead at AI Futures Project and Coz at AI Security Institute, who both independently spotted this and let us know!
We don’t think regularization is never appropriate, just that the particular form we previously applied (penalizing any steepness in the task-length→success probability curves) doesn’t reflect our beliefs, and foreseeably biases the results, so it is better to remove it.
However, we didn’t recheck whether this regularization was impacting our results for new models. For newer models with higher time horizons, the fits are less constrained by the data and therefore more sensitive to regularization. This was a foreseeable mistake, and we apologize.
Regularizing the slope towards zero wasn’t theoretically motivated - it was mainly left in because it sped up model fitting. We had planned to lower its strength further, but given its perceived lack of impact, we decided to publish instead of going through another round of review.
How did this happen?
In our time horizon paper, the procedure we used to fit the task-length→success probability curve regularizes the slope towards zero by default. We decreased the regularization strength 10x from the default, checked that it wasn’t notably impacting the fit, and then used that setup.
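Schematically, the issue looks like the sketch below. The synthetic data, penalty form, and penalty strengths are made up - this is not our actual fitting code. The point is that an L2 penalty on the slope flattens the fitted curve, and when the data only loosely constrain the slope, flattening can noticeably move the task length at which the fit crosses 50%.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Synthetic runs: 8 attempts at each task length, successes drawn from a known logistic curve.
# Made-up data and penalty strengths, purely for illustration.
rng = np.random.default_rng(0)
x = np.log2(np.repeat([4, 16, 64, 256, 1024], 8))  # log2(task length in minutes)
y = (rng.random(x.size) < expit(4.5 - 0.6 * x)).astype(float)

def fit_horizon(penalty):
    def loss(theta):
        p = expit(theta[0] + theta[1] * x)
        nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        return nll + penalty * theta[1] ** 2  # L2 penalty pulling the slope towards zero
    a, b = minimize(loss, x0=[0.0, -1.0]).x
    return 2 ** (-a / b)  # implied 50% time horizon, in minutes

for lam in [0.0, 100.0, 1000.0]:
    print(f"slope penalty {lam:>6}: 50% horizon ~ {fit_horizon(lam):.0f} min")
```

On data like this, where the overall success rate sits above 50%, shrinking the slope towards zero tends to push the implied 50% horizon upward - the direction of the inflation we’re correcting.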
Correcting this decreases Opus 4.6’s 50% TH to 12 hours, and increases the 80% TH to 1.2 hours. This leaves us well inside the original CIs (6hrs-98hrs for the 50% TH, 0.4hrs-2.6hrs for the 80%) - the most important source of uncertainty is still probably the task distribution rather than analysis issues.
We’re correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.
Check out our website for the full announcement and for access to the underlying data: metr.org/blog/2026-02...
We are adjusting the design of our experiment going forward to address these concerns and to better estimate uplift. We expect that our measurement techniques will need to continually evolve as AI capabilities improve.
We believe this selection causes our new data to understate the true speedup. Selection effects are not the only issue we noticed with our experimental design: we also think it has trouble tracking work when participants use agents to parallelize over multiple tasks.
We started a continuation of this study in August 2025. However, we noticed developers were opting not to participate or submit work. Participants said this was mostly due to expected productivity loss on “AI-disallowed” tasks. Lower pay was also a factor ($50/hr, down from $150).
Last year we published findings that AI tools caused a 20% slowdown among experienced open source developers, using data collected from February to June 2025. We still believe that estimate was accurate for the specific tools and population at the time.
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
You can find details about our measurement methodology and time horizon estimates for other models on our website: metr.org/time-horizons/
For this measurement, we used our Triframe scaffold as usual (not Codex). We did a partial measurement with a Codex scaffold, and our results did not seem very different. This is in line with similar comparisons we’ve run in the past.
Initially, scaffolding/format issues hurt performance. After addressing those, we did not observe any signs of the model being confused or directly hampered by the scaffold. However, we were left with the impression that this model’s performance may be more sensitive to the choice of scaffold.
We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.
You can find details about our measurement methodology and time-horizon estimates for other models on our website: metr.org/time-horizons
We are working on updated methods to better track state-of-the-art AI capabilities. However, these are still in development, so they don’t yet address our immediate measurement gap. In the meantime, we advise caution in interpreting and comparing our recent time-horizon measurements.