Apache 2.0. Works with OpenAI, Anthropic, or anything through LiteLLM.
We shipped three demos:
1. Benchmark Scout: extract benchmark rows from papers, flag questionable comparisons, then re-run only the unclear cases
2. HF Entity Graph: extract entities, detect ambiguity, then resolve it
3. Benchmark Report: full topology and throughput benchmarks with charts
At the leaves, workers can be anything:
- LLM-driven agent tasks
- deterministic reducers
- local executors
- your own agent via BYOA (bring your own agent)
You can plug in your own agent process without touching orchestrators.
You can mix agent work and deterministic work in the same job.
Failures stay localized to the nodes that failed.
Second-pass reasoning only runs on ambiguous cases instead of everything.
This lets you keep most of the system deterministic and only spend agent cycles where they are actually needed.
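A minimal sketch of that pattern (all names here are illustrative, not Epsilon's API): a cheap deterministic first pass tags each item with a confidence, and only low-confidence items get the expensive agent-style second pass.

```python
# Illustrative sketch only -- not Epsilon's actual API.

def deterministic_pass(item: str) -> tuple[str, float]:
    """Cheap rule-based extraction returning (value, confidence)."""
    value = item.strip().lower()
    confidence = 0.9 if value.isalpha() else 0.3  # toy heuristic
    return value, confidence

def agent_pass(item: str) -> str:
    """Stand-in for an expensive LLM call; only runs on unclear cases."""
    return item.strip().lower()

def process(items: list[str], threshold: float = 0.5) -> list[str]:
    results = []
    for item in items:
        value, conf = deterministic_pass(item)
        if conf < threshold:  # ambiguous: spend agent cycles here only
            value = agent_pass(item)
        results.append(value)
    return results
```

Everything above the threshold never touches an agent, which is where the cost savings come from.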
Seven orchestration topologies:
dag: parallel build with QA and fix loops
tree: hierarchical decomposition with branching
pipeline: staged delivery
supervisor: adaptive retries and task splitting
work_queue: flat pull-based workers
sharded_queue: large independent item sets
map_reduce: fan-out and aggregate
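As a rough sketch of what the dag topology implies (hypothetical task names, not Epsilon's config format): tasks declare their dependencies, and the scheduler only releases a task once everything it depends on is done.

```python
from graphlib import TopologicalSorter

# Hypothetical dag: two builds fan out in parallel, QA waits on both,
# and a fix step waits on QA -- not Epsilon's real schema.
deps = {
    "build_a": set(),
    "build_b": set(),
    "qa":      {"build_a", "build_b"},
    "fix":     {"qa"},
}

def schedule(deps: dict[str, set[str]]) -> list[str]:
    """Return tasks in an order that respects every dependency edge."""
    return list(TopologicalSorter(deps).static_order())
```

The same dependency map also tells you which failures can stay localized: a failed `build_a` blocks `qa` and `fix`, but `build_b` still runs.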
Epsilon splits coordination, execution, and transport.
Orchestrators define task structure and dependencies.
Workers execute tasks as independent processes.
A ØMQ broker handles queueing, leases, heartbeats, and routing.
Intermediate state is written to a shared workspace on disk.
Epsilon treats a workload as a graph of tasks with explicit state, retries, and ownership. Orchestrators handle decomposition and coordination. Workers execute tasks. The runtime handles scheduling, routing, and recovery.
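One way to picture "explicit state, retries, and ownership" (again purely illustrative, not the real runtime): each task node carries a state, and the runtime retries a failing task up to a budget before marking that node failed, leaving the rest of the graph untouched.

```python
# Illustrative sketch only -- not Epsilon's actual runtime.
from enum import Enum

class State(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"

def run_task(fn, max_retries: int = 3):
    """Run one task with retries; failure stays localized to this node."""
    for _ in range(max_retries):
        try:
            return State.DONE, fn()
        except Exception:
            continue  # transient failure: retry within the budget
    return State.FAILED, None
```

A node that exhausts its budget ends in FAILED, and only its downstream dependents are affected.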
We use it internally for things like:
- multi-agent software builds with QA/fix loops
- extracting structured data across large paper corpora
- manifest-backed jobs with hundreds of tasks
- pipelines where only some cases need a second pass
We just open-sourced Epsilon: a runtime for structured agent workloads. Thread on what it is and why we built it. 🧵
Find it here:
github.com/AlgorithmicR...
We're effectively going open source, repo by repo, dataset by dataset 🤖
A tiny architecture ranking model that got 8-10x sample efficiency over random search in NAS and transferred zero-shot across datasets
algorithmicresearchgroup.com/projects/lea...
ArXivDLInstruct - 778K functions from research code paired with instruction prompts for fine-tuning
huggingface.co/datasets/Alg...
ArXiv Research Code Dataset - 4.7M code files from 129K research repos linked to arXiv CS papers
huggingface.co/datasets/Alg...
ARIA Benchmark - 5 closed-book benchmarks testing how much ML knowledge frontier models have actually internalized
github.com/AlgorithmicR...
Follow-up was DeltaMLBench: 50 tasks from real Papers With Code repos where the goal is to beat the published baseline, not just reproduce it. GPT-5 with our agent scaffold improved on 29 of 48 tasks, some by a lot. Under review at ICML 2026.
github.com/AlgorithmicR...
Two years ago we released the ML Research Benchmark (MLRB), 7 competition-level ML challenges from NeurIPS/ICML/CoNLL. Gave frontier agents an A100, 24 hours, no starter code.
Main finding: agents could build working pipelines but couldn't do real research iteration.
There's rightly a lot of excitement around Karpathy's autoresearch. We've been studying this at ARG for a couple of years now: what happens when you put an AI agent in a loop and let it run experiments, evaluate results, and iterate without you.
We've built a bunch of benchmarks and tools to measure this. 🧵
hello world!