oh lol ppl have been submitting wout reviewing forever, TIL it was boycotting all along
Posts by Kyle Lo
kinda out of the loop, ppl are submitting to neurips but not reviewing?
thanks for the support!
thanks maria! glad i got to share a fun office and collaborate during s2 days! appreciate that we can chat abt both difficult research problems and peak-taste tv shows w u. will be in touch!!
Today I'm saying farewell to @ai2.bsky.social.
I'm so proud of our team & grateful to have shared fully-open Olmo, Dolma, olmOCR, Molmo, etc with the world
I know the team is more committed than ever to advancing open-source & open-science. Forever rooting for my dear friends 🫶
cs peer review atm feels like im in a user study that forgot to get irb review
lololol I subscribe to the @mariaa.bsky.social school of cozy figures
for figs/diagrams, ive found nano banana generates images a bit too cringe-tech for me; have had some success w committing to doing all images in matplotlib code, one script per fig
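a minimal sketch of the one-script-per-figure pattern described above; the data, labels, and output filename are all made up for illustration:

```python
# One standalone matplotlib script per figure: each figure can be
# regenerated (or handed to an LM to edit) in isolation.
# All names and data here are hypothetical.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

def main() -> str:
    tokens = [50, 100, 150, 200]      # hypothetical x-axis (B tokens)
    mmlu = [0.42, 0.51, 0.56, 0.58]   # hypothetical scores

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(tokens, mmlu, marker="o")
    ax.set_xlabel("training tokens (B)")
    ax.set_ylabel("MMLU")
    ax.set_title("hypothetical training curve")
    fig.tight_layout()

    out = "fig_training_curve.pdf"  # one script -> one output file
    fig.savefig(out)
    plt.close(fig)
    return out

if __name__ == "__main__":
    print(main())
```

keeping each figure fully self-contained like this also makes it easy to regenerate just one figure when a reviewer asks for a tweak.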
nice post! will need to check out reveal. some of my colleagues and i have a similar workflow using markdown instead of html, but the idea of some structured doc that is in-distribution for LMs seems like the right path
big congrats to @lambdaviking.bsky.social for leading this project & core contributors Yanghong Li
@tylerromero.bsky.social
@anejsvete.bsky.social
Caia Costello
blog: allenai.org/blog/olmohyb...
paper: allenai.org/papers/olmo-...
hf collection: huggingface.co/collections/...
our new Olmo Hybrid model combines attention with linear RNN layers
📣 training efficiency is crazy good. the model reaches the same MMLU score as Olmo 3 in 50% of the tokens. we see this on many other tasks too
as always: weights, data, ckpts, training code, etc. all fully open
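a toy numpy sketch of the general hybrid idea, interleaving attention layers with linear-RNN layers. this is NOT the Olmo Hybrid architecture; the shapes, decay gating, and 3:1 layer pattern here are illustrative assumptions only:

```python
# Toy hybrid stack: mostly linear-RNN layers (O(T) time, O(d) state per
# step) with periodic softmax-attention layers (O(T^2)).
# Everything here is a simplified assumption for illustration.
import numpy as np

def attention_layer(x: np.ndarray) -> np.ndarray:
    # single-head causal self-attention; x: (T, d)
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def linear_rnn_layer(x: np.ndarray, decay: float = 0.9) -> np.ndarray:
    # elementwise linear recurrence h_t = decay * h_{t-1} + x_t:
    # constant memory per step, no attention matrix
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def hybrid_forward(x, pattern=("rnn", "rnn", "rnn", "attn")):
    # assumed mostly-RNN stack with an occasional attention layer,
    # each applied with a residual connection
    for kind in pattern:
        x = x + (attention_layer(x) if kind == "attn" else linear_rnn_layer(x))
    return x
```

the efficiency story comes from the recurrence: most layers avoid the quadratic attention matrix entirely, while the occasional attention layer retains precise token-to-token lookup.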
DrawEduMath is our benchmark testing VLM understanding of K-12 student math work, which is prerequisite for their use in educational contexts
one year later, VLMs are strong math solvers, but they still underperform on our bench, esp for the students who need the most help
this work was led by our intern Mayee Chen and was one of the new ideas we adopted into Olmo 3!
blog post: allenai.org/blog/olmix
arxiv paper: arxiv.org/abs/2602.12237
one of my favorite topics is dealing with data constraints!
what if your proposed mix is 30% code but you don't have enough code? we can repeat our data until we hit target proportions, but too much is risky
we view data mixing as (data) constrained optimization
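a toy sketch of the "data mixing as constrained optimization" framing: pick mixture weights to maximize an (assumed linear) predicted-performance model, subject to availability constraints, since you can't sample a domain beyond some max number of repetitions of its available tokens. all numbers are made up; this is not the Olmix formulation:

```python
# Toy LP version of constrained data mixing. The per-domain "predicted
# gain" vector stands in for a fitted regression; real formulations are
# richer. Domain names and all numbers below are hypothetical.
from scipy.optimize import linprog

def solve_mix(pred_gain, avail_tokens, total_tokens, max_epochs=2.0):
    n = len(pred_gain)
    # linprog minimizes, so negate the predicted gains
    c = [-g for g in pred_gain]
    # availability: w_i * total_tokens <= max_epochs * avail_tokens_i
    A_ub = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    b_ub = [max_epochs * a / total_tokens for a in avail_tokens]
    # mixture weights form a distribution
    A_eq, b_eq = [[1.0] * n], [1.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n)
    return res.x

# e.g. code is the best domain per token, but only 20B tokens exist for a
# 100B budget: at 2 epochs max, code is capped at 40% even though it's
# preferred, and the rest spills into the next-best domain
weights = solve_mix(pred_gain=[1.0, 0.6, 0.4],   # code, web, math (made up)
                    avail_tokens=[20e9, 500e9, 50e9],
                    total_tokens=100e9)
```

this is exactly the "30% code but not enough code" situation: the optimizer wants more code than exists, and the repetition cap forces the mix to rebalance.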
our paper on data mixing for LMs is out!
while building Olmo 3, we saw gaps between data mixing literature and real practice
- choosing proxy size, # runs, sampling, regression, constraints...
- data shifts during LM dev: can we reuse past experiments?
Olmix tackles them all!
literally all the time 😮‍💨 this was yesterday
learning how to do something is a first-order use case for LMs; the development bottleneck has been collecting data covering a wide diversity of topics, until now
incredibly fun project led by our intern yapei chang
we mined the web for thousands of real-world "how to do X" step-by-step instructions and turned them into a dataset, synth data training procedure, eval suite, etc.
lol rip 😮‍💨
It's like a score calculated against gold reference citations in the generated lit review, so even humans don't score high. i think the eval is saturated cuz there's so much subjectivity in what counts as an appropriate citation. better phrasing is maybe that the citations are sensible up to some X
they're separate, poorly named systems lol. Separate projects approaching the same problem from different angles: Scholar QA approaches from agentic system design, use whatever model; OpenScholar approaches from model-first, very light on system. The teams are working together to fuse ideas
our open model proving out specialized RAG LMs over scientific literature has been published in Nature
congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers
www.nature.com/articles/s41...
0 days since last mixup of eval results between "copa" (choice of plausible alternatives) & "coqa" (conversational QA) tasks
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
- LMs as evaluators
- Living benchmarks
- Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
mm yea i think that's always the case w productivity tools.
imo ability to adopt new tools is a core part of the job. just like the transitions from plain text editors to IDEs, from sending files via FTP to using git for collab, from ad hoc Makefiles to package managers, etc. AI is just the latest thing
my concern is the growing pool of "unknown unknowns" as i interact less with code directly.
imo that's probably why i've subconsciously been leaning toward cursor over claude code or similar agents, even if the latter has a higher code-to-keystrokes ratio
i dont feel worse at this even if im not writing papers from scratch as much as during my early career
but coding feels different due to the mismatch between what i express to the system (english) and what the system returns (code). i've already noticed some gaps in libraries I used to know well.
whether my ability to review code will degrade as I offload increasingly larger workloads to AI
of course, this shift is present in other forms of generation, like paper writing, where my role has shifted to reviewing/editing (student's) drafts.
some thoughts about skill degradation w/ AI coding
im on board w the views that "english is the new programming language" & that "software engineering", translating ambiguous goals into technical specs/execution, is still a skill.
im more concerned w shift from my role as a writer to a reviewer and
lucky to chat w sen. patty murray about olmo & importance of fully open AI
using opus to extract research topics from papers & it was giving me useless words like "training", "datasets", and "evaluation"
kept prompting it w examples of more informative topics and it ended up with "LLM training", "LLM datasets", and "LLM evaluation"
thx