in collaboration with the tremendous research team at FAIR: @karen-ullrich.bsky.social, Jingtong Su, @arjunsubgraph.bsky.social, @claudiashi.bsky.social, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, and @kempelab.bsky.social
Posts by Mark Ibrahim
Explore how the latest models, like GPT-5, can be used as digital agents to complete tasks on your behalf:
📒Docs: facebookresearch.github.io/OpenApps/
📃Paper: arxiv.org/abs/2511.20766
🎬Video Tutorial: www.youtube.com/watch?v=gzNW...
✅ Unlimited data (for evaluating and training agents): generate thousands of versions of each app
✅ Lightweight: runs on a single CPU; no Docker or OS emulators needed
✅ Ground truth rewards: task rewards are based on the underlying state and all app logic is transparent in Python.
Want to teach AI agents to use apps like humans? Get started with digital agents research using OpenApps, our new Python-based environment.
✅ 22k multi-scene questions
✅ New scenes not in existing web data
✅ Runs in ~15 min on one GPU
Work led by Candace Ross in collaboration with @afeinstein20.bsky.social , Florian Bordes, and @polkirichenko.bsky.social
Check it out on HuggingFace, ArXiv & NeurIPS! huggingface.co/datasets/fac...
While leading models saturate single-image perception, Common-O establishes a challenging new multimodal benchmark. The best-performing model achieves only 35% on Common-O, and only 1% on Common-O Complex, which consists of more complex scenes.
🧵2/3
We introduce Common-O, a new multimodal benchmark for hallucination when reasoning across scenes.
We find leading multimodal LLMs can reliably identify objects, yet hallucinate when reasoning across scenes.
🧵1/3
If you’re an NYU student, come learn about this wonderful opportunity to collaborate with us at FAIR events.atmeta.com/metanyuaimen... Panel is tomorrow 10am at NYU Center for Data Science.
We explain how good delimiters steer attention heads to key input tokens, and offer practical recommendations for prompts and delimiter choices to get the best performance from your LLM. tl;dr: use “!” or “\n”.
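To make the finding concrete, here is a minimal, hypothetical sketch (the prompt format and function name are illustrative, not from the paper) of the single character being varied: the delimiter joining few-shot demonstration examples in the prompt.

```python
def build_prompt(demos, query, delim="\n"):
    """Join few-shot demonstrations with a single delimiter character.

    The paper reports benchmark scores can swing substantially with this
    one-character choice; "!" and "\n" were among the better performers.
    """
    parts = [f"Q: {q} A: {a}" for q, a in demos]
    parts.append(f"Q: {query} A:")
    return delim.join(parts)

# Same demonstrations, two different delimiters:
demos = [("2+2?", "4"), ("3+3?", "6")]
print(build_prompt(demos, "4+4?", delim="\n"))
print(build_prompt(demos, "4+4?", delim="!"))
```

Swapping `delim` is the entire manipulation: the demonstrations and query are untouched, yet measured accuracy (and hence model rankings) can shift.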
- MMLU performance can vary by +/- 23% depending on the choice of delimiter across leading open model families (Llama, Qwen, and Gemma).
- Closed models, such as GPT-4o, are also brittle to the choice of delimiter.
🧵
One can manipulate LLM rankings to put any model in the lead, merely by modifying the single character separating demonstration examples. Learn more in our new paper arxiv.org/abs/2510.05152
w/ Jingtong Su, Jianyu Zhang, @karen-ullrich.bsky.social , and Léon Bottou.
🧵
Open weights for our Llip multimodal vision-language model, led by @lavoiems.bsky.social, are public!
Llip proposes a new pre-training objective to capture the many ways to describe an image, leading to strong performance across a suite of 22 zero-shot benchmarks.
bsky.app/profile/lavo...
We also find better models are not necessarily better at abstention, suggesting the skill of abstention is an open research question.
w/ @polkirichenko.bsky.social, Sam Bell, and Kamalika Chaudhuri
Paper: arxiv.org/abs/2506.09038
Code: github.com/facebookrese...
bsky.app/profile/polk...
🧵2/2
A good language model should say “I don’t know” by reasoning about the limits of its knowledge. Our new work AbstentionBench carefully measures this overlooked skill in an open codebase others can build on!
We find frontier reasoning degrades models’ ability to know when NOT to answer.
🧵1/2
Join us as a PhD research intern at FAIR w/ @polkirichenko.bsky.social and Kamalika Chaudhuri
to start this summer or fall, focusing on open science in multimodal models, agents, and beyond! Email polkirichenko@meta.com with the subject [Prospective Intern 2025] and attach your CV if interested!
We found MLM-U-trained transformers can even outperform transformers trained with additional supervision from A* search traces, showing the promise of alternative learning objectives.
Learn more on our site and code at facebookresearch.github.io/maze_navigat...
Recently, we also applied the same MLM-U objective to maze navigation. We find when training parameter-matched transformers on identical data, MLM-U without any tweaks outperforms standard next token training across all maze grid sizes (up to 30x30).
We find MLM-U training improves knowledge retrieval on Wikipedia-based questions and even outperforms a pretrained 7B Mistral model with a much smaller 100M parameter transformer trained from scratch!
Come by our NeurIPS poster Exhibit Halls A-C #3204 11am PST Thursday to learn more.
We show that training with a factorization-agnostic objective, MLM-U (a variable-ratio BERT-style loss with links to discrete diffusion) that predicts multiple tokens ahead and back, can significantly mitigate the reversal curse!
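As a rough illustration (not the authors' implementation; the mask id and the -100 ignore-index convention are assumptions borrowed from common BERT-style training code), the variable-ratio masking step might look like:

```python
import random

MASK_ID = 0  # hypothetical mask token id

def mlm_u_mask(tokens, rng=random):
    """Variable-ratio masking: instead of BERT's fixed ~15% mask rate,
    sample the mask ratio uniformly per sequence, so the model learns to
    predict anywhere from a single token to nearly the whole sequence,
    in both directions, regardless of factorization order."""
    ratio = rng.random()  # mask ratio drawn uniformly from (0, 1)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < ratio:
            masked.append(MASK_ID)
            targets.append(tok)   # predict the original token here
        else:
            masked.append(tok)
            targets.append(-100)  # position ignored by the loss
    return masked, targets
```

Because every token can be a prediction target given any subset of the others, the model is not tied to one left-to-right factorization, which is the failure mode behind the reversal curse.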
Problem: Language models struggle with the “reversal curse”: an inability to answer reformulations of a question. We show this stems from the standard next-token learning objective, in what we call “the factorization curse.”
Can we boost transformers’ ability to retrieve knowledge and plan in maze navigation by only tweaking the learning objective?
We emphatically say YES in our #NeurIPS 2024 study! 🧵
w/ Ouail Kitouni, Niklas Nolte, Diane Bouchacourt, Adina Williams, and Mike Rabbat
Paper arxiv.org/abs/2406.05183