🎉 Presenting at #ICML2025 tomorrow!
Come and explore how representational similarities behave across datasets :)
📅 Thu Jul 17, 11 AM-1:30 PM PDT
📍 East Exhibition Hall A-B #E-2510
Huge thanks to @lorenzlinhardt.bsky.social, Marco Morik, Jonas Dippel, Simon Kornblith, and @lukasmut.bsky.social!
📝 Paper: arxiv.org/abs/2503.05683
🗂️ Dataset: huggingface.co/datasets/luk...
💻 Code: github.com/ExplainableM...
🚨 Poster at #ICML2025!
How can LLMs really keep up with the world?
Come by E-2405 on July 15th (4:30–7:00pm) to check out WikiBigEdit – our new benchmark to test lifelong knowledge editing in LLMs at scale.
🔗 Real-world updates
📈 500k+ QA edits
🧠 Editing vs. RAG vs. CL
🧠🤖 We’re hiring a Postdoc in NeuroAI!
Join CRC1233 "Robust Vision" (Uni Tübingen) to build benchmarks & evaluation methods for vision models, bridging brain & AI. Work with top faculty & shape vision research.
Apply: tinyurl.com/3jtb4an6
#NeuroAI #Jobs
📢 Landed in Nashville🎺 for #CVPR2025! The EML group is presenting 4 exciting papers — come say hi at our poster sessions! More details in the thread — see you there! 🏁🌟
🚨 Happy to announce that our paper "Understanding the Limits of Lifelong Knowledge Editing in LLMs" has been accepted at #icml2025! Congrats to @lukasthede.bsky.social, @confusezius.bsky.social, Matthias Bethge, @zeynepakata.bsky.social, and @tomhartvigsen.bsky.social. 👇 Highlights in the thread
🎓PhD Spotlight: Jae Myung Kim
We’re thrilled to celebrate Jae Myung Kim, who will defend his PhD on 25th June! 🎉
Jae Myung began his PhD at @unituebingen.bsky.social as part of the ELLIS & IMPRS-IS programs, advised by @zeynepakata.bsky.social and collaborating closely with Cordelia Schmid.
We’ve landed in Singapore for #ICLR2025!
The EML group is presenting 4 exciting papers — come say hi at our poster sessions! 👇Let's chat!
More details in the thread — see you there! 🌟
10/
This project was a joint effort with amazing collaborators:
👥 @confusezius.bsky.social , Matthias Bethge, @zeynepakata.bsky.social , and @tomhartvigsen.bsky.social
Huge thanks to them for the ideas, feedback, and countless hours that made this work possible. 🙏
9/
📘 Want to test your method at scale?
📄 Paper: arxiv.org/abs/2503.05683
🗂️ Benchmark: huggingface.co/datasets/luk...
💻 Code: github.com/ExplainableM...
Let’s build LLMs that truly stay up to date. 🔄
Excited to see what the community does with this!
8/
🔍 TL;DR:
✅ We release WikiBigEdit - a new large-scale benchmark for real-world factual updates
🚨 Existing editing methods fail to scale
💡 Finetuning + merging is a surprisingly strong baseline
🧩 RAG wins - but with trade-offs
7/
Surprisingly, simple continual finetuning (LoRA) outperforms all editing baselines - at equal inference cost.
And when paired with model merging, performance improves even further over time.
💪 More scalable, more robust, and better retention across time steps.
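For readers who want to see the shape of this baseline: a minimal sketch of continual LoRA finetuning with uniform weight merging across time steps. The model name, LoRA hyperparameters, and the simple uniform average are illustrative assumptions, not the paper's exact recipe.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM could stand in here
tokenizer = AutoTokenizer.from_pretrained(base_name)

# One (question, answer) batch per time step; fill from the benchmark's time-stamped splits.
time_step_batches: list[list[tuple[str, str]]] = []

def finetune_on_batch(model, qa_batch, lr=1e-4, epochs=1):
    """One continual step: LoRA-finetune on the newest factual edits, then fold the adapter back in."""
    peft_model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
    optim = torch.optim.AdamW(peft_model.parameters(), lr=lr)
    for _ in range(epochs):
        for q, a in qa_batch:
            inputs = tokenizer(f"{q} {a}", return_tensors="pt")
            loss = peft_model(**inputs, labels=inputs["input_ids"]).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return peft_model.merge_and_unload()  # fold LoRA weights into the base model

def uniform_merge(state_dicts):
    """Model merging: average parameters across all time steps seen so far."""
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in state_dicts[0]}

model = AutoModelForCausalLM.from_pretrained(base_name)
step_states = []
for qa_batch in time_step_batches:
    model = finetune_on_batch(model, qa_batch)
    step_states.append({k: v.clone() for k, v in model.state_dict().items()})
    model.load_state_dict(uniform_merge(step_states))  # deploy the merged model at each step
```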
6/
RAG performs best overall - nearly tripling accuracy on edit and generalization tasks.
But:
⏳ It comes with significantly higher inference cost
🔄 And still struggles with multi-hop reasoning over updated facts
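To make the trade-off concrete, here is a minimal sketch of a RAG baseline of this kind: index the factual updates, retrieve the top-k at query time, and prepend them to the prompt. The encoder, fact store, and example fact are assumptions for illustration; the retrieval step plus the longer prompt is where the extra inference cost comes from.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any text encoder works

fact_store: list[str] = []           # accumulated factual updates across time steps
fact_embs: list[np.ndarray] = []

def add_facts(facts: list[str]) -> None:
    """Index new factual updates as they arrive (no weight updates needed)."""
    fact_store.extend(facts)
    fact_embs.extend(embedder.encode(facts))

def build_prompt(question: str, k: int = 3) -> str:
    """Retrieve the k most similar facts and prepend them to the question.
    The retrieval call and the longer prompt are the sources of extra inference cost."""
    scores = np.array(fact_embs) @ embedder.encode(question)
    context = [fact_store[i] for i in np.argsort(-scores)[:k]]
    return "Facts:\n" + "\n".join(context) + f"\nQuestion: {question}\nAnswer:"

add_facts(["The CEO of Acme Corp is Jane Doe (as of 2024)."])  # hypothetical update
print(build_prompt("Who is the CEO of Acme Corp?"))
```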
5/
The result? 📉
Most editing methods struggle at scale.
ROME and MEMIT collapse within a few hundred updates.
Even WISE, built for lifelong edits, degrades quickly - converging to pre-edit performance.
➡️ These techniques aren’t yet ready for real-world demands.
4/
We put popular editing methods to the test:
🔧 ROME, MEMIT, WISE
🔁 LoRA finetuning & merging
🔍 Retrieval-augmented generation (RAG)
How do they stack up on update accuracy, reasoning, generalization, and locality?
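For intuition, a minimal sketch of a lifelong-editing evaluation loop of this kind: apply each time step's edits sequentially (no reset between steps) and score edit accuracy, generalization, and locality after every step. The field names and exact-match scoring are illustrative assumptions; plug in your own apply_edits (ROME, MEMIT, WISE, LoRA, ...) and answer functions.
```python
from typing import Callable

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate_lifelong(model,
                      time_steps,                # list of edit batches, one per time step
                      apply_edits: Callable,     # (model, batch) -> edited model
                      answer: Callable):         # (model, question) -> str
    history = []
    for t, batch in enumerate(time_steps):
        model = apply_edits(model, batch)        # sequential edits, never reset
        metrics = {
            # did the new facts stick?
            "edit_acc": sum(exact_match(answer(model, ex["question"]), ex["answer"])
                            for ex in batch) / len(batch),
            # do rephrased questions still work? (hypothetical field name)
            "generalization": sum(exact_match(answer(model, ex["rephrased"]), ex["answer"])
                                  for ex in batch) / len(batch),
            # are unrelated facts left untouched? (hypothetical field names)
            "locality": sum(exact_match(answer(model, ex["unrelated_q"]), ex["unrelated_a"])
                            for ex in batch) / len(batch),
        }
        history.append((t, metrics))
    return history
```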
3/
Unlike synthetic edit datasets, WikiBigEdit tracks real-world knowledge changes over time.
It probes multi-hop reasoning, semantic generalization, and whether new edits interfere with existing knowledge.
And it’s built to continuously grow - for future-proof evaluation.
2/
📣 Introducing WikiBigEdit: a new benchmark for lifelong knowledge editing.
It includes:
📌 500K+ real-world QA pairs based on Wikidata
📆 8 time steps over 6 months (Feb–Jul 2024) and continuously updatable
🧪 Rich evaluations: reasoning, generalization, locality, …
1/
Most LLMs are static snapshots of past knowledge.
But facts change constantly - and retraining is far too costly.
Knowledge editing offers a cheaper fix.
But how far can it actually take us?
We put it to the test at realistic deployment scale.
🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.
But what would it actually take to support this in practice at the scale and speed the real world demands?
We explore this question and really push the limits of lifelong knowledge editing in the wild.
👇
Happy to share that we have 4 papers to be presented at the upcoming #ICLR2025 in the beautiful city of #Singapore. Check out our website for more details: eml-munich.de/publications. We will introduce the talented authors and their papers very soon, stay tuned 😉
🚨 New paper alert! 🚨
We’ve just launched openretina, an open-source framework for collaborative retina modeling across datasets and species.
A 🧵👇 (1/9)
CuratedThoughts: Data Curation for RL Datasets 🚀
Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed during data curation.
Here's why 👇🧵
🔥 #CVPR2025 Submit your cool papers to Workshop on
Emergent Visual Abilities and Limits of Foundation Models 📷📷🧠🚀✨
sites.google.com/view/eval-fo...
Submission Deadline: March 12th!
Hiring announcement: ELLIS Institute Tübingen is looking for ML Researchers & Engineers for Open-Source AI Tutoring (m/f/d).
🚀 We’re hiring! Join Bernhard Schölkopf & me at @ellisinsttue.bsky.social to push the frontier of #AI in education!
We’re building cutting-edge, open-source AI tutoring models for high-quality, adaptive learning for all pupils with support from the Hector Foundation.
👉 forms.gle/sxvXbJhZSccr...
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models 🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
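As a rough illustration of what "correlated failures" means, here is a simplified stand-in for a model-similarity measure: Cohen's kappa over per-sample correctness of two models on a shared benchmark. This is not the paper's metric, just a sketch of the idea.
```python
from sklearn.metrics import cohen_kappa_score

def error_similarity(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Agreement of right/wrong patterns beyond what accuracy alone would predict.
    1.0 = identical failure patterns, 0.0 = roughly independent errors."""
    return cohen_kappa_score(correct_a, correct_b)

# Hypothetical per-sample correctness of two models on the same benchmark:
model_a = [True, True, False, False, True, False]
model_b = [True, True, False, True,  True, False]
print(error_similarity(model_a, model_b))
```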
Our EML team has 4 #ICLR25 Papers accepted! I am proud of my students and grateful to be a part of many successful collaborations. More details will appear on our website (www.eml-munich.de) but here are the snapshots.
📄 New Paper: "How to Merge Your Multimodal Models Over Time?"
arxiv.org/abs/2412.06712
Model merging assumes all finetuned models are available at once. But what if they need to be created over time?
We study Temporal Model Merging through the TIME framework to find out!
🧵
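As a taste of the setting, a minimal sketch of one temporal-merging strategy: keep a running weight average of finetuned checkpoints as they arrive over time. This is illustrative only; the TIME framework compares a range of initialization, deployment, and merging choices.
```python
def running_merge(avg_state: dict, new_state: dict, t: int) -> dict:
    """After the t-th checkpoint (t starting at 1), update the uniform running average of weights."""
    return {k: avg_state[k] + (new_state[k] - avg_state[k]) / t for k in avg_state}

# Usage: as each task's finetuned model arrives, fold it into the deployed merge.
# checkpoints = [model_t1.state_dict(), model_t2.state_dict(), ...]  # hypothetical
checkpoints: list[dict] = []
merged = None
for t, state in enumerate(checkpoints, start=1):
    merged = state if t == 1 else running_merge(merged, state, t)
```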
🚨Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨
Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.
🔎 arxiv.org/abs/2412.06745
🤔 Can you turn your vision-language model from a great zero-shot model into a great-at-any-shot generalist?
Turns out you can, and here is how: arxiv.org/abs/2411.15099
Really excited to share this work on multimodal pretraining for my first bluesky entry!
🧵 A short and hopefully informative thread:
🚀New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674
Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!
We show how data curation can act as an effective distillation strategy, yielding SoTA FLOP-efficient {C/Sig}LIPs!
🧵👇
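For intuition, a hedged sketch of model-based data selection ("active curation"): score candidate examples by how much easier a strong reference model finds them than the current learner, and keep only the top fraction for training. Whether this matches the paper's exact selection criterion is an assumption; it only illustrates the general recipe.
```python
import torch

def select_batch(learner_losses: torch.Tensor,
                 reference_losses: torch.Tensor,
                 keep_fraction: float = 0.2) -> torch.Tensor:
    """Return indices of examples with the highest 'learnability' score:
    currently hard for the learner, but easy for the reference (teacher) model."""
    scores = learner_losses - reference_losses
    k = max(1, int(keep_fraction * len(scores)))
    return torch.topk(scores, k).indices

# Hypothetical per-example contrastive losses for a candidate super-batch:
learner = torch.tensor([2.1, 0.3, 1.7, 0.9])
reference = torch.tensor([0.4, 0.2, 1.5, 0.1])
print(select_batch(learner, reference, keep_fraction=0.5))  # indices of the 2 most learnable examples
```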