Sugar-shack with the lab!! 🍁
Saying goodbye to a very long winter, and welcoming sunnier days 🤩🍃
Posts by Chandar Research Lab
🗣️ Shoutout to the authors: Pranshu Malviya, Balaraman Ravindran and @sarath-chandar.bsky.social!!! (published at CoLLAs 2022).
🔗 Learn more at: lnkd.in/eSSd9m56
This was the first work to show that adaptive gradient optimizers can be used successfully for lifelong learning and even beat Stochastic Gradient Descent (in final accuracy: RMSProp < SGD < TAG-RMSProp!).
🔥 Across benchmarks, TAG was shown to improve final accuracy over baselines like ER and A-GEM.
🔥 TAG tracks gradient traces from previously learned tasks and estimates task similarity:
⬇️ Lower α → related tasks → transfer is encouraged
⬆️ Higher α → conflicting tasks → reduce interference
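The α-modulation described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function name, the similarity-to-α mapping, and all hyperparameters are my assumptions.

```python
import numpy as np

def tag_rmsprop_step(param, grad, second_moment, task_trace, prev_traces,
                     lr=1e-3, beta=0.99, b=5.0, eps=1e-8):
    """One hypothetical TAG-style RMSProp step (illustrative only).

    alpha is derived from the similarity between the current task's
    gradient trace and stored traces of previous tasks: related tasks
    (high similarity) -> lower alpha -> larger effective step (transfer
    encouraged); conflicting tasks -> higher alpha -> smaller step
    (interference reduced).
    """
    # Standard RMSProp second-moment accumulator.
    second_moment = beta * second_moment + (1 - beta) * grad**2

    # Cosine similarity between the current trace and each stored trace.
    sims = [np.dot(task_trace, t)
            / (np.linalg.norm(task_trace) * np.linalg.norm(t) + eps)
            for t in prev_traces]
    # Map mean similarity to alpha: sim=+1 -> small alpha, sim=-1 -> large alpha.
    alpha = np.exp(-b * np.mean(sims)) if sims else 1.0

    param = param - (lr / alpha) * grad / (np.sqrt(second_moment) + eps)
    return param, second_moment
```

With a stored trace pointing the same way as the current one, α shrinks and the step grows; flip the trace's sign and the step is damped instead.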
In lifelong learning, acquiring new tasks can cause ML models to forget previously learned knowledge. For this, our lab introduced TAG (Task-based Accumulated Gradients), a general wrapper on top of adaptive gradient optimizers. 📈
As we hope to see women thrive in research more with each passing year, we encourage them all to apply to our lab for internships, Master’s or PhD degrees with
@sarath-chandar.bsky.social !
The Chandar Research Lab remains committed to supporting women and other underrepresented communities @mila-quebec.bsky.social and in ML, with initiatives such as the graduate application assistance program and a Computer Science summer school for high school students heading to undergrad. 👩🔬
This is only a sneak peek at the work they did last year, as much of their research is still under submission. Stay tuned for more exciting papers spanning ML for biology, model merging, continual learning, and more!
Generalization Can Emerge in Tabular Foundation Models From a Single Table by Nour Shaheen at the AI for Tabular Data workshop @euripsconf.bsky.social 2025!
arxiv.org/abs/2511.09665
The Expressive Limits of Diagonal SSMs for State-Tracking by Behnoush Khavari @iclr-conf.bsky.social 2026.
iclr.cc/virtual/2026...
NeoBERT: A Next Generation BERT by @lola-le-breton.bsky.social published @tmlr-pub.bsky.social and @iclr-conf.bsky.social in Rio this year.
arxiv.org/abs/2502.19587
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models by Istabrak Abbes @collasconf.bsky.social
arxiv.org/abs/2508.01908
📜 Small Encoders Can Rival Large Decoders in Detecting Groundedness by Istabrak Abbes, published at
@aclmeeting.bsky.social 2025.
aclanthology.org/2025.finding...
Maryam Hashemzadeh, @lola-le-breton.bsky.social, Istabrak Abbes, Nour Shaheen, Behnoush Khavari, Anabel Tan and @katelobacheva.bsky.social. Give them a follow and look at this list of their publications with our lab in the past year!⬇️
This week, as we celebrated International Women’s Rights Day for the 115th time on Sunday, the Chandar Lab wanted to pay tribute to all the amazing women doing research 👩🎓 and to highlight the cutting-edge work they do at our lab every day... 🧵
Work done by @nilaksh404.bsky.social, Antoine Clavaud, @mreymond.bsky.social, Francois Rivest, and @sarath-chandar.bsky.social
Check out the paper at: arxiv.org/abs/2602.09396
Code: github.com/chandar-lab/...
Look at the latents! t-SNE analysis shows that our method (top) learns structured, temporally coherent representations faster than standard streaming RL.
Our method systematically outperforms existing baselines across Atari, MinAtar, and Octax. The best part? It remains efficient enough to train on just a few CPU cores.
Streaming data is highly correlated, which usually causes poor training. To fix this, we introduced Orthogonal Gradient Updates. By projecting gradients onto a subspace orthogonal to their history, we keep learning stable and effective.
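The projection step described above can be sketched as follows. This is a minimal toy version: the Gram-Schmidt basis maintenance and the basis-size cap are my assumptions, not the paper's exact method.

```python
import numpy as np

def project_orthogonal(grad, history_basis):
    """Project grad onto the subspace orthogonal to past update directions.

    history_basis: list of orthonormal vectors spanning the history
    subspace. Removing the components along past updates counteracts the
    strong temporal correlation of streaming data.
    """
    g = grad.copy()
    for u in history_basis:
        g -= np.dot(g, u) * u
    return g

def update_basis(history_basis, grad, max_vectors=8, eps=1e-8):
    """Gram-Schmidt step: add the novel component of grad to the basis,
    keeping at most max_vectors directions."""
    residual = project_orthogonal(grad, history_basis)
    norm = np.linalg.norm(residual)
    if norm > eps:
        history_basis.append(residual / norm)
    return history_basis[-max_vectors:]
```

For example, with history direction (1, 0), the gradient (1, 1) is projected to (0, 1): only the component the agent has not already moved along survives.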
We bring Self-Predictive Representations (SPR) to the streaming pipeline. By predicting future latent states, we force the encoder to learn much richer features from every observed frame: without needing a massive memory footprint of a replay buffer.
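A hypothetical sketch of the self-predictive objective, with linear maps standing in for real networks; the shapes, names, and loss form are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def spr_loss(encoder, transition, obs_t, obs_tp1, action_emb):
    """Toy self-predictive representation loss.

    The encoder maps observations to latents; the transition model
    predicts the next latent from the current latent plus the action
    embedding. Minimizing the prediction gap forces the encoder to
    extract richer features from each frame, with no replay buffer.
    """
    z_t = encoder @ obs_t
    z_tp1_target = encoder @ obs_tp1  # would be a stop-gradient/EMA target in practice
    z_tp1_pred = transition @ np.concatenate([z_t, action_emb])
    diff = z_tp1_pred - z_tp1_target
    return float(np.mean(diff**2))
```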
Without a replay buffer, streaming agents struggle to build meaningful representations. Traditional value-based losses alone can’t exploit the full informational content of transient data before it's gone.
Streaming Reinforcement Learning (RL) is a huge challenge: transitions are used once and discarded immediately. This makes agents extremely sample-inefficient. But what if we could "squeeze" more information out of every single frame?
Check out our latest paper!
Shoutout to the authors: Kamran Chitsaz, Milad Aghajohari, @a-kazemnejad.bsky.social. Supervised by: @sarath-chandar.bsky.social, @murefil.bsky.social, Aaron Courville and @sivareddyg.bsky.social
🔗 Learn more at: arxiv.org/abs/2510.06557
🔗Build with: github.com/McGill-NLP/the-markovian-thinker
🧩 Even state-of-the-art models exhibit Markovian Thinking zero-shot: both GPT-oss-120B and Qwen3-30B-A3B match LongCoT performance with no special prompting or training, and yield plenty of in-distribution positive samples at initialization, so RL with Delethink is primed to scale!!
🔥 Further, we scaled DeepSeek R1-1.5B to a thinking budget of 96K in 150 RL steps. Accuracy jumped, with mean trace lengths at around 40K tokens.
Markovian Thinking is instantiated by Delethink, an RL environment. With it, we trained DeepSeek R1-1.5B and demonstrated:
1️⃣ The same scaling as LongCoT-RL, but at lower costs,
2️⃣ Better test-time scaling, improving past 24K tokens, while LongCoT-RL plateaus.
3️⃣ All this while keeping linear costs!!
Markovian Thinking works by:
1️⃣ Making LLMs reason in 8K chunks.
2️⃣ At each boundary, context is reset and a small textual state from the last chunk is carried over.
🔃 Generation then continues from that state.
✅ This decouples thinking length from context size, achieving linear compute and constant memory!
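The loop above can be sketched roughly as follows. The function names, the `[state]` delimiter, and the `FINAL:` convention are my assumptions for illustration; the real Delethink environment lives in the linked repo.

```python
def markovian_generate(generate_chunk, extract_state, question,
                       chunk_tokens=8192, max_chunks=12):
    """Chunked reasoning sketch: at each boundary the context is reset
    and only a short textual state is carried over, so memory stays
    constant and total compute grows linearly in the number of chunks."""
    state = ""
    for _ in range(max_chunks):
        # The prompt holds only the question plus the small carried state,
        # never the full reasoning trace so far.
        prompt = f"{question}\n[state]{state}[/state]"
        chunk = generate_chunk(prompt, max_tokens=chunk_tokens)
        if "FINAL:" in chunk:
            return chunk.split("FINAL:", 1)[1].strip()
        state = extract_state(chunk)  # e.g. the chunk's last few sentences
    return state
```

Because the prompt length is bounded by the chunk size plus the carried state, attention cost per chunk is constant, which is where the linear-compute, constant-memory property comes from.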
‘The Markovian Thinker’, developed by our lab, has been accepted at @iclr-conf.bsky.social! This work achieves long reasoning without the quadratic attention tax by making LLMs reason in chunks with a bounded state, delivering linear compute, constant memory, and scaling beyond its training limits! 🔥
📝 openreview.net/forum?id=5bg...
Joint work of Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh and @sarath-chandar.bsky.social @mila-quebec.bsky.social .
Takeaways for architecture design:
- Diagonal structure imposes a precise group-theoretic ceiling on expressivity
- Depth helps in a principled way (one layer per Abelian factor)
- But training algorithms need to catch up: expressivity alone isn't enough