It's incredible how much mileage you get out of simple ideas, once you understand where the original method breaks!
Read the full paper here: arxiv.org/abs/2504.08934
This work is led by our brilliant intern Aleks Petrov, with Mark Sandler, Andrey Zhmoginov, and Nolan Miller as co-authors.
Posts by Max Vladymyrov
The result is GistPool:
✅ Matches or beats average pooling;
✅ Fixes the issues with Gisting;
✅ Achieves compression using in-context learning with no complex model modifications, so it's easy to implement and deploy at scale.
It's fast, simple, and works across datasets and compression rates.
So we added an inductive bias:
• Spread gist tokens across the context
• Restrict each gist token to attend only to its own pooling window
This tiny masking tweak nudges the model toward pooling, and suddenly performance shoots up.
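The windowed mask described above can be sketched roughly like this (a minimal illustration with made-up names, not the paper's code): each gist token is placed after its window of `w` context tokens and may attend only to that window, which biases it toward local pooling.

```python
import numpy as np

def windowed_gist_mask(n_windows: int, w: int) -> np.ndarray:
    """Hypothetical sketch: layout per window is w context tokens
    followed by 1 gist token; True means attention is allowed."""
    n = n_windows * (w + 1)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal base
    for k in range(n_windows):
        g = k * (w + 1) + w                 # position of gist token k
        mask[g, :] = False                  # clear everything...
        mask[g, k * (w + 1):g] = True       # ...keep its own window
        mask[g, g] = True                   # ...and itself
    return mask

m = windowed_gist_mask(n_windows=2, w=3)
# Gist token at index 7 sees only positions 4-7, never window 0.
```

Context tokens themselves keep ordinary causal attention here; only the gist rows are restricted, which is the "tiny masking tweak" the thread refers to.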
Even after fixing these issues, Gisting still couldn't match average pooling. Why? Because standard attention can't learn average pooling well. We show both experimentally and theoretically that attention struggles with simple operations like copying and pooling!
Second issue: conflicting objectives. You're asking the same parameters to summarize and to do inference. These clash. We split the two: one set of parameters for compressing context, another for predicting answers. That helped a lot. But still not enough.
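A toy sketch of that parameter split (the names and the additive-delta structure are my assumption, not the paper's API): the same base weights get two separate trainable deltas, one applied when compressing the context, the other when predicting answers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
base = rng.standard_normal((d, d))
delta_compress = np.zeros((d, d))  # trained on the summarization objective
delta_predict = np.zeros((d, d))   # trained on the answering objective

def forward(x: np.ndarray, role: str) -> np.ndarray:
    """One linear layer standing in for the model; the role picks the delta."""
    delta = delta_compress if role == "compress" else delta_predict
    return x @ (base + delta)

ctx = rng.standard_normal((3, d))
gist = forward(ctx, role="compress")    # summarize with one parameter set
answer = forward(gist, role="predict")  # answer with the other
```

With the deltas at zero the two roles coincide; training each delta on its own objective is what resolves the clash described above.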
First culprit: information flow delay. Gist tokens summarizing layer i only get the summary to the model at layer i+2. That's too late: the model expects the information a layer earlier. When we shifted activations down, performance improved immediately!
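The fix can be caricatured in a few lines (indices and naming are mine, purely illustrative): instead of the layer-i summary first being usable at layer i+2, the cached gist activations are shifted down one layer so the summary is read where the model expects it.

```python
def shift_down(gist_per_layer):
    """gist_per_layer[i] is the gist summary of context at layer i.
    Return a cache where layer i+1 reads the layer-i summary,
    i.e. one layer earlier than the default i+2 delivery."""
    return {i + 1: acts for i, acts in enumerate(gist_per_layer)}

cache = shift_down(["sum@0", "sum@1", "sum@2"])
# cache[1] == "sum@0": layer 1 now consumes the summary computed at layer 0.
```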
Surprisingly, even a simple average pooling baseline, just a mean over token activations, beats Gisting by a mile. That shouldn't happen. Why can't Gisting at least learn to emulate average pooling?
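For concreteness, the average-pooling baseline is nothing more than this (shapes and the name `avg_pool` are illustrative, not from the paper's code): compress a sequence of hidden activations by averaging each window of `rate` tokens.

```python
import numpy as np

def avg_pool(activations: np.ndarray, rate: int) -> np.ndarray:
    """activations: (seq_len, d_model) -> (seq_len // rate, d_model)."""
    seq_len, d_model = activations.shape
    usable = (seq_len // rate) * rate  # drop a ragged tail for simplicity
    windows = activations[:usable].reshape(-1, rate, d_model)
    return windows.mean(axis=1)       # one averaged vector per window

x = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, d_model = 2
pooled = avg_pool(x, rate=3)                  # 3x compression -> shape (2, 2)
```

That a parameter-free mean like this outperforms a learned compressor is exactly the puzzle the thread digs into.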
Gisting (Mu et al., 2023) offers a starting point: just add "gist" tokens and adjust the attention mask to funnel context through them. This creates an attention bottleneck, forcing summarization into these tokens.
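The Gisting bottleneck amounts to a masking pattern like the following (a simplified sketch under my own sequence layout, not Mu et al.'s implementation): answer tokens are cut off from the raw context and can only reach it through the gist tokens.

```python
import numpy as np

def gisting_mask(n_ctx: int, n_gist: int, n_ans: int) -> np.ndarray:
    """Sequence layout: [context | gist | answer]; True = attention allowed.
    Gist tokens see the context; answer tokens see only gist and answer."""
    n = n_ctx + n_gist + n_ans
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    mask[n_ctx + n_gist:, :n_ctx] = False        # answers can't see raw context
    return mask

mask = gisting_mask(n_ctx=4, n_gist=2, n_ans=2)
# Gist rows (4-5) still attend to the context; answer rows (6-7) do not,
# so all context information must flow through the two gist positions.
```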
But it turns out that this method breaks down when compressing more than just a few tokens.
Handling long contexts efficiently is a major hurdle for LLMs! While models support ever longer windows, cost and effectiveness remain challenging.
Excited to share our paper on in-context compression for long contexts.
Check out his thread and the paper below!
arxiv.org/abs/2504.08934
We all want LLMs to collaborate with humans to help them achieve their goals. But LLMs are not trained to collaborate; they are trained to imitate. Can we teach LM agents to help humans by first making them help each other?
arxiv.org/abs/2503.14481
Amazing work! Do you think it would be possible to express general learning algorithms using ALTA, e.g. gradient descent?