Thanks a lot to all my amazing co-authors @alessiodevoto.bsky.social @sscardapane.bsky.social @yuzhaouoe.bsky.social @neuralnoise.com Eric de la Clergerie @bensagot.bsky.social
And a special thanks to @edoardo-ponti.bsky.social for the academic visit that made this work possible!
Posts by Simone Scardapane
Will present this at #CVPR ✈️ See you in Nashville 🇺🇸!
Kudos to the team!
Antonio A. Gargiulo, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà.
Please share it within your circles! edin.ac/3DDQK1o
New Paper Alert!
We introduce Q-Filters, a training-free method for efficient KV Cache compression!
It is compatible with FlashAttention and can compress during generation, which is particularly useful for reasoning models ⚡
TL;DR: we make Streaming-LLM smarter using the geometry of attention
Q-Filters is very efficient, which allows streaming compression at virtually no latency cost, just like Streaming-LLM...
...but it is also much better at retaining relevant KV pairs compared to fast alternatives (and can even beat slower algorithms such as SnapKV)
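For intuition, here is a minimal numpy sketch of the idea, with all names mine: estimate a dominant query direction offline (via an SVD of query representations, in the paper's spirit), then keep only the KV pairs whose keys project highest onto it.

```python
import numpy as np

def qfilter_direction(Q):
    """Estimate a dominant query direction via SVD (computed once, offline)."""
    _, _, Vt = np.linalg.svd(Q, full_matrices=False)
    u = Vt[0]
    # orient the direction so most queries project positively onto it
    if (Q @ u).mean() < 0:
        u = -u
    return u

def compress_kv(K, V, u, keep):
    """Keep the `keep` KV pairs whose keys project highest onto u."""
    scores = K @ u
    idx = np.argsort(scores)[-keep:]
    idx.sort()  # preserve positional order of the surviving pairs
    return K[idx], V[idx]
```

Because scoring is a single matrix-vector product per step, this kind of filter adds essentially no latency on top of generation.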
*Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces*
by @maclarke.bsky.social et al.
Studies the co-occurrence of SAE features and how they can be understood as composite / ambiguous concepts.
www.lesswrong.com/posts/WNoqEi...
*Weighted Skip Connections are Not Harmful for Deep Nets*
by @rupspace.bsky.social
Cool blog post "in defense" of weighted variants of ResNets (aka HighwayNets), as a follow-up to a previous post by @giffmana.ai.
rupeshks.cc/blog/skip.html
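A weighted skip connection in the Highway style gates between the identity path and the transformation; a minimal numpy sketch (function and weight names mine):

```python
import numpy as np

def highway_layer(x, W_h, W_t):
    """Highway-style weighted skip: y = t * h(x) + (1 - t) * x,
    where the gate t is learned per-unit from the input."""
    t = 1.0 / (1.0 + np.exp(-(x @ W_t)))   # transform gate in (0, 1)
    h = np.tanh(x @ W_h)                   # candidate transformation
    return t * h + (1.0 - t) * x
```

With the gate at 0 the layer is a pure identity, which is the property the "not harmful" argument leans on.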
*CAT: Content-Adaptive Image Tokenization*
by @junhongshen1.bsky.social @lukezettlemoyer.bsky.social et al.
They use an LLM to predict a "complexity score" for each image, which in turn decides the size of its VAE latent representation.
arxiv.org/abs/2501.03120
*Accurate predictions on small data with a tabular foundation model*
by Noah Hollmann et al.
A transformer for tabular data that takes an entire training set as input and provides predictions - trained on millions of synthetic datasets.
www.nature.com/articles/s41...
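The in-context interface can be illustrated by how one episode is packed: the whole labeled training set plus unlabeled test rows go in as a single sequence. A toy sketch with invented names (the real model uses learned per-cell encoders, this only shows the shape of the input):

```python
import numpy as np

def pack_episode(X_train, y_train, X_test):
    """Pack a (train set, test queries) episode into one token sequence.
    Each row becomes one token: [features..., label, is_train_flag];
    test labels are zero-masked so the transformer must predict them."""
    n_tr = X_train.shape[0]
    tr = np.concatenate([X_train, y_train[:, None], np.ones((n_tr, 1))], axis=1)
    te = np.concatenate([X_test,
                         np.zeros((len(X_test), 1)),   # masked label
                         np.zeros((len(X_test), 1))],  # is_train = 0
                        axis=1)
    return np.concatenate([tr, te], axis=0)  # (n_tr + n_te, d + 2)
```

Prediction is then a single forward pass over this sequence; no gradient steps happen at inference time.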
*Insights on Galaxy Evolution from Interpretable Sparse Feature Networks*
by @jwuphysics.bsky.social
Integrates a sparse dictionary step on the last layer of a CNN to obtain a set of interpretable features on multiple astronomical prediction tasks.
arxiv.org/abs/2501.00089
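A rough numpy sketch of a sparse dictionary head on top of backbone features (names and the top-k sparsification choice are mine):

```python
import numpy as np

def sparse_feature_head(h, W_dict, w_out, k=8):
    """Project backbone features h onto an overcomplete dictionary, keep
    only the top-k activations (the interpretable features), then
    predict with a linear readout."""
    z = np.maximum(h @ W_dict, 0.0)                  # non-negative activations
    thresh = np.sort(z, axis=-1)[..., -k][..., None]
    z_sparse = np.where(z >= thresh, z, 0.0)         # top-k sparsification
    return z_sparse @ w_out, z_sparse
```

Each prediction is then a sum of a handful of dictionary atoms, which is what makes per-feature inspection tractable.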
*Round and Round We Go! What makes Rotary Positional Encodings useful?*
by @petar-v.bsky.social et al.
They show RoPE has distinct behavior for different rotation angles - high freq for position, low freq for semantics.
arxiv.org/abs/2410.06205
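RoPE itself is easy to write down; a minimal numpy version, where each channel pair is rotated at its own frequency (early pairs fast, late pairs slow — the axis their analysis runs along):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embeddings: each pair of channels is rotated by a
    position-dependent angle, with a different frequency per pair."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation speeds
    ang = positions[:, None] * inv_freq[None, :]   # (seq, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property: dot products between rotated queries and keys depend only on the relative offset, and rotations leave vector norms untouched.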
*Cautious Optimizers: Improving Training with One Line of Code*
by Liang et al.
Adding a simple masking operation to momentum-based optimizers can significantly boost their speed.
arxiv.org/abs/2411.16085
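The masking trick fits in a few lines. A numpy sketch, assuming the usual convention that the update is a descent quantity like the gradient (the exact rescaling constant is my choice, in the spirit of the paper):

```python
import numpy as np

def cautious_update(update, grad, eps=1e-8):
    """Cautious masking: zero out update components that point against the
    current gradient, then rescale so the mean update magnitude is kept."""
    mask = (update * grad > 0).astype(update.dtype)
    return update * mask * (mask.size / (mask.sum() + eps))
```

Applied to, e.g., the Adam update before the parameter step, this is the "one line of code" the title refers to.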
*Byte Latent Transformer: Patches Scale Better Than Tokens*
by @artidoro.bsky.social et al.
Trains a small encoder to dynamically aggregate bytes into tokens, which are input to a standard autoregressive model. Nice direction!
arxiv.org/abs/2412.09871
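A toy version of entropy-based patching, using bigram counts as a stand-in for the small byte-level LM the paper trains to decide patch boundaries (all names mine):

```python
import numpy as np
from collections import defaultdict, Counter

def next_byte_entropy(data):
    """Entropy (bits) of each byte given the previous one, estimated from
    bigram counts over the data itself (stand-in for a small byte LM)."""
    ctx = defaultdict(Counter)
    for a, b in zip(data, data[1:]):
        ctx[a][b] += 1
    ent = [0.0]  # first byte has no context
    for a, _ in zip(data, data[1:]):
        p = np.array(list(ctx[a].values()), dtype=float)
        p /= p.sum()
        ent.append(float(-(p * np.log2(p)).sum()))
    return ent

def entropy_patches(data, threshold=1.0):
    """Start a new patch whenever the next byte is hard to predict."""
    ent = next_byte_entropy(data)
    patches, cur = [], [data[0]]
    for byte, e in zip(data[1:], ent[1:]):
        if e > threshold:
            patches.append(bytes(cur)); cur = []
        cur.append(byte)
    patches.append(bytes(cur))
    return patches
```

Predictable spans get grouped into long patches, so compute concentrates where the byte stream is surprising.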
*Understanding Gradient Descent through the Training Jacobian*
by @norabelrose.bsky.social @eleutherai.bsky.social
Analyzes training through the spectrum of the "training Jacobian" (the derivative of trained weights w.r.t. initial weights), identifying a large inactive subspace.
arxiv.org/abs/2412.07003
*Mixture of A Million Experts*
by Xu Owen He
Scales a MoE architecture up to millions of experts by implementing a fast retrieval method in the router, inspired by recent MoE scaling laws.
arxiv.org/abs/2407.04153
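The fast router resembles product-key retrieval: score the two halves of the query against sqrt-size sub-key tables, then search only the small candidate grid. A numpy sketch (names mine); the candidate grid provably contains the true top-k:

```python
import numpy as np

def product_key_topk(q, K1, K2, k):
    """Score the two halves of q against small sub-key tables and search
    only a k*k candidate grid instead of all n1*n2 expert keys."""
    d = q.shape[0] // 2
    s1, s2 = K1 @ q[:d], K2 @ q[d:]                    # per-half scores
    i1, i2 = np.argsort(s1)[-k:], np.argsort(s2)[-k:]  # top-k per half
    grid = s1[i1][:, None] + s2[i2][None, :]           # k x k candidates
    flat = np.argsort(grid, axis=None)[-k:]            # exact top-k over them
    r, c = np.unravel_index(flat, grid.shape)
    return i1[r] * K2.shape[0] + i2[c], grid[r, c]     # flat expert ids, scores
```

With a million experts this turns a million-way scoring step into two thousand-way ones plus a tiny grid search.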
*Restructuring Vector Quantization with the Rotation Trick*
by Fifty et al.
Replaces the "closest codebook" operation in vector quantization with rotation and rescaling operations to improve the back-propagation of gradients.
arxiv.org/abs/2410.06424
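The rotation part can be sketched in numpy: build the rotation taking the encoder output's direction to its quantized code's direction; in the trick this matrix (and the rescaling) is treated as a constant during backprop, so gradients reach the encoder through a rotation rather than a straight-through copy.

```python
import numpy as np

def rotate_to(e, q, eps=1e-8):
    """Rotation matrix R with R @ (e/|e|) = q/|q|, via the closed-form
    two-vector rotation. Degenerate when e and q are exactly antipodal."""
    a = e / (np.linalg.norm(e) + eps)
    b = q / (np.linalg.norm(q) + eps)
    s = a + b
    return np.eye(len(a)) - np.outer(s, s) / (1.0 + a @ b) + 2.0 * np.outer(b, a)
```

The matrix is orthogonal, so it preserves the geometry of the gradient instead of just copying it across the quantization step.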
*On the Surprising Effectiveness of Attention Transfer for Vision Transformers*
by Li et al.
Shows that distilling attention patterns in ViTs is competitive with standard fine-tuning.
arxiv.org/abs/2411.09702
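The distillation objective is simple to sketch; a single-head numpy version with an MSE between attention maps (one natural loss choice, names mine):

```python
import numpy as np

def attention_map(Q, K):
    """Softmax attention pattern for one head (numerically stable)."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def attention_transfer_loss(Qs, Ks, Qt, Kt):
    """Make the student's attention pattern match the teacher's;
    values and MLPs are left free to adapt to the downstream task."""
    return float(((attention_map(Qs, Ks) - attention_map(Qt, Kt)) ** 2).mean())
```

The surprising finding is that matching only these patterns recovers most of the benefit of transferring the full pre-trained weights.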
*The Super Weight in Large Language Models*
by Yu et al.
Identifies single weights in LLMs whose removal destroys inference quality. Tracks their mechanisms through the LLM and proposes quantization-specific techniques.
arxiv.org/abs/2411.07191
*The Surprising Effectiveness of Test-Time Training for Abstract Reasoning*
by @ekinakyurek.bsky.social et al.
Shows that test-time training (fine-tuning at inference time) strongly improves performance on the ARC dataset.
arxiv.org/abs/2411.07279
Our paper "A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion" is out as a preprint!
By myself, @sscardapane.bsky.social, @rgring.bsky.social and @lanalpa.bsky.social
arxiv.org/abs/2501.07451
*Large Concept Models*
by Barrault et al.
Builds an autoregressive model in a "concept" space by wrapping the LLM in a pre-trained sentence embedder (also works with diffusion models).
arxiv.org/abs/2412.08821
"Task Singular Vectors: Reducing Task Interference in Model Merging" by Antonio Andrea Gargiulo, @crisostomi.bsky.social, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà
Paper: arxiv.org/abs/2412.00081
Code: github.com/AntoAndGar/t...
#machinelearning
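A minimal numpy sketch of the per-layer SVD truncation step (the paper additionally handles interference between the tasks' singular vectors; function names mine):

```python
import numpy as np

def low_rank_task_vector(W_task, W_base, rank):
    """Keep only the top singular directions of a layer's task matrix
    (fine-tuned weights minus base weights)."""
    U, S, Vt = np.linalg.svd(W_task - W_base, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def merge(W_base, task_mats, rank):
    """Merge by adding each task's low-rank component to the base weights."""
    return W_base + sum(low_rank_task_vector(W, W_base, rank) for W in task_mats)
```

Truncating each task to its leading singular directions is what leaves room for the different tasks to coexist in the merged weights.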
*Adaptive Length Image Tokenization via Recurrent Allocation*
by @phillipisola.bsky.social et al.
An encoder that compresses an image into a sequence of 1D tokens, whose length can dynamically vary depending on the specific image.
arxiv.org/abs/2411.02393
*Deep Learning Through A Telescoping Lens*
by @alanjeffares.bsky.social @aliciacurth.bsky.social
Shows that tracking 1st-order approximations to the training dynamics provides insights into many phenomena (e.g., double descent, grokking).
arxiv.org/abs/2411.00247
*MoE Graph Transformers for Interpretable Particle Collision Detection*
by @alessiodevoto.bsky.social @sgiagu.bsky.social et al.
We propose a MoE graph transformer for particle collision analysis, with many nice interpretability insights (e.g., expert specialization).
arxiv.org/abs/2501.03432
*A Meticulous Guide to Advances in Deep Learning Efficiency over the Years* by Alex Zhang
Part deep learning history, part overview on the vast landscape of "efficiency" in DL (hardware, compilers, architecture, ...). Fantastic post!
alexzhang13.github.io/blog/2024/ef...
First little project of the year: an awesome collection of papers on Dynamic Neural Networks for Computer Vision and Sensor Fusion!
Each paper comes with a brief summary and code link.
github.com/DTU-PAS/awes...
Don't miss out on these insights and more: check out the paper!
Preprint: arxiv.org/abs/2412.00081
Code: github.com/AntoAndGar/t...
Joint work w/ Antonio A. Gargiulo, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà.
(6/6)
*Modular Duality in Deep Learning*
by Jeremy Bernstein et al.
Develops a theory of "modular duality" for designing principled optimizers that respect the "type semantics" of each layer.
arxiv.org/abs/2410.21265
*Understanding Visual Feature Reliance through the Lens of Complexity*
by @thomasfel.bsky.social @louisbethune.bsky.social @lampinen.bsky.social
Wonderful work! They rank features' complexity with a variant of mutual information, before analyzing their dynamics.
arxiv.org/abs/2407.06076