
Posts by bourbaki7

Video models are zero-shot learners and reasoners The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation...

arxiv.org/abs/2509.20328

6 months ago 1 0 0 0
LIMI: Less is More for Agency We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement...

arxiv.org/abs/2509.17567

6 months ago 2 0 0 0
Pre-training under infinite compute Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that exi...

arxiv.org/abs/2509.14786

7 months ago 2 0 0 0
The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an a...

arxiv.org/abs/2509.10167

7 months ago 0 0 0 0
Causal Attention with Lookahead Keys In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism ...

arxiv.org/abs/2509.07301

7 months ago 1 0 0 0
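The CASTLE abstract contrasts its lookahead keys with the standard causal-attention baseline it describes, where each token's Q/K/V are static and attention is restricted to preceding positions. A minimal NumPy sketch of that baseline only (not of CASTLE itself; the function name and single-head shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Standard causal attention for one head.

    Q, K, V: [T, d] arrays. Each token's K/V are computed once and
    never updated, and token t attends only to positions <= t.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly future positions
    scores[future] = -np.inf                            # masked out before softmax
    return softmax(scores, axis=-1) @ V
```

Since position 0 can only attend to itself, its output is exactly `V[0]` — a quick sanity check that the mask is causal.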
Cautious Optimizers: Improving Training with One Line of Code AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we...

Not a new paper, but I hadn't seen it until now

arxiv.org/abs/2411.16085

8 months ago 0 0 0 0
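As I read the Cautious Optimizers paper, the "one line" masks out optimizer-update components whose sign disagrees with the current gradient, rescaling the survivors to preserve average magnitude. A hedged NumPy sketch (the function name, `eps`, and the exact rescaling are my reading, not verbatim from the paper):

```python
import numpy as np

def cautious_update(update, grad, eps=1e-8):
    # Keep only components where the optimizer update and the current
    # gradient agree in sign (update * grad > 0); zero the rest.
    mask = (update * grad > 0).astype(update.dtype)
    # Rescale so the surviving components compensate for the zeroed ones.
    scale = mask.size / (mask.sum() + eps)
    return update * mask * scale

# toy example: the middle component opposes the gradient and is dropped
u = np.array([0.5, -0.2, 0.1])
g = np.array([1.0, 1.0, 1.0])
print(cautious_update(u, g))
```

In practice this wraps an existing optimizer (e.g. AdamW): compute the usual update, then apply the mask before the parameter step.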
Fast and Simplex: 2-Simplicial Attention in Triton Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...

arxiv.org/abs/2507.02754

8 months ago 0 0 0 0
Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large lang...

arxiv.org/abs/2507.10613

8 months ago 3 2 0 0
Training Transformers with Enforced Lipschitz Constants Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and ove...

arxiv.org/abs/2507.13338

8 months ago 0 0 0 0
Scaling Laws for Optimal Data Mixtures Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard appro...

arxiv.org/abs/2507.09404

9 months ago 0 0 0 0
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model y...

arxiv.org/abs/2507.06261

9 months ago 1 0 0 1
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful archite...

arxiv.org/abs/2507.07955

9 months ago 0 0 0 0
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient s...

arxiv.org/abs/2506.19697

9 months ago 0 0 0 0
Thought Anchors: Which LLM Reasoning Steps Matter? Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each gene...

arxiv.org/abs/2506.19143

9 months ago 0 0 0 0
Hardware-Efficient Attention for Fast Decoding LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...

arxiv.org/abs/2505.21487

9 months ago 0 0 0 0
any4: Learned 4-bit Numeric Representation for LLMs We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. a...

www.arxiv.org/abs/2507.04610

9 months ago 0 0 0 0
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, a...

arxiv.org/abs/2507.02119

9 months ago 0 0 0 0
Characterization and Mitigation of Training Instabilities in Microscaling Formats Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware ac...

arxiv.org/abs/2506.20752

9 months ago 0 0 0 0
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_...$

arxiv.org/abs/2501.11873

9 months ago 0 0 0 0
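The preview truncates the definition after $N_E \sum_...$; the standard MoE load-balancing loss it presumably expands to (the Switch Transformer-style form; the paper's exact variant may differ) multiplies the fraction of tokens routed to each expert by that expert's mean gate probability:

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_assignment, num_experts):
    """LBL = N_E * sum_i f_i * p_i (assumed standard form).

    gate_probs:        [num_tokens, num_experts] softmax router outputs
    expert_assignment: [num_tokens] index of the expert each token is sent to
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # p_i: mean router probability mass on expert i
    p = gate_probs.mean(axis=0)
    return num_experts * np.sum(f * p)
```

For a perfectly uniform router the loss equals 1, its minimum; imbalanced routing pushes it higher, which is what makes it a balancing regularizer.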
A Statistical Physics of Language Model Reasoning Transformer LMs show emergent reasoning that resists mechanistic understanding. We offer a statistical physics framework for continuous-time chain-of-thought reasoning dynamics. We model sentence-leve...

arxiv.org/abs/2506.04374

9 months ago 0 0 0 0
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have be...

arxiv.org/abs/2506.14913

9 months ago 1 0 0 0
OpenThoughts: Data Recipes for Reasoning Models Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-th...

arxiv.org/abs/2506.04178

9 months ago 0 0 0 0
Essential-Web v1.0: 24T tokens of organized web data Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We ...

arxiv.org/abs/2506.14111

10 months ago 0 0 0 0
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...

arxiv.org/abs/2506.05209

10 months ago 0 0 0 0
Scaling Laws of Motion Forecasting and Planning -- A Technical Report We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a 500 ...

arxiv.org/abs/2506.08228

10 months ago 0 0 0 0
New Insights for Scaling Laws in Autonomous Driving Many recent AI breakthroughs have followed a common pattern: bigger models, trained on more data, with more compute, often deliver extraordinary gains. Waymo’s latest study explores whether this trend...

waymo.com/blog/2025/06...

10 months ago 0 0 0 0
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to t...

arxiv.org/abs/2506.08300

10 months ago 0 0 0 0
Text-to-LoRA: Instant Transformer Adaption While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repea...

arxiv.org/abs/2506.06105

10 months ago 0 0 0 0
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes…

machinelearning.apple.com/research/ill...

10 months ago 0 0 0 0
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and...

arxiv.org/abs/2502.19002

10 months ago 0 0 0 0