Presenting two posters at ICML over the next two days:
- Both run 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)
Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social
Posts by Alex Wettig
Paper: arxiv.org/pdf/2502.10341
Website (feat. Domain Explorer): weborganizer.allen.ai
Models and Data: huggingface.co/WebOrganizer
Code: github.com/CodeCreator...
w/ amazing co-authors @kylelo.bsky.social @sewonm.bsky.social @hanna-nlp.bsky.social @danqi-chen.bsky.social @soldaini.net
Our domains also shine a light on which type of content is implicitly upsampled when using quality filters!
FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
Instead of sampling uniformly within domains, we can pick the best documents in each domain according to quality filters, which improves the overall performance of two strong quality filters.
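In sketch form, combining the two looks like: fix a per-domain budget from the target mixture, then fill each budget with the top documents under a quality filter. The documents, scores, and budgets below are made up for illustration, not the released pipeline.

```python
# Hypothetical documents: (domain, quality_score) pairs.
docs = [
    ("Science", 0.9), ("Science", 0.4), ("Science", 0.7),
    ("Sports", 0.8), ("Sports", 0.2),
]

# Assumed per-domain budgets from the target mixture (not the paper's weights).
budget = {"Science": 2, "Sports": 1}  # documents to keep per domain

selected = []
for domain, n in budget.items():
    pool = [d for d in docs if d[0] == domain]
    # Within each domain, keep the top-n documents by quality score.
    pool.sort(key=lambda d: d[1], reverse=True)
    selected += pool[:n]

print(selected)  # → [('Science', 0.9), ('Science', 0.7), ('Sports', 0.8)]
```

The point of the two-stage design: the mixture controls the domain distribution, while the filter only ranks documents *within* a domain, so quality filtering can no longer skew the topic balance.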
Domain mixing complements quality filtering by calibrating the training distribution!
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.
And we can combine the topic and format predictions to curate data with even better performance!
How useful are these domains for data curation in practice?
We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality".
Prediction: Heavily upsample domains such as Science or Tutorials!
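The RegMix-style recipe can be sketched as: train cheap runs on random domain mixtures, regress the proxy-task score on the mixture weights, then read off which domains the fit wants upsampled. The toy version below uses synthetic scores in place of real training runs (the `true_benefit` values are invented, not the paper's numbers).

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains = 4

# Hypothetical ground truth: how much each domain helps the proxy task.
# In RegMix this signal comes from small models trained on random mixtures.
true_benefit = np.array([0.5, 2.0, 0.1, 1.2])

# 1) Sample random domain mixtures and observe a (noisy) proxy-task score.
mixtures = rng.dirichlet(np.ones(n_domains), size=64)
scores = mixtures @ true_benefit + rng.normal(0, 0.01, size=64)

# 2) Fit a regression from mixture weights to score.
coef, *_ = np.linalg.lstsq(mixtures, scores, rcond=None)

# 3) The fitted coefficients say which domains to upsample.
print(coef.argmax())  # → 1, the domain with the largest true benefit
```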
We distill the LLM outputs into small domain classifiers to annotate data at scale!
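At its core, distilling an LLM annotator into a small classifier is just supervised training on the LLM's labels. A minimal stand-in with a bag-of-words centroid classifier (the texts, labels, and model here are toy examples; the actual classifiers and features differ):

```python
from collections import Counter, defaultdict

# Hypothetical (page_text, llm_label) pairs standing in for LLM annotations.
llm_annotations = [
    ("install the package and run the script", "Tutorials"),
    ("step by step guide to configure the server", "Tutorials"),
    ("the experiment measured particle decay rates", "Science"),
    ("researchers observed the galaxy with a telescope", "Science"),
]

# Train: accumulate a word-count centroid per LLM-assigned label.
centroids = defaultdict(Counter)
for text, label in llm_annotations:
    centroids[label].update(text.split())

def classify(text):
    # Score each label by word overlap with its centroid; return the best.
    words = text.split()
    return max(centroids, key=lambda lab: sum(centroids[lab][w] for w in words))

print(classify("a guide to install the server"))  # → Tutorials
```

Once distilled, the small model can cheaply annotate billions of pages that would be far too expensive to run through the LLM directly.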
Interesting finding: our topics and formats co-occur almost independently!
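A quick way to quantify this kind of approximate independence: compare the joint topic×format distribution against the product of its marginals. The counts below are fabricated (and deliberately proportional), not the paper's data.

```python
import numpy as np

# Hypothetical topic x format co-occurrence counts (NOT the paper's data).
counts = np.array([
    [40.0, 10.0, 50.0],   # topic A across 3 formats
    [20.0,  5.0, 25.0],   # topic B
    [12.0,  3.0, 15.0],   # topic C
])

joint = counts / counts.sum()                 # empirical joint P(topic, format)
indep = np.outer(joint.sum(1), joint.sum(0))  # product of the two marginals

# Total variation distance between the joint and its independent
# approximation; near 0 means topic and format co-occur ~independently.
tv = 0.5 * np.abs(joint - indep).sum()
print(round(tv, 4))  # → 0.0 for these proportional toy counts
```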
Modern pre-training relies on crawling the web to collect trillions of tokens
We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages
Explore our domains and see examples at weborganizer.allen.ai
Ever wondered how prevalent some type of web content is during LM pre-training?
In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages
Key takeaway: domains help us curate better pre-training data! 1/N