
Posts by Alex Wettig

Presenting two posters at ICML over the next two days:
- Both at 11am–1:30pm
- Both about how to improve pre-training with domains
- Both at poster #E-2600 in East Exhibition Hall A-B (!)

Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social

9 months ago
GitHub - CodeCreator/WebOrganizer: Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

📜 Paper: arxiv.org/pdf/2502.10341
🌐 Website (feat. Domain Explorer): weborganizer.allen.ai
🤖 Models and Data: huggingface.co/WebOrganizer
💾 Code: github.com/CodeCreator...

w/ amazing co-authors @kylelo.bsky.social @sewonm.bsky.social @hanna-nlp.bsky.social @danqi-chen.bsky.social @soldaini.net

1 year ago

Our domains also shine a light on which types of content are implicitly upsampled when using quality filters!

💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
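Concretely, a filter's implicit upsampling of a domain can be measured by comparing the domain's share among the documents the filter keeps to its share in the raw pool. A minimal sketch with toy data (illustrative only, not the paper's code or numbers):

```python
from collections import Counter

def domain_upsampling(domains, keep_mask):
    """Ratio of each domain's share among filter-kept documents to its
    share in the full pool. Ratios > 1 mean the quality filter
    implicitly upsamples that domain."""
    all_counts = Counter(domains)
    kept_counts = Counter(d for d, keep in zip(domains, keep_mask) if keep)
    n_all, n_kept = len(domains), sum(keep_mask)
    return {
        d: (kept_counts[d] / n_kept) / (all_counts[d] / n_all)
        for d in all_counts
    }

# Toy pool: a filter that keeps 2/3 of Science docs but only 1/3 of Sports.
domains = ["Science"] * 3 + ["Sports"] * 3
keep = [True, True, False, True, False, False]
ratios = domain_upsampling(domains, keep)  # Science > 1, Sports < 1
```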

1 year ago

Instead of sampling from the domains, we can also pick the best documents according to quality filters, which improves the overall performance of two strong quality filters.

✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
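One hedged sketch of the selection idea (helper names and selection details are illustrative, not the paper's implementation): derive per-domain quotas from the target mixture, then fill each quota with the highest-scoring documents under a quality filter.

```python
import heapq

def select_by_domain(docs, domain_weights, budget):
    """Fill per-domain quotas (from a target mixture) with the
    highest-quality documents in each domain.

    docs: (domain, quality_score, text) tuples, e.g. scored by a
    quality classifier. Hypothetical helper for illustration.
    """
    selected = []
    for domain, weight in domain_weights.items():
        pool = [doc for doc in docs if doc[0] == domain]
        quota = round(weight * budget)
        # Top-scoring documents within this domain only.
        selected += heapq.nlargest(quota, pool, key=lambda doc: doc[1])
    return selected

docs = [
    ("Science", 0.9, "a"), ("Science", 0.4, "b"),
    ("Sports", 0.8, "c"), ("Sports", 0.7, "d"), ("Sports", 0.1, "e"),
]
picked = select_by_domain(docs, {"Science": 0.5, "Sports": 0.5}, budget=2)
```

This is what lets domain mixing calibrate the distribution while the filter still ranks documents within each domain.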

1 year ago

We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.

And we can combine the topic and format predictions to curate data with even better performance! 📈

1 year ago

How useful are these domains for data curation in practice?

We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality".

Prediction: Heavily upsample domains such as Science or Tutorials!
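The RegMix recipe can be sketched as: train many small proxy runs on random domain mixtures, fit a regression from mixture weights to downstream loss, then search for the mixture with the lowest predicted loss. A toy simulation (made-up coefficients and noise, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains, n_runs = 4, 64

# Simulated proxy experiments: each small run trains on a random domain
# mixture; domain 0 (say, Science) helps the downstream task most.
true_effect = np.array([-1.0, -0.2, 0.3, 0.1])
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)
proxy_loss = 2.0 + mixtures @ true_effect + rng.normal(0, 0.01, n_runs)

# Fit a regression from mixture weights to loss. Mixture rows sum to 1,
# so the base loss of 2.0 is absorbed into the per-domain coefficients.
coef, *_ = np.linalg.lstsq(mixtures, proxy_loss, rcond=None)

# Search the simplex for the mixture with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
best = candidates[np.argmin(candidates @ coef)]  # heavily weights domain 0
```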

1 year ago

We distill the LLM outputs into small domain classifiers to annotate data at scale!

Interesting finding: our topics and formats co-occur almost independently!
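The independence finding can be checked by comparing the joint topic–format distribution to the product of its marginals. A toy illustration (made-up annotations, constructed to be exactly independent):

```python
from collections import Counter

# Hypothetical (topic, format) annotations, built so the two axes are
# exactly independent: P(topic, format) = P(topic) * P(format).
labels = (
    [("Science", "Tutorial")] * 9
    + [("Science", "News")] * 21
    + [("Sports", "Tutorial")] * 21
    + [("Sports", "News")] * 49
)

n = len(labels)
joint = Counter(labels)
topics = Counter(t for t, _ in labels)
formats = Counter(f for _, f in labels)

# Largest deviation of the joint distribution from the product of marginals;
# near-zero means the two label axes co-occur (almost) independently.
max_gap = max(
    abs(joint[(t, f)] / n - (topics[t] / n) * (formats[f] / n))
    for t in topics
    for f in formats
)
```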

1 year ago

Modern pre-training relies on crawling the web to collect trillions of tokens

We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages

๐Ÿ” Explore our domains and see examples at weborganizer.allen.ai

1 year ago

🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N

1 year ago

Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy on individual tasks by OLMo 2 7B & 13B models within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
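A toy version of the two-step idea (made-up numbers and functional forms standing in for the fitted ladders): first fit task loss vs. pre-training compute on small models, then map predicted loss to task accuracy with a link function.

```python
import math

# Step 1: task loss vs. compute on small "ladder" models
# (made-up points; the actual work fits richer functional forms).
ladder = [(1e18, 3.2), (4e18, 2.9), (1.6e19, 2.6)]  # (FLOPs, task loss)

# loss ~= a * C^(-b) is linear in log-log space, so fit a line there.
xs = [math.log(c) for c, _ in ladder]
ys = [math.log(l) for _, l in ladder]
n = len(ladder)
b = -(n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2
)
log_a = (sum(ys) + b * sum(xs)) / n

def predict_loss(flops):
    return math.exp(log_a - b * math.log(flops))

# Step 2: map predicted task loss to accuracy with a sigmoidal link
# (made-up parameters standing in for a fitted curve).
def predict_accuracy(flops, mid=2.5, scale=0.5):
    return 1.0 / (1.0 + math.exp((predict_loss(flops) - mid) / scale))
```

The two-step structure is the point: extrapolate the cheap-to-fit loss curve, then translate loss into the task metric you actually care about.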

1 year ago