Presenting two posters at ICML over the next two days:
- Both run 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)
Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social
Posts by Alex Wettig
Paper: arxiv.org/pdf/2502.10341
Website (feat. Domain Explorer): weborganizer.allen.ai
Models and Data: huggingface.co/WebOrganizer
Code: github.com/CodeCreator...
w/ amazing co-authors @kylelo.bsky.social @sewonm.bsky.social @hanna-nlp.bsky.social @danqi-chen.bsky.social @soldaini.net
Our domains also shine a light on which type of content is implicitly upsampled when using quality filters!
FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
Instead of sampling uniformly within domains, we can pick the best documents in each domain according to quality filters, which improves the overall performance of two strong quality filters.
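In sketch form, combining the two looks like: fix a per-domain budget from the target mixture, then fill each budget with the top documents under a quality filter. The documents, scores, and budgets below are made up for illustration, not the released pipeline.

```python
# Hypothetical documents: (domain, quality_score) pairs.
docs = [
    ("Science", 0.9), ("Science", 0.4), ("Science", 0.7),
    ("Sports", 0.8), ("Sports", 0.2),
]

# Assumed per-domain budgets from the target mixture (not the paper's weights).
budget = {"Science": 2, "Sports": 1}  # documents to keep per domain

selected = []
for domain, n in budget.items():
    pool = [d for d in docs if d[0] == domain]
    # Within each domain, keep the top-n documents by quality score.
    pool.sort(key=lambda d: d[1], reverse=True)
    selected += pool[:n]

print(selected)  # → [('Science', 0.9), ('Science', 0.7), ('Sports', 0.8)]
```

The point of the two-stage design: the mixture controls the domain distribution, while the filter only ranks documents *within* a domain, so quality filtering can no longer skew the topic balance.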
Domain mixing complements quality filtering by calibrating the training distribution!
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.
And we can combine the topic and format predictions to curate data with even better performance!
How useful are these domains for data curation in practice?
We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality".
Prediction: Heavily upsample domains such as Science or Tutorials!
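The RegMix-style recipe can be sketched as: train cheap runs on random domain mixtures, regress the proxy-task score on the mixture weights, then read off which domains the fit wants upsampled. The toy version below uses synthetic scores in place of real training runs (the `true_benefit` values are invented, not the paper's numbers).

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains = 4

# Hypothetical ground truth: how much each domain helps the proxy task.
# In RegMix this signal comes from small models trained on random mixtures.
true_benefit = np.array([0.5, 2.0, 0.1, 1.2])

# 1) Sample random domain mixtures and observe a (noisy) proxy-task score.
mixtures = rng.dirichlet(np.ones(n_domains), size=64)
scores = mixtures @ true_benefit + rng.normal(0, 0.01, size=64)

# 2) Fit a regression from mixture weights to score.
coef, *_ = np.linalg.lstsq(mixtures, scores, rcond=None)

# 3) The fitted coefficients say which domains to upsample.
print(coef.argmax())  # → 1, the domain with the largest true benefit
```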
We distill the LLM outputs into small domain classifiers to annotate data at scale!
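At its core, distilling an LLM annotator into a small classifier is just supervised training on the LLM's labels. A minimal stand-in with a bag-of-words centroid classifier (the texts, labels, and model here are toy examples; the actual classifiers and features differ):

```python
from collections import Counter, defaultdict

# Hypothetical (page_text, llm_label) pairs standing in for LLM annotations.
llm_annotations = [
    ("install the package and run the script", "Tutorials"),
    ("step by step guide to configure the server", "Tutorials"),
    ("the experiment measured particle decay rates", "Science"),
    ("researchers observed the galaxy with a telescope", "Science"),
]

# Train: accumulate a word-count centroid per LLM-assigned label.
centroids = defaultdict(Counter)
for text, label in llm_annotations:
    centroids[label].update(text.split())

def classify(text):
    # Score each label by word overlap with its centroid; return the best.
    words = text.split()
    return max(centroids, key=lambda lab: sum(centroids[lab][w] for w in words))

print(classify("a guide to install the server"))  # → Tutorials
```

Once distilled, the small model can cheaply annotate billions of pages that would be far too expensive to run through the LLM directly.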
Interesting finding: our topics and formats co-occur almost independently!
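A quick way to quantify this kind of approximate independence: compare the joint topic×format distribution against the product of its marginals. The counts below are fabricated (and deliberately proportional), not the paper's data.

```python
import numpy as np

# Hypothetical topic x format co-occurrence counts (NOT the paper's data).
counts = np.array([
    [40.0, 10.0, 50.0],   # topic A across 3 formats
    [20.0,  5.0, 25.0],   # topic B
    [12.0,  3.0, 15.0],   # topic C
])

joint = counts / counts.sum()                 # empirical joint P(topic, format)
indep = np.outer(joint.sum(1), joint.sum(0))  # product of the two marginals

# Total variation distance between the joint and its independent
# approximation; near 0 means topic and format co-occur ~independently.
tv = 0.5 * np.abs(joint - indep).sum()
print(round(tv, 4))  # → 0.0 for these proportional toy counts
```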
Modern pre-training relies on crawling the web to collect trillions of tokens
We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages
Explore our domains and see examples at weborganizer.allen.ai
Ever wondered how prevalent some type of web content is during LM pre-training?
In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages
Key takeaway: domains help us curate better pre-training data! 1/N