Made a list of resources for open source language models with @soldaini.net ahead of the tutorial tomorrow at 9:30 AM.
github.com/allenai/awes...
Posts by LLM360
We've added you to the list!
We've added you to the list!
Can we join your list?
We've added you to the list!
Great, yes, added!
Thanks Stella! We've added eleuther to the list.
Thanks! We've added you to the list.
We've made a starter pack for researchers/organizations working on open-source LLMs.
Please let us know if we missed you or if you'd like to be added!
go.bsky.app/FELkyDr
Thank you!
The global deduplication process was hairy, and we want to share every detail.
The TxT360 dedup pipeline can be recreated and used for other datasets. We include our tips and tricks in a tell-all write up in the release blog:
llm360-txt360.hf.space
huggingface.co/spaces/LLM36...
Building on FineWeb's global deduplication findings, we introduce a strategic upsampling recipe which outperforms FineWeb using TxT360. Full details are in the Upsampling Experiment section of the release blog.
LLM360 is committed to making open source AI accessible, transparent, and reproducible.
High-quality data is the first step toward better open source models, and we are excited to join the party by contributing the first globally deduplicated dataset containing 5.7T tokens!
Banner image showing the TxT360 project.
Check out:
TxT360: a globally deduplicated dataset for LLM pretraining
99 Common Crawls
14 Curated Sources
A recipe to easily adjust data weighting and train the most performant models
Dataset:
huggingface.co/datasets/LLM...
Blog:
llm360-txt360.hf.space
Can we join?