Posts by LLM360

GitHub - allenai/awesome-open-source-lms: Friends of OLMo and their links.

Made a list of resources for open source language models with @soldaini.net ahead of the tutorial tomorrow at 9:30 AM.
github.com/allenai/awes...

1 year ago 110 20 2 0

We've added you to the list!

1 year ago 6 0 0 0

We've added you to the list!

1 year ago 7 0 0 0

Can we join your list?

1 year ago 1 0 1 0

We've added you to the list!

1 year ago 0 0 0 0

Great, yes, added!

1 year ago 1 0 0 0

Thanks Stella! We've added eleuther to the list.

1 year ago 0 0 0 0

Thanks! We've added you to the list.

1 year ago 1 0 0 0

Open-source LLMs

We've made a starter pack for researchers and organizations working on open-source LLMs.

Please let us know if we missed you or if you'd like to be added!

go.bsky.app/FELkyDr

1 year ago 42 14 6 0

Thank you!

1 year ago 0 0 0 0

TxT360: Trillion Extracted Text - a Hugging Face Space by LLM360

🌍🌎 The global deduplication process was hairy 🙈, and we want to share every detail.

The TxT360 dedup pipeline can be recreated and used for other datasets. We include our tips and tricks in a tell-all write-up in the release blog:
llm360-txt360.hf.space
huggingface.co/spaces/LLM36...

1 year ago 1 1 0 0

Building on FineWeb's global deduplication findings, we introduce an upsampling recipe that, applied to TxT360, outperforms FineWeb. Full details are in the Upsampling Experiment section of the release blog.

1 year ago 3 1 1 0

🪟🛠️ LLM360 is committed to making open source AI accessible, transparent, and reproducible.

High-quality data is the first step toward better open source models, and we are excited to join the party by contributing the first globally deduplicated dataset containing 5.7T tokens!

1 year ago 1 1 1 0
Banner image showing the TxT360 project.

📒📒 Check out:

TxT360: a globally deduplicated dataset for LLM pretraining

🌐 99 Common Crawls
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models

Dataset:
huggingface.co/datasets/LLM...

Blog:
llm360-txt360.hf.space

1 year ago 5 1 1 0

Can we join?

1 year ago 1 0 1 0