Posts by Guilherme Penedo

🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]

1 year ago 33 10 4 4
We will very soon announce a big community project, and are working on a 📝 blog post walking you through the entire dataset creation process. Stay tuned!

1 year ago 6 0 1 0
HuggingFaceFW/fineweb-2 · Datasets at Hugging Face: We're on a journey to advance and democratize artificial intelligence through open source and open science.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

Find out all about 🥂 FineWeb2 on the 🤗 dataset page:
huggingface.co/datasets/Hug...

1 year ago 4 0 1 0
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.

1 year ago 76 19 1 0