With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!
Let's take a trip down memory lane!
[1/N]
Posts by Guilherme Penedo
We will very soon announce a big community project, and are working on a blogpost walking you through the entire dataset creation process. Stay tuned!
The dataset is released under the permissive ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
Find out all about 🥂 FineWeb2 on the 🤗 dataset page:
huggingface.co/datasets/Hug...
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.