
Posts by Anton

Open R1: Update #2, a blog post by Open R1 on Hugging Face

Stay tuned for more Open R1 updates!

huggingface.co/blog/open-r1...

1 year ago 1 0 0 0
open-r1/OpenR1-Math-Raw · Datasets at Hugging Face

🤗 Dataset: huggingface.co/datasets/ope...

1 year ago 1 0 1 0

LLM Reasoning labs will be eating good today🍔

We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!

1 year ago 22 3 2 2
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath

Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵

1 year ago 46 15 2 1
FineMath - The finest collection of mathematical content

We hope this dataset helps advance the performance of LLMs on Math 🚀 We’re also releasing all the ablation models in this collection, as well as the evaluation code.

Collection: huggingface.co/collections/...

Evaluation: github.com/huggingface/...

1 year ago 1 0 0 0

Below is the breakdown of the performance of each data source after decontamination: FineMath 4+ outperforms all other datasets when doing continued pre-training of Llama-3.2-3B-Base on 60B tokens.

1 year ago 2 0 1 0

This gave us two high-quality datasets, with 34B and 10B tokens depending on the filtering threshold (3 vs. 4).

We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
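A minimal sketch of the threshold split described above, assuming each page carries an integer math-quality score from the classifier (the field names and example scores here are invented, not the actual FineMath pipeline):

```python
# Hypothetical sketch: split classifier-scored pages into two dataset
# variants by quality threshold. Scores and field names are assumptions.
def split_by_threshold(pages, lo=3, hi=4):
    """Return (threshold-3 set, threshold-4 set) of pages.

    `pages` is an iterable of dicts with a "score" field holding the
    classifier's integer math-quality rating (e.g. 0-5).
    """
    fine_3plus = [p for p in pages if p["score"] >= lo]
    fine_4plus = [p for p in pages if p["score"] >= hi]
    return fine_3plus, fine_4plus

pages = [
    {"url": "a", "score": 5},
    {"url": "b", "score": 3},
    {"url": "c", "score": 1},
]
f3, f4 = split_by_threshold(pages)
# f3 keeps pages "a" and "b"; f4 keeps only page "a"
```

Note that with this scheme the stricter 4+ set is always a subset of the 3+ set, which matches the smaller token count of the higher-quality variant.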

1 year ago 1 0 1 0

For the text extraction, we switched to Resiliparse with OWM’s pipeline.
We then trained a classifier on Llama 3’s annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.

huggingface.co/HuggingFaceT...

1 year ago 1 0 1 0

💡It was time to re-extract the Common Crawl data directly.

We retrieved pages from FineWeb’s URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces ("{}"), a common LaTeX pattern.
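A toy illustration of recalling LaTeX-bearing pages that a quality filter dropped. The patterns below (backslash commands with braces, inline `$...$` math) are illustrative assumptions, not the exact FineWeb rule:

```python
import re

# Toy heuristic: a page that was filtered out but contains LaTeX-style
# markup is recalled as a candidate math page. Patterns are illustrative.
LATEX_PATTERN = re.compile(r"\\[a-zA-Z]+\{[^}]*\}|\$[^$]+\$")

def recall_math_pages(removed_pages):
    """Return pages from the removed set that look like they contain math."""
    return [p for p in removed_pages if LATEX_PATTERN.search(p)]

removed = [
    "We prove that \\frac{a}{b} converges.",
    "Buy cheap widgets online now!",
    "Energy is $E = mc^2$ in natural units.",
]
recalled = recall_math_pages(removed)
# recalls the first and third page, not the widget ad
```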

1 year ago 1 0 1 0

Turns out math formatting is very important: our FineWebMath data was worse than OWM.

The classifier was mostly retrieving academic papers because math forums weren’t properly extracted with Trafilatura, and most equations needed better formatting.

1 year ago 1 0 1 0

For FineMath, we first tried starting directly from FineWeb. Although we didn’t tailor FineWeb’s text extraction for math, the data retained enough equations.

Then we trained a fastText classifier to retrieve OWM-like data.
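fastText’s supervised mode expects one example per line, prefixed with a `__label__` tag. A minimal sketch of preparing such a training file (labels and example texts are invented for illustration):

```python
# Sketch: build fastText-format training lines for a math-vs-other
# classifier. fastText's supervised input format is one example per
# line, "__label__<name> <text>". The examples here are made up.
def to_fasttext_lines(examples):
    lines = []
    for text, label in examples:
        # fastText treats newlines as example separators, so flatten them
        flat = " ".join(text.split())
        lines.append(f"__label__{label} {flat}")
    return lines

examples = [
    ("Solve for x: 2x + 3 = 11.", "math"),
    ("Top ten travel destinations\nfor summer.", "other"),
]
lines = to_fasttext_lines(examples)
# lines[0] == "__label__math Solve for x: 2x + 3 = 11."
```

With such a file written to disk, a model would then be trained with `fasttext.train_supervised(input="train.txt")` and used to score candidate pages.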

1 year ago 0 0 1 0

The Llama 3 team trained a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we trained classifiers on synthetic web annotations.

The authors also highlight a specialized HTML math extractor that preserves equations.

1 year ago 1 0 1 0

First let’s break down how AI labs curate math pre-training datasets 🕵️

DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
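The iterate-and-recall loop might be sketched like this (a pure-Python toy: the scoring function stands in for a trained fastText model, and thresholds and data structures are assumptions):

```python
from collections import Counter
from urllib.parse import urlparse

# Toy sketch of domain-focused recall: score pages with a classifier,
# find the domains where math content concentrates, then recall *all*
# pages from those domains, including ones the classifier missed.
def top_math_domains(pages, score_fn, threshold=0.5, k=2):
    hits = Counter(
        urlparse(p["url"]).netloc
        for p in pages
        if score_fn(p["text"]) >= threshold
    )
    return {domain for domain, _ in hits.most_common(k)}

def recall_from_domains(pages, domains):
    return [p for p in pages if urlparse(p["url"]).netloc in domains]

pages = [
    {"url": "https://mathforum.org/a", "text": "prove the lemma"},
    {"url": "https://mathforum.org/b", "text": "cooking tips"},
    {"url": "https://news.example.com/c", "text": "election results"},
]
score = lambda t: 1.0 if "prove" in t or "lemma" in t else 0.0
domains = top_math_domains(pages, score, k=1)
recalled = recall_from_domains(pages, domains)
# recalled includes both mathforum.org pages, even the low-scoring one
```

In the real pipelines, the recalled pages would feed back into the classifier’s training set, and the filter-then-recall loop would repeat over Common Crawl.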

1 year ago 1 0 1 0

The Open LLM Leaderboard got a new front page for Christmas

Check it out at huggingface.co/spaces/open-...

1 year ago 66 12 2 0

Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.

1 year ago 76 19 1 0

Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

SmolVLM can be fine-tuned in a Google Colab and run on a laptop, or process millions of documents on a consumer GPU!

1 year ago 104 22 4 4

Small yet mighty! 💫

We are releasing SmolVLM: a new small 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory-efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base huggingface.co/collections/...

1 year ago 158 27 11 4
smollm/evaluation at main · huggingface/smollm

Repo: github.com/huggingface/...

Here's how we use it for SmolLM 🤏
github.com/huggingface/...

1 year ago 6 0 1 0
A screenshot of LightEval benchmarking results in a terminal

Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend

1 year ago 76 10 2 1
GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models

Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?

1 year ago 59 10 2 0
HuggingFaceTB/smoltalk · Datasets at Hugging Face

Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!

The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.

Check out the dataset:
huggingface.co/datasets/Hug...

1 year ago 24 8 1 1

10x followers in the past week, I guess it's happening!

1 year ago 3 0 0 0