I did spot that! Would be very curious how well it works! The rapid UMAP speed was very good and was basically just an env flag which is nice but this might still be better!
Posts by Daniel van Strien
Try it: huggingface.co/spaces/davan...
Scripts: huggingface.co/datasets/uv-...
Built with Embedding Atlas, GPU UMAP via cuml.accel, and HF Jobs + Buckets + Spaces
Screenshot of an Embedding Atlas visualization showing 2 million Open Library book titles as a scatter plot. Points are colored by subject category — History (orange), Fiction (green), Religion (purple), Law & Politics (red), and others. Clusters are labeled with topic keywords like "Church-Christian-Bible-God", "Shakespeare", "surgery-anatomy-physiology". A tooltip shows a selected book titled "Rosabelle" categorized as Society, published 1995. The right panel shows a category distribution bar chart.
2 million book titles visualised as an interactive Embedding Atlas. Fiction, History, Science — each forms its own cluster.
The trick: mount the same @hf.co Bucket to a GPU Job AND a Space. Job writes, Space reads. No uploads!
Night-vision camera trap image of a white-tailed deer in a forest. The deer is highlighted with a red segmentation mask overlay showing pixel-level object boundaries, with a green bounding box and "deer 96%" confidence label. Generated automatically by SAM3 from a text prompt with zero training data.
Segment any object in an image dataset with a text prompt — one command.
uv run segment-objects.py data output --class-name deer
Pixel-level masks via SAM3. Perfect for agents building their own training data.
Runs on @hf.co Jobs.
huggingface.co/datasets/uv-...
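For context, uv single-file scripts like this declare their dependencies inline (PEP 723), so `uv run` can fetch them on the fly. A sketch of the shape only — the argument names match the command above, but the inline dependency list and parser are my guesses, not the actual segment-objects.py:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["pillow"]  # placeholder; the real script needs the SAM3 stack
# ///
"""Sketch of the CLI shape of a segment-objects.py-style uv script."""
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Segment objects matching a text prompt"
    )
    parser.add_argument("input_dataset", help="source image dataset")
    parser.add_argument("output_dataset", help="where masks are written")
    parser.add_argument("--class-name", required=True, help="text prompt, e.g. 'deer'")
    return parser


# Parsing the command from the post; the real script would then load images,
# run the segmentation model, and write pixel-level masks to the output.
args = build_parser().parse_args(["data", "output", "--class-name", "deer"])
print(args.class_name)  # deer
```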
Very cool work! Happy to see this dataset continue to be used!
Wanted to share a personal project I've been working on: I love classic literature, and I'm also taking an AI class here at UIUC, so I trained an LLM from scratch on Victorian lit from 1837 to 1899. This is Mr. Chatterbox, the Victorian Gentleman Chatbot: huggingface.co/spaces/tvent...
You never know what data will be used for!
I uploaded a @britishlibrary.bsky.social dataset to Hugging Face in 2022. IIRC one of my first PRs to an HF repo!
4 years later, someone trains a Victorian chatbot on it
More libraries should be sharing their public domain collections for AI to build on!
Is olmOCR-bench getting close to saturation? Top score is now 85.9%.
Yesterday, Datalab took #1 with chandra-ocr-2. A year ago, the best score was 79%.
Visualised the race to get there using @hf.co leaderboard data
Interactive Embedding Atlas visualization of 250,000 training examples from NVIDIA's Nemotron post-training v3 collection, colored by category. Distinct clusters are visible for Math (blue, 56k), Code (orange, 45k), Agentic (green, 41k), Instruction Following (red, 25k), Finance (purple, 18k), Multilingual (brown, 18k), Science (pink, 18k), Safety (grey, 16k), and Identity (yellow, 8k). A selected point shows an example from the Agentic function-calling dataset: "I need to know the weight of the creature that can evolve into a Fire Dragon."
One of the nicest things about NVIDIA model releases is that they ship the training data.
What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.
Screenshot of Mirador IIIF viewer showing a panoramic map of Amesville, Athens County Ohio from 1875, served from a Hugging Face Storage Bucket. The map shows detailed bird's-eye view illustrations of buildings, streets, and surrounding farmland.
The new @hf.co storage Buckets open up the Hub beyond models and datasets.
Example: IIIF image hosting.
With Buckets, just upload static tiles and any IIIF viewer zooms straight from CDN!
cc @iiif.bsky.social :)
IIIF manifest (interoperable with any IIIF viewer): huggingface.co/buckets/dava...
UV script to generate tiles from your own images: huggingface.co/datasets/uv-...
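For anyone new to IIIF: "level 0" compliance means the viewer only ever fetches pre-rendered static files, which is exactly what a CDN-backed Bucket can serve. A minimal IIIF Image API 3.0 `info.json` for static tiles looks roughly like this — the `id`, dimensions, and tile layout here are invented, not copied from the real manifest:

```python
"""Sketch of a level-0 IIIF Image API 3.0 info.json for static tiles."""
import json

info = {
    "@context": "http://iiif.io/api/image/3/context.json",
    "id": "https://example.org/iiif/amesville-1875",  # hypothetical base URI
    "type": "ImageService3",
    "protocol": "http://iiif.io/api/image",
    "profile": "level0",  # static files only: no server-side resizing
    "width": 12000,       # made-up full-image dimensions
    "height": 9000,
    "tiles": [{"width": 512, "height": 512, "scaleFactors": [1, 2, 4, 8, 16]}],
}

print(json.dumps(info, indent=2))
```

A viewer reads this file, then requests tiles at the pre-baked paths — no image server involved.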
Code screenshot showing loading from the Hub, filtering, and pushing to a Bucket.
74GB of Dutch PDFs, filtered and written back to the Hub - without touching local disk!
Hub is your disk!
I built a PoC adding sink_parquet for @pola.rs to stream writes to @hf.co's new Storage Buckets via Xet. Constant memory, ~18 min on a 2-vCPU machine.
Some models can also predict bounding boxes for images, charts, etc. You'd still need the extra step of cropping the images out of the bounding boxes, but it can work quite well. dots.ocr is the one I've used most for this, i.e. this mode: huggingface.co/datasets/uv-...
Still an early experiment. Would love feedback on whether something like this would be useful for your work!
Early results across 3 test collections:
• Library card catalogs → LightOn #1
• Britannica 1771 → GLM #1
• Icelandic PDFs → dots.ocr #1
Different documents, different winners!
Example space: huggingface.co/spaces/davan...
Point it at any Hugging Face dataset: it launches OCR models, compares outputs pairwise using a VLM judge, and publishes an interactive leaderboard.
Inspired by datalab's benchmarks approach, but open source so you can run it on your own collections.
github.com/davanstrien/...
Screenshot of a plot showing ELO rating vs. parameter count for different OCR models.
There is no best VLM OCR model - rankings can flip completely by document type.
I built ocr-bench: run open OCR models on YOUR documents, get a per-collection leaderboard.
VLM-as-judge with Bradley-Terry ELO, all running on @hf.co. No local GPU needed.
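For the curious, pairwise judge verdicts can be turned into ratings with the classic Bradley-Terry fit. A toy sketch of the idea — not the ocr-bench code; the model names and verdicts below are invented:

```python
"""Toy Bradley-Terry fit from pairwise judge verdicts, on an Elo-like scale."""
import math
from collections import defaultdict


def bradley_terry(verdicts, iters=200):
    """verdicts: list of (winner, loser) pairs from the judge."""
    models = {m for pair in verdicts for m in pair}
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in verdicts:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
    strength = {m: 1.0 for m in models}
    for _ in range(iters):  # standard minorize-maximize update
        new = {}
        for i in models:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and pair_counts[frozenset((i, j))]
            )
            new[i] = wins[i] / denom if denom else strength[i]
        gm = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {m: v / gm for m, v in new.items()}  # fix the scale
    # Elo-like scale: 400 rating points per 10x strength ratio.
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}


verdicts = (
    [("A", "B")] * 3 + [("A", "C")] * 3  # A usually beats B and C
    + [("B", "A")] + [("B", "C")] * 2
    + [("C", "A"), ("C", "B")]
)
ratings = bradley_terry(verdicts)
print(sorted(ratings, key=ratings.get, reverse=True))  # ['A', 'B', 'C']
```

The point of Bradley-Terry over naive win counts: it accounts for *who* each model beat, so a win over a strong opponent is worth more.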
This sounds great!
That sounds great! IIRC screen readers tend to work okay with Markdown format? Might also be worth exploring www.docling.ai if you haven't already.
Screenshot of a search UI with a text box and search results showing index cards next to the OCR text for each card.
Is it worth re-OCR'ing old library index cards?
Re-OCR'd 453,000 cards from @bpl.boston.gov's rare books catalogue.
~$50 compute using @huggingface Jobs
BPL's own guide calls their search "extremely unreliable." Does better OCR + semantic search help fix it?
Demo space link below
Ran the same OCR models on 68 pages of historic newspaper. Every model hallucinated or looped.
DeepSeek-OCR-2, LightOnOCR-2, GLM-OCR – all melt down on dense newspaper columns.
You can try yourself using this @hf.co dataset: huggingface.co/datasets/dav...
Great to hear of some fresh eyes on this task! Think there is a lot that wasn't possible a few years ago that is now.
Looking forward to reading it! Looking forward to it even more if it comes with data 😛