I did spot that! Would be very curious how well it works! The rapid UMAP speed was very good and was basically just an env flag which is nice but this might still be better!
Posts by Daniel van Strien
Try it: huggingface.co/spaces/davan...
Scripts: huggingface.co/datasets/uv-...
Built with Embedding Atlas, GPU UMAP via cuml.accel, and HF Jobs + Buckets + Spaces
Screenshot of an Embedding Atlas visualization showing 2 million Open Library book titles as a scatter plot. Points are colored by subject category — History (orange), Fiction (green), Religion (purple), Law & Politics (red), and others. Clusters are labeled with topic keywords like "Church-Christian-Bible-God", "Shakespeare", "surgery-anatomy-physiology". A tooltip shows a selected book titled "Rosabelle" categorized as Society, published 1995. The right panel shows a category distribution bar chart.
2 million book titles visualised as an interactive Embedding Atlas. Fiction, History, Science — each forms its own cluster.
The trick: mount the same @hf.co Bucket to a GPU Job AND a Space. Job writes, Space reads. No uploads!
Night-vision camera trap image of a white-tailed deer in a forest. The deer is highlighted with a red segmentation mask overlay showing pixel-level object boundaries, with a green bounding box and "deer 96%" confidence label. Generated automatically by SAM3 from a text prompt with zero training data.
Segment any object in an image dataset with a text prompt — one command.
uv run segment-objects.py data output --class-name deer
Pixel-level masks via SAM3. Perfect for agents building their own training data.
Runs on @hf.co Jobs.
huggingface.co/datasets/uv-...
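For context, uv single-file scripts like this declare their dependencies inline (PEP 723), so `uv run` can fetch them on the fly. A sketch of the shape only — the argument names match the command above, but the inline dependency list and parser are my guesses, not the actual segment-objects.py:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["pillow"]  # placeholder; the real script needs the SAM3 stack
# ///
"""Sketch of the CLI shape of a segment-objects.py-style uv script."""
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Segment objects matching a text prompt"
    )
    parser.add_argument("input_dataset", help="source image dataset")
    parser.add_argument("output_dataset", help="where masks are written")
    parser.add_argument("--class-name", required=True, help="text prompt, e.g. 'deer'")
    return parser


# Parsing the command from the post; the real script would then load images,
# run the segmentation model, and write pixel-level masks to the output.
args = build_parser().parse_args(["data", "output", "--class-name", "deer"])
print(args.class_name)  # deer
```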
Very cool work! Happy to see this dataset continue to be used!
Wanted to share a personal project I've been working on: I love classic literature, and I'm also taking an AI class here at UIUC, so I trained an LLM from scratch on Victorian lit from 1837 to 1899. This is Mr. Chatterbox, the Victorian Gentleman Chatbot: huggingface.co/spaces/tvent...
You never know what data will be used for!
I uploaded a @britishlibrary.bsky.social dataset to Hugging Face in 2022. IIRC one of my first PRs to an HF repo!
4 years later, someone trains a Victorian chatbot on it
More libraries should be sharing their public domain collections for AI to build on!
Is olmOCR-bench getting close to saturation? Top score is now 85.9%.
Yesterday, Datalab took #1 with chandra-ocr-2. A year ago, the best score was 79%.
Visualised the race to get there using @hf.co leaderboard data
Interactive Embedding Atlas visualization of 250,000 training examples from NVIDIA's Nemotron post-training v3 collection, colored by category. Distinct clusters are visible for Math (blue, 56k), Code (orange, 45k), Agentic (green, 41k), Instruction Following (red, 25k), Finance (purple, 18k), Multilingual (brown, 18k), Science (pink, 18k), Safety (grey, 16k), and Identity (yellow, 8k). A selected point shows an example from the Agentic function-calling dataset: "I need to know the weight of the creature that can evolve into a Fire Dragon."
One of the nicest things about NVIDIA model releases is that they ship the training data.
What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.
Screenshot of Mirador IIIF viewer showing a panoramic map of Amesville, Athens County Ohio from 1875, served from a Hugging Face Storage Bucket. The map shows detailed bird's-eye view illustrations of buildings, streets, and surrounding farmland.
The new @hf.co storage Buckets open up the Hub beyond models and datasets.
Example: IIIF image hosting.
With Buckets, just upload static tiles and any IIIF viewer zooms straight from CDN!
cc @iiif.bsky.social :)
IIIF manifest (interoperable with any IIIF viewer): huggingface.co/buckets/dava...
UV script to generate tiles from your own images: huggingface.co/datasets/uv-...
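For anyone new to IIIF: "level 0" compliance means the viewer only ever fetches pre-rendered static files, which is exactly what a CDN-backed Bucket can serve. A minimal IIIF Image API 3.0 `info.json` for static tiles looks roughly like this — the `id`, dimensions, and tile layout here are invented, not copied from the real manifest:

```python
"""Sketch of a level-0 IIIF Image API 3.0 info.json for static tiles."""
import json

info = {
    "@context": "http://iiif.io/api/image/3/context.json",
    "id": "https://example.org/iiif/amesville-1875",  # hypothetical base URI
    "type": "ImageService3",
    "protocol": "http://iiif.io/api/image",
    "profile": "level0",  # static files only: no server-side resizing
    "width": 12000,       # made-up full-image dimensions
    "height": 9000,
    "tiles": [{"width": 512, "height": 512, "scaleFactors": [1, 2, 4, 8, 16]}],
}

print(json.dumps(info, indent=2))
```

A viewer reads this file, then requests tiles at the pre-baked paths — no image server involved.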
Code screenshot showing loading from the Hub, filtering, and pushing to a Bucket.
74GB of Dutch PDFs, filtered and written back to the Hub - without touching local disk!
Hub is your disk!
I built a PoC adding sink_parquet for @pola.rs to stream writes to @hf.co's new Storage Buckets via Xet. Constant memory, ~18 min on a 2-vCPU machine.
Some models can also predict bounding boxes for images, charts, etc. You'd still need the extra step of cropping the images out of the bounding boxes, but it can work quite well. dots.ocr is the one I've used most for this, i.e. this mode: huggingface.co/datasets/uv-...
Still an early experiment. Would love feedback on whether something like this would be useful for your work!
Early results across 3 test collections:
• Library card catalogs → LightOn #1
• Britannica 1771 → GLM #1
• Icelandic PDFs → dots.ocr #1
Different documents, different winners!
Example space: huggingface.co/spaces/davan...
Point it at any Hugging Face dataset: it launches OCR models, compares outputs pairwise using a VLM judge, and publishes an interactive leaderboard.
Inspired by datalab's benchmarks approach, but open source so you can run it on your own collections.
github.com/davanstrien/...
Screenshot of a plot showing ELO rating vs. parameter count for different OCR models.
There is no best VLM OCR model - rankings can flip completely by document type.
I built ocr-bench: run open OCR models on YOUR documents, get a per-collection leaderboard.
VLM-as-judge with Bradley-Terry ELO, all running on @hf.co. No local GPU needed.
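For the curious, pairwise judge verdicts can be turned into ratings with the classic Bradley-Terry fit. A toy sketch of the idea — not the ocr-bench code; the model names and verdicts below are invented:

```python
"""Toy Bradley-Terry fit from pairwise judge verdicts, on an Elo-like scale."""
import math
from collections import defaultdict


def bradley_terry(verdicts, iters=200):
    """verdicts: list of (winner, loser) pairs from the judge."""
    models = {m for pair in verdicts for m in pair}
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in verdicts:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
    strength = {m: 1.0 for m in models}
    for _ in range(iters):  # standard minorize-maximize update
        new = {}
        for i in models:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and pair_counts[frozenset((i, j))]
            )
            new[i] = wins[i] / denom if denom else strength[i]
        gm = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {m: v / gm for m, v in new.items()}  # fix the scale
    # Elo-like scale: 400 rating points per 10x strength ratio.
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}


verdicts = (
    [("A", "B")] * 3 + [("A", "C")] * 3  # A usually beats B and C
    + [("B", "A")] + [("B", "C")] * 2
    + [("C", "A"), ("C", "B")]
)
ratings = bradley_terry(verdicts)
print(sorted(ratings, key=ratings.get, reverse=True))  # ['A', 'B', 'C']
```

The point of Bradley-Terry over naive win counts: it accounts for *who* each model beat, so a win over a strong opponent is worth more.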
This sounds great!
That sounds great! IIRC screen readers tend to work okay with Markdown format? Might also be worth exploring www.docling.ai if you haven't already.
Screenshot of a search UI with a text box and search results showing index cards next to the OCR text for each card.
Is it worth re-OCR'ing old library index cards?
Re-OCR'd 453,000 cards from @bpl.boston.gov's rare books catalogue.
~$50 compute using @huggingface Jobs
BPL's own guide calls their search "extremely unreliable." Does better OCR + semantic search help fix it?
Demo space link below
Ran the same OCR models on 68 pages of historic newspaper. Every model hallucinated or looped.
DeepSeek-OCR-2, LightOnOCR-2, GLM-OCR – all melt down on dense newspaper columns.
You can try yourself using this @hf.co dataset: huggingface.co/datasets/dav...
Great to hear of some fresh eyes on this task! Think there is a lot that wasn't possible a few years ago that is now.
Looking forward to reading it! Looking forward to it even more if it comes with data 😛