We’ve signed the #OpenHeritageStatement to support a global call for equitable access to #PublicDomain heritage in the digital environment!
Discover more about the Statement, how it links to our advocacy for open #CulturalHeritage and how you can sign ➡️ bit.ly/4sB1Ur0
@creativecommons.bsky.social
Posts by Sebastian Majstorovic
After Trump took office last year, the US Holocaust Museum quietly removed educational material about American racism from its website and canceled a workshop on the "fragility of democracy," @iriesentner.bsky.social reports. www.politico.com/news/2026/04...
You never know what data will be used for!
I uploaded a @britishlibrary.bsky.social dataset to Hugging Face in 2022. IIRC it was one of my first PRs to a HF repo!
4 years later, someone trains a Victorian chatbot on it
More libraries should be sharing their public domain collections for AI to build on!
📣 New historical data visualization! "How Fast Was the Mail?" is an interactive map showing how long information took to travel across the US between 1882 and 1908: cblevins.github.io/mail-time/
I’m at #HSP2026 at MIT this week! I’ll be giving a talk Friday at 5:25pm entitled “Structural Priming Effects in Language Models are Less Human-like in Languages Other Than English”. Looking forward to chatting to everyone!
Interesting new piece from @iatp.bsky.social showing how confidence in USDA data is eroding under the Trump Administration, with serious consequences. www.iatp.org/farming-risk...
New DRP post: #Philly interpretive panels returned to the President's House. We congratulate the local community that fought HARD for their return. We hope this inspires other communities to #SaveOurSigns
www.datarescueproject.org/independence...
Decorative images with a screenshot of tiny.iiif in the background, and image server choices (as text labels) in the foreground: IIPImage, Cantaloupe.
Small update to #tinyIIIF, my no-nonsense #IIIF server! You can now choose your image server during setup:
• IIPImage
• Cantaloupe
Running small- to mid-sized collections? Teaching with IIIF materials? Building IIIF-enabled tools? Check out tiny.iiif!
github.com/rsimon/tiny-...
#DigitalFriday
Screenshot of old vs. new OCR. The old OCR text is garbled; the new OCR is much cleaner.
Re-OCR'd the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @hf.co Jobs.
- 0.9B model (GLM-OCR)
- ~$0.002/page
- ~$5 total on an L4 GPU
Before (old Tesseract OCR) → After
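A quick sanity check of those figures, as a minimal sketch: it multiplies the quoted page count by the quoted per-page rate. The variable names are illustrative, and the rate is the approximate one from the post, not a measured price.

```python
# Figures quoted in the post; the per-page rate is approximate.
pages = 2_724
cost_per_page_usd = 0.002

# Total cost is simply pages times the per-page rate.
total_usd = pages * cost_per_page_usd
print(f"Estimated total: ${total_usd:.2f}")
```

The result lands slightly above the rounded "~$5" in the post, which is consistent with both numbers being approximations.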
Get ready for live updates and quotes from @lyndamk.bsky.social and @mikalarae.bsky.social's #IDCC26 Keynote!
The Waldseemüller map in liiive (a IIIF annotation tool).
Yay! The first image served from my new #tinyIIIF test instance.
I'm running it on a 2 CPU/4GB RAM VM. Seems to be the absolute minimum & performance is pretty slow.
If anyone has advice on which specs I'd need to get more #IIIF speed out of Cantaloupe – let me know!
Announcing our latest paper: CommonLID
In collaboration with @commoncrawl.bsky.social, @mlcommons.org, and @jhu.edu, we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.
arxiv.org/abs/2601.18026
Meme from the Simpsons, reading "0 days without needing to save data from the US federal government."
It's official. One year ago today we formalized as the Data Rescue Project. We know this work is exhausting and are so grateful to everyone who has shown up, day in and day out.
🎉Congrats to the winners of the 2025 RDAP Work of the Year Award, The Data Rescue Project! (@datarescueproject.org)🛟🏆
This award acknowledges the work’s impact on the wider research and scholarly communication ecosystem in support of RDAP’s mission and values.
rdapassociation.org/news/13593532
“Burning the Books: A History of the Deliberate Destruction of Knowledge” by @richove.bsky.social.
@sucho-org.bsky.social has been preserving Ukrainian cultural heritage data since 2022. @safeguardingdata.bsky.social is an international group of volunteers backing up US data and distributing them as torrents on sciop.net.
Say hi to the wonderful people doing the @datarescueproject.org AMA tonight! @quetzal1234.bsky.social @nurnberger.bsky.social @storytracer.com Tess
Just missing for now:
@mikalarae.bsky.social and @katscade.bsky.social
We are always triaging at-risk datasets based on different factors: is there an executive order targeting a specific agency, do we have requests from data users or associations like @icpsr.bsky.social, and has someone else already backed it up?
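The three factors above could be sketched as a simple priority score. This is only an illustration: the `Dataset` fields, the weights, and the sample entries are all hypothetical and are not the Data Rescue Project's actual process.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    agency_targeted_by_order: bool  # an executive order targets the agency
    requested_by_users: bool        # data users or associations asked for it
    already_backed_up: bool         # someone else has already saved a copy


def triage_score(ds: Dataset) -> int:
    """Return a priority score; the weights are made up for illustration."""
    score = 0
    if ds.agency_targeted_by_order:
        score += 2
    if ds.requested_by_users:
        score += 1
    if not ds.already_backed_up:
        score += 2
    return score


# Hypothetical sample entries, highest priority first after sorting.
candidates = [
    Dataset("survey-a", True, True, False),
    Dataset("survey-b", True, False, True),
    Dataset("survey-c", False, True, False),
]
queue = sorted(candidates, key=triage_score, reverse=True)
for ds in queue:
    print(ds.name, triage_score(ds))
```

In practice such factors are probably weighed by people rather than by a formula; the score only makes the trade-off explicit.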
Built a 2.5MB image classifier in an evening with Claude Code. It runs entirely in the browser.
I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬.
It finds illustrated pages in historical books. No server. No GPU.
Diagram illustrating the BookReconciler workflow. On the left, a book cover of The Book of Salt by Monique Truong appears alongside “Minimal Metadata,” listing Author: Truong, Monique and Title: The Book of Salt. An arrow points to a box labeled “BookReconciler” with book and diamond icons. A downward arrow leads to “Enriched + Clustered Metadata,” showing multiple editions of the book cover and expanded metadata, including several ISBNs, subject headings (e.g., Vietnamese–France fiction, women authors, household employees, gay men, cooking), and an author VIAF identifier.
Very happy to introduce a new tool, BookReconciler!
You can take spreadsheets with book data and add subject headings, descriptions, ISBNs, HathiTrust IDs, & more. You can also cluster editions & variations of the same "Work."
Led by @thisismattmiller.com and supported by @post45data.bsky.social.
SUCHO has received the 2025 Karl Preusker medal from Bibliothek & Information Deutschland (BID)! The jury selected us as an example of “courage, solidarity, professional excellence, and the central role of libraries, archives, and digital infrastructures in the resilience of democratic societies.”
Working with a small museum/archive/project with digitized images but little metadata?
I'm looking for testers for a VLM pipeline for auto-enrichment (transcription, captions, tags, IIIF). If you share a few sample images, I'll run them through + share results. Would love to hear your feedback!
Thank you to @datarescueproject.org for publishing this blog post by @kdeeds.bsky.social and myself on GovScape! Extremely grateful to @datarescueproject.org for all their incredible work!
This is amazing from Benjamin Charles Germain Lee, @kdeeds.bsky.social, and @datarescueproject.org: www.datarescueproject.org/guest-post-g...
At the AI4LAM Fantastic Futures conference this week
Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!
DM or find me at breaks! #AI4LAM #FF2025
How do historians write history in the #21stcentury?
🎙️ In the podcast series Meet a Historian, our PhD researchers engage in conversations with some of today’s most innovative historians 👉 loom.ly/IWAoLc8
Brought to you by our CAPASIA 👉 loom.ly/O6KOqFo and @ecointeui.bsky.social research projects
BREAKING NEWS
Bureau of Labor Statistics announced cancellations of several key data releases
🔺 Job Openings and Labor Turnover (JOLTS)
🔺 Employment Situation
🔺 Consumer Price Index
... and more
This has knock-on effects for other products, like GDP (produced at BEA)
www.bls.gov/bls/2025-lap...
Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.
hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4
Building datasets to train smaller, task-focused models used to be incredibly time-consuming.
Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!
Try it yourself: huggingface.co/datasets/uv-...!
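To make the `--class-name` and `--confidence-threshold` options concrete, here is a minimal sketch of what filtering detections by class and score usually looks like. The `Detection` type, the `filter_detections` helper, and the sample values are hypothetical; this is not the code of the linked script.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float  # model confidence between 0 and 1
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def filter_detections(detections, class_name="photograph", threshold=0.4):
    """Keep detections of the requested class at or above the threshold."""
    return [
        d for d in detections
        if d.label == class_name and d.score >= threshold
    ]


# Made-up sample detections for one newspaper page.
sample = [
    Detection("photograph", 0.91, (10, 20, 200, 180)),
    Detection("photograph", 0.25, (220, 40, 300, 120)),
    Detection("illustration", 0.80, (50, 300, 180, 420)),
]
kept = filter_detections(sample)
print(len(kept), "detection(s) kept")
```

A lower threshold keeps more boxes at the cost of more false positives, so the right value depends on how much manual review the resulting dataset will get.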
1/ Announcing GovScape – a public search system for 10 million U.S. government PDFs (70 million pages)! GovScape offers visual search, semantic text search, and keyword search. Explore below:
Website: www.govscape.net
ArXiv link: arxiv.org/abs/2511.11010
The @mozilla.org team has done a spectacular job for MozFest 2025. If you’re also in Barcelona and would like to chat about Open Data and Open Source AI, send me a DM. I’m here until Monday! #mozfest #mozfest2025 #mozillafestival #mozilla #opensource #ai