Posts by Sebastian Majstorovic

The Europeana Initiative is proud to sign the Open Heritage Statement
Read on to discover more about the Statement’s unified vision for the public domain, and how it links to our work and advocacy for open cultural heritage!

We’ve signed the #OpenHeritageStatement to support a global call for equitable access to #PublicDomain heritage in the digital environment!

Discover more about the Statement, how it links to our advocacy for open #CulturalHeritage and how you can sign ➡️ bit.ly/4sB1Ur0

@creativecommons.bsky.social

3 days ago 7 3 0 0
‘Proactively fall in line:’ Holocaust Memorial Museum quietly changed content after Trump returned to office
Two former employees said they believed the museum was altering its content preemptively to avoid unwanted negative attention from the Trump administration.

After Trump took office last year, the US Holocaust Museum quietly removed educational material about American racism from its website and canceled a workshop on the "fragility of democracy," @iriesentner.bsky.social reports. www.politico.com/news/2026/04...

2 weeks ago 733 357 52 28

You never know what data will be used for!

I uploaded a @britishlibrary.bsky.social dataset to Hugging Face in 2022. IIRC one of my first PRs to an HF repo!

4 years later, someone trains a Victorian chatbot on it

More libraries should be sharing their public domain collections for AI to build on!

2 weeks ago 83 7 7 2
How Fast Was the Mail? Explore mail transit times via railway between major U.S. cities, 1882–1908.

📣 New historical data visualization! "How Fast Was the Mail?" is an interactive map showing how long information took to travel across the US between 1882-1908: cblevins.github.io/mail-time/

3 weeks ago 216 92 8 9

I’m at #HSP2026 at MIT this week! I’ll be giving a talk Friday at 5:25pm entitled “Structural Priming Effects in Language Models are Less Human-like in Languages Other Than English”. Looking forward to chatting to everyone!

3 weeks ago 13 2 0 0
Farming in the Dark: Unreliable USDA data jeopardizes a sustainable farming transition
As farmers struggle to adjust to a changing climate that has led to an increased frequency of droughts, floods, and other extreme weather events, incomplete or inaccurate public data has become a grow...

Interesting new piece from @iatp.bsky.social showing how confidence in USDA data is eroding under the Trump Administration, with serious consequences. www.iatp.org/farming-risk...

1 month ago 28 11 1 2
Independence National Historical Park - A Hopeful Update
We recently posted about the takedown of signs at the President’s House site at Independence National Historical Park. We also posted a call for more photos for Save Our Signs, both before and after…

New DRP post: #Philly interpretive panels returned to the President's House. We congratulate the local community that fought HARD for their return. We hope this inspires other communities to #SaveOurSigns

www.datarescueproject.org/independence...

1 month ago 15 5 0 1
Decorative images with a screenshot of tiny.iiif in the background, and image server choices (as text labels) in the foreground: IIPImage, Cantaloupe.

Small update to #tinyIIIF, my no-nonsense #IIIF server! You can now choose your image server during setup:

• IIPImage
• Cantaloupe

Running small- to mid-sized collections? Teaching with IIIF materials? Building IIIF-enabled tools? Check out tiny.iiif!

github.com/rsimon/tiny-...

#DigitalFriday

1 month ago 7 4 1 0
Screenshot of old vs new ocr. old ocr text is garbled. New ocr much cleaner.

Re-OCR'd the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @hf.co Jobs.

- 0.9B model (GLM-OCR)
- ~$0.002/page
- ~$5 total on an L4 GPU

Before (old Tesseract OCR) → After

2 months ago 96 16 5 6

Get ready for live updates and quotes from @lyndamk.bsky.social and @mikalarae.bsky.social's #IDCC26 Keynote!

2 months ago 10 3 7 1
The Waldseemüller map in liiive (a IIIF annotation tool).

Yay! The first image served from my new #tinyIIIF test instance.

I'm running it on a 2 CPU/4GB RAM VM. Seems to be the absolute minimum & performance is pretty slow.

If anyone has advice on which specs I'd need to get more #IIIF speed out of Cantaloupe – let me know!

2 months ago 2 1 0 0
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

2 months ago 22 12 1 0
Meme from the Simpsons, reading "0 days without needing to save data from the US federal government."

It's official. One year ago today we formalized as the Data Rescue Project. We know this work is exhausting and are so grateful to everyone who has shown up, day in and day out.

2 months ago 29 3 1 2
Research Data Access and Preservation Association - 2025 RDAP Work of the Year Award

🎉Congrats to the winners of the 2025 RDAP Work of the Year Award, The Data Rescue Project! (@datarescueproject.org)🛟🏆

This award acknowledges the work’s impact on the wider research and scholarly communication ecosystem in support of RDAP’s mission and values.

rdapassociation.org/news/13593532

2 months ago 20 7 0 3

“Burning the Books: A History of the Deliberate Destruction of Knowledge” by @richove.bsky.social.

2 months ago 6 0 1 0
SciOp - Public Information Preservation
Preserving Public Information

@sucho-org.bsky.social has been preserving Ukrainian cultural heritage data since 2022. @safeguardingdata.bsky.social is an international group of volunteers backing up US data and distributing them as torrents on sciop.net.

2 months ago 4 1 0 0

Say hi! to the wonderful people doing the @datarescueproject.org AMA tonight! @quetzal1234.bsky.social @nurnberger.bsky.social @storytracer.com Tess

Just missing for now:
@mikalarae.bsky.social and @katscade.bsky.social

2 months ago 13 2 2 0

We are always triaging at-risk datasets based on different factors: is there an executive order targeting a specific agency, do we have requests from data users or associations like @icpsr.bsky.social, and has someone else already backed it up?

2 months ago 2 0 0 0

Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code.

I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬.

It finds illustrated pages in historical books. No server. No GPU.

4 months ago 86 18 2 1
Diagram illustrating the BookReconciler workflow. On the left, a book cover of The Book of Salt by Monique Truong appears alongside “Minimal Metadata,” listing Author: Truong, Monique and Title: The Book of Salt. An arrow points to a box labeled “BookReconciler” with book and diamond icons. A downward arrow leads to “Enriched + Clustered Metadata,” showing multiple editions of the book cover and expanded metadata, including several ISBNs, subject headings (e.g., Vietnamese–France fiction, women authors, household employees, gay men, cooking), and an author VIAF identifier.

Very happy to introduce a new tool, BookReconciler!

You can take spreadsheets with book data and add subject headings, descriptions, ISBNs, HathiTrust IDs, & more. You can also cluster editions & variations of the same "Work."

Led by @thisismattmiller.com and supported by @post45data.bsky.social.

4 months ago 123 56 7 1
Awards
The umbrella association Bibliothek & Information Deutschland (BID) e. V. has awarded the 2025 Karl Preusker Medal to the Ukrainian Library Association. “The award recognizes the extraordinary...

SUCHO has received the 2025 Karl Preusker medal from Bibliothek & Information Deutschland (BID)! The jury selected us as an example of “courage, solidarity, professional excellence, and the central role of libraries, archives, and digital infrastructures in the resilience of democratic societies.”

4 months ago 22 2 2 0

Working with a small museum/archive/project with digitized images but little metadata?

I'm looking for testers for a VLM pipeline for auto-enrichment (transcription, captions, tags, IIIF). If you share a few sample images, I'll run them through + share results. Would love to hear your feedback!

4 months ago 10 5 3 3

Thank you to @datarescueproject.org for publishing this blog post by @kdeeds.bsky.social and myself on GovScape! Extremely grateful to @datarescueproject.org for all their incredible work!

4 months ago 5 2 0 0
Guest Post: GovScape: A Public Search System for 10+ Million Government PDFs
This week's guest post is from Benjamin Charles Germain Lee, Assistant Professor at the University of Washington, and Kyle Deeds, Assistant Professor at Boston University. Learn more about their recen...

This is amazing from Benjamin Charles Germain Lee, @kdeeds.bsky.social, and @datarescueproject.org: www.datarescueproject.org/guest-post-g...

4 months ago 5 2 0 0

At the AI4LAM Fantastic Futures conference this week

Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!

DM or find me at breaks! #AI4LAM #FF2025

4 months ago 15 3 0 0
Podcast Series "Meet a Historian"
A podcast brought to you by the ERC projects CAPASIA and ECOINT. How do historians write history in the 21st century? In this podcast series, early career researchers at the European University Inst...

How do historians write history in the #21stcentury?

🎙️ In the podcast series Meet a Historian, our PhD researchers engage in conversations with some of today’s most innovative historians 👉 loom.ly/IWAoLc8

Brought to you by our CAPASIA 👉 loom.ly/O6KOqFo and @ecointeui.bsky.social research projects

4 months ago 6 2 0 0
Revised news release dates following the 2025 lapse in appropriations

BREAKING NEWS
Bureau of Labor Statistics announced cancellations of several key data releases

🔺 Job Openings and Labor Turnover (JOLTS)
🔺 Employment Situation
🔺 Consumer Price Index
... and more

This has knock-on effects for other products, like GDP (produced at BEA)

www.bls.gov/bls/2025-lap...

4 months ago 4 3 1 1
Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.

hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4

Building datasets to train smaller, task-focused models used to be incredibly time-consuming.

Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!

Try it yourself: huggingface.co/datasets/uv-...!

4 months ago 53 12 1 0

1/ Announcing GovScape – a public search system for 10 million U.S. government PDFs (70 million pages)! GovScape offers visual search, semantic text search, and keyword search. Explore below:

Website: www.govscape.net
ArXiv link: arxiv.org/abs/2511.11010

5 months ago 79 35 3 4

The @mozilla.org team has done a spectacular job for MozFest 2025. If you’re also in Barcelona and would like to chat about Open Data and Open Source AI, send me a DM. I’m here until Monday! #mozfest #mozfest2025 #mozillafestival #mozilla #opensource #ai

5 months ago 10 0 1 0