Posts by IDI

GRIN Transfer: A production-ready tool for libraries to retrieve digital copies from Google Books
Publicly launched in 2004, the Google Books project has scanned tens of millions of items in partnership with libraries around the world. As part of this project, Google created the Google Return Inte...

Even if you're not a partner library, you might be curious about what it's like to work with GRIN. Our technical report has a wealth of details. arxiv.org/abs/2511.11447

5 months ago
Institutional Books | Institutional Data Initiative
Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.

We're also sharing the pipeline we developed for Institutional Books that seamlessly dedupes, classifies, and enhances the data once GRIN Transfer brings it down. www.institutional.org/tools

5 months ago
Announcing the release of GRIN Transfer
GRIN Transfer, an open source tool that allows Google Books partner libraries to more easily access their Google Books collection.

When libraries join Google Books, Google not only scans their books, it also makes a wealth of image, OCR, & metadata available to them via the Google Return Interface (GRIN). But working with GRIN can be challenging, so we're releasing a tool to make it easier. www.institutional.org/posts/grin-t...

5 months ago

Join us tomorrow at 10AM EST:
tinyurl.com/y3ye6cz6

7 months ago
Welcome! You are invited to join a meeting: IDI Talk with Michele Dolfi & Peter Staar on SmolDocling. After registering, you will receive a confirmation email about joining the meeting.
Please join us for a talk with Michele Dolfi & Peter Staar from IBM Research in Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.

Register to join the talk virtually: harvard.zoom.us/meeting/regi...

7 months ago

Can a small visual language model read documents as effectively as models 27 times its size?

Next Friday, IDI will host Michele Dolfi and Peter Staar from IBM Research Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.

7 months ago

This Monday, @institutionaldatainitiative.org will host Petr Knoth to share his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually via the link below.

10 months ago
Welcome! You are invited to join a webinar: Open AI Development. After registering, you will receive a confirmation email about joining the webinar.
For AI to truly benefit society, it must be built on foundations of transparency, fairness, and accountability—starting with the most foundational building block that powers it: data. Not long ago, ...

Cohosted by @institutionaldatainitiative.org and The Berkman Klein Center. harvard.zoom.us/webinar/regi...

10 months ago
Institutional Books | Institutional Data Initiative
Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.

We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses.

We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...

10 months ago
institutional/institutional-books-1.0 · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines

View the dataset on Hugging Face: huggingface.co/datasets/ins...

10 months ago

As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that utilizes line detection to reassemble the text according to the line type.

10 months ago

We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.

10 months ago

We analyzed the dataset’s coverage across time, topic, and language and found:
- 40% English text plus a long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries

Technical report here: arxiv.org/abs/2506.08300

10 months ago

Today we released Institutional Books 1.0, a 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 🧵

10 months ago

The @institutionaldatainitiative.org is proud to support The New Commons challenge. $100k grants along with mentorship. Let's get impactful data into the AI ecosystem.

1 year ago
Using AI to Accelerate Digitization at Boston Public Library
Today, as part of our mission expansion, we’re announcing a collaboration with BPL to develop AI-driven tools capable of accelerating new digitization of large collections at libraries across the worl...

As the @institutionaldatainitiative.org expands its mission, we’re announcing a collaboration with @bpl.boston.gov to develop AI-driven tools capable of accelerating new digitization at libraries across the world, starting at the Boston Public Library. institutionaldatainitiative.org/posts/using-...

1 year ago
Expanding Our Mission: An Open Call for Collaborators
Today, we’re pleased to announce an open call for institutional collaborators as new support expands the research capacity of the Institutional Data Initiative.

I'm pleased to announce we're expanding our mission at the @institutionaldatainitiative.org with an open call for institutional collaborators, new digitization at Harvard Law School Library, and additional support to advance this work. institutionaldatainitiative.org/posts/open-c...

1 year ago
How Knowledge Institutions Can Build a Promethean Moment
Why we’re launching the Institutional Data Initiative to work with libraries, government agencies, and other knowledge institutions to develop data collections and best practices for artificial intell...

Hello world. institutionaldatainitiative.org/hello-world....

1 year ago