Even if you're not a partner library, you might be curious about what it's like to work with GRIN. Our technical report has a wealth of details. arxiv.org/abs/2511.11447
We're also sharing the pipeline we developed for Institutional Books that seamlessly dedupes, classifies, and enhances the data once GRIN Transfer brings it down. www.institutional.org/tools
When libraries join Google Books, Google not only scans their books, it also makes a wealth of image, OCR, & metadata available to them via the Google Return Interface (GRIN). But working with GRIN can be challenging, so we're releasing a tool to make it easier. www.institutional.org/posts/grin-t...
Join us tomorrow at 10AM EST:
tinyurl.com/y3ye6cz6
Can a small visual language model read documents as effectively as models 27 times its size?
Next Friday, IDI will host Michele Dolfi and Peter Staar from IBM Research Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.
This Monday, @institutionaldatainitiative.org will host Petr Knoth to share his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually via the link below.
Cohosted by @institutionaldatainitiative.org and The Berkman Klein Center. harvard.zoom.us/webinar/regi...
We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses.
We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...
We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines
View the dataset on Hugging Face: huggingface.co/datasets/ins...
As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that uses line detection to reassemble the text according to line type.
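To illustrate the idea, here is a minimal sketch of line-type-based reassembly. The label names ("header", "body", "footnote") and the merging rules are assumptions for illustration; the actual pipeline's label set and logic may differ.

```python
def reassemble(lines):
    """Rejoin OCR lines into flowing text based on detected line types.

    `lines` is a list of (text, line_type) pairs. Running headers are
    dropped, body lines are merged into a paragraph (rejoining words
    hyphenated across line breaks), and footnotes are collected at the end.
    """
    paragraphs, footnotes, current = [], [], []
    for text, line_type in lines:
        if line_type == "header":
            continue  # drop running headers and other page furniture
        if line_type == "footnote":
            footnotes.append(text)
            continue
        # body line: if the previous line ended with a hyphen, join the
        # split word; otherwise append as a new word run
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + text
        else:
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs + footnotes)
```

For example, feeding it a page whose header is "CHAPTER I" and whose body splits "lazy" across two lines would drop the header, dehyphenate the body, and move the footnote to the end.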
We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.
We analyzed the dataset’s coverage across time, topic, and language and found:
- ~40% English text, plus a long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries
Technical report here: arxiv.org/abs/2506.08300
Today we released Institutional Books 1.0, a 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 🧵
The @institutionaldatainitiative.org is proud to support The New Commons challenge. $100k grants along with mentorship. Let's get impactful data into the AI ecosystem.
As the @institutionaldatainitiative.org expands its mission, we’re announcing a collaboration with @bpl.boston.gov to develop AI-driven tools capable of accelerating new digitization at libraries across the world, starting at the Boston Public Library. institutionaldatainitiative.org/posts/using-...
I'm pleased to announce we're expanding our mission at the @institutionaldatainitiative.org with an open call for institutional collaborators, new digitization at Harvard Law School Library, and additional support to advance this work. institutionaldatainitiative.org/posts/open-c...