5/ Infrastructure: JDK 21, Gradle 9, TensorFlow 2.17 (Python 3.10–3.11), pdfalto 0.6.0, wapiti 1.5.1, virtualenv/conda support for DeLFT.
6/ Full release notes → github.com/kermitt2/...
#GROBID #OpenSource #NLP #ScholarlyInfrastructure
6/6
Posts by Luca
4/ New pluggable NLP engines: Lingua for language identification, Blingfire for sentence segmentation — both available as drop-in alternatives to the existing defaults.
5/6
3/ Revised Crossref consolidation, developed in collaboration with the Crossref team — improved rate limit handling, better error recovery, and more robust reference matching.
4/6
Multi-architecture Docker images (amd64 + arm64) are now available, enabling native deployment on Apple Silicon and ARM cloud instances.
3/6
1/ New extraction coverage: conflict of interest & author contribution statements; figures, tables and equations from back/annex sections; URLs from PDF annotations; ORCID identifiers fetched via Crossref when absent from the source document.
2/ Native Linux ARM64 support.
2/6
I'm pleased to announce that GROBID 0.9.0 is out. Release notes and highlights below.
1/6
If you've ever lost time reformatting citations by hand, this one's for you.
sciencialab.github.i...
7/7
Results are automatically extracted, so always give the output a quick review before dropping it into your .bib file — especially on older or inconsistently formatted references.
6/7
Under the hood it's powered by #Grobid, a battle-tested machine learning library for extracting structured data from scientific documents. The same technology used to process millions of PDFs at scale — now doing one job really well, in one click.
5/7
No tab-switching, no copy-pasting, no reformatting.
github.com/user-atta...
4/7
But the real magic is the bookmarklet.
Drag it once to your bookmarks bar. Then, on any webpage — a journal site, a preprint server, a bibliography — highlight the citation, click the bookmarklet, and the BibTeX is already in your clipboard.
3/7
We built a simple tool to fix exactly this: text2bibtex.
The app is simple: paste any plain-text citation and get a clean, ready-to-use BibTeX entry back.
github.com/user-atta...
2/7
You found the paper. Now where's the BibTeX?
🧵
It happens every time! You track down the right reference — some older journal article, a conference paper, a preprint — scroll down to "Cite this paper", and get… a plain text citation.
1/7
@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below.
We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.
www.commoncrawl.org/blog/host--a...
We tried to carry on without the sharp mind of Patrice Lopez, who is currently immersed in new and exciting challenges. And on a lighter note, we might finally have a logo for Grobid!
4/4
It also enabled us to better coordinate our roadmap, integrating the benefits of advanced Large Language Models while preserving our commitment to enabling processing on standard consumer-grade hardware.
3/4
The meeting was extremely productive and helped us consolidate our community-driven vision for Grobid’s development in the coming years.
2/4
On 26-27 November, we held the #Grobid Camp at the Centre de #Inria Paris.
The goal was to bring together the major players in the French community, which spans government institutes, companies, and large-scale projects.
1/4
Me: Please, fix the tests!
LLM Agent: OK
The Safari browser is like a car with one gear that claims it does not pollute...
Exactly! There is a common misconception that throwing any kind of crap into a vector will magically work. Even in the age of AI, metadata still cannot be ignored.
Yes. The time is now. Vaccines to treat and prevent cancer.
www.jci.org/articles/vie...
Your feedback will help us improve Grobid! 🌟 Feel free to share your thoughts, star us on GitHub, and let’s keep building! 💬🚀
Next up, we're focusing on supporting more platforms (Linux ARM), improving figure and table extraction, enhancing CJK language support, and better handling of more document types such as theses and reports.
🔽
- 🔤 Improved recognition of non-standard fonts
- 🛠️ Various bug fixes and security vulnerabilities addressed
github.com/kermitt2/...
🔽
Grobid 0.8.2 is out! 🚀
- 🧠 New processing "flavors" for different doc types (e.g. SDO, corrections, editorials)
- 🔗 Improved URL extraction
- ✅ Better text extraction for paragraphs around figures and tables
🧵🔽
I estimate that a few examples for each model would quickly improve the results to an acceptable level.
Feel free to reach out if you are interested, and we can work out a collaboration around it.
I'm not sure Grobid is used in any project targeting any of the CJK languages, as other details might need to be addressed.
We started a low-priority branch (github.com/kermitt2/gro...) to improve support for the CJK languages all at once, but more urgent issues took priority at the time.
👇