Luca (@sciencialab.com) Bsky

Release 0.9.0 · grobidOrg/grobid What's Changed Added Conflict of interest and author contributions statement extraction in header and segmentation models #1319 Extract figures, tables and equations from back/annex sections #...

5/ Infrastructure: JDK 21, Gradle 9, TensorFlow 2.17 (Python 3.10–3.11), pdfalto 0.6.0, wapiti 1.5.1, virtualenv/conda support for DeLFT.
6/ Full release notes → github.com/kermitt2/...

#GROBID #OpenSource #NLP #ScholarlyInfrastructure
6/6

9 hours ago

4/ New pluggable NLP engines: Lingua for language identification, Blingfire for sentence segmentation — both available as drop-in alternatives to the existing defaults.
5/6

9 hours ago

3/ Revised Crossref consolidation, developed in collaboration with the Crossref team — improved rate limit handling, better error recovery, and more robust reference matching.
4/6

9 hours ago

Multi-architecture Docker images (amd64 + arm64) are now available, enabling native deployment on Apple Silicon and ARM cloud instances.
3/6

9 hours ago

1/ New extraction coverage: conflict of interest & author contribution statements; figures, tables and equations from back/annex sections; URLs from PDF annotations; ORCID identifiers fetched via Crossref when absent from the source document.
2/ Native Linux ARM64 support.
2/6

9 hours ago

I'm pleased to announce that GROBID 0.9.0 is out. Release notes and highlights below.
1/6

9 hours ago

If you've ever lost time reformatting citations by hand, this one's for you.
sciencialab.github.i...
7/7

2 weeks ago

Results are automatically extracted, so always give the output a quick review before dropping it into your .bib file — especially on older or inconsistently formatted references.
6/7

2 weeks ago

Under the hood it's powered by #Grobid, a battle-tested machine learning library for extracting structured data from scientific documents. The same technology used to process millions of PDFs at scale — now doing one job really well, in one click.
5/7

2 weeks ago
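
For readers curious what "powered by Grobid" can look like in practice, here is a minimal sketch of turning a plain-text citation into BibTeX through a GROBID server's citation-parsing service. How text2bibtex itself calls Grobid is not stated in this thread, so treat everything here as an assumption: the server address, the function names (hypothetical helpers), and the choice to request BibTeX through the `Accept` header.

```python
import urllib.parse
import urllib.request

# Assumption: a GROBID server is reachable at this address (8070 is its usual default port).
GROBID_URL = "http://localhost:8070"


def build_citation_request(citation: str, base_url: str = GROBID_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for GROBID's processCitation service."""
    form = urllib.parse.urlencode({
        "citations": citation,        # the raw, plain-text citation to parse
        "consolidateCitations": "0",  # 0 = parse only, no lookup against Crossref
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/processCitation",
        data=form,
        headers={"Accept": "application/x-bibtex"},  # ask for BibTeX instead of TEI XML
        method="POST",
    )


def citation_to_bibtex(citation: str, base_url: str = GROBID_URL) -> str:
    """Send the request to a running server and return the response body as text."""
    request = build_citation_request(citation, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server running, `print(citation_to_bibtex("..."))` with a citation string of your own should print a BibTeX entry. The request is built separately from the network call so it can be inspected without a server.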

No tab-switching, no copy-pasting, no reformatting.

github.com/user-atta...
4/7

2 weeks ago

But the real magic is the bookmarklet.

Drag it once to your bookmarks bar. Then, on any webpage — a journal site, a preprint server, a bibliography — highlight the citation, click the bookmarklet, and the BibTeX is already in your clipboard.

3/7

2 weeks ago

We built a simple tool to fix exactly this: text2bibtex.
The app is simple: paste any plain-text citation and get a clean, ready-to-use BibTeX entry back.
github.com/user-atta...
2/7

2 weeks ago

You found the paper. Now where's the BibTeX?
🧵
It happens every time! You track down the right reference — some older journal article, a conference paper, a preprint — scroll down to "Cite this paper", and get… a plain text citation.
1/7

2 weeks ago

@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below.

1 month ago

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2025 and January/February 2026

We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.

www.commoncrawl.org/blog/host--a...

1 month ago

We tried to carry on without the sharp mind of Patrice Lopez, who is currently immersed in new and exciting challenges. And on a lighter note, we might finally have a logo for Grobid!
4/4

4 months ago

It also enabled us to better coordinate our roadmap, integrating the benefits of advanced Large Language Models while preserving our commitment to enabling processing on standard consumer-grade hardware.
3/4

4 months ago

The meeting was extremely productive and helped us consolidate our community-driven vision for Grobid’s development in the coming years.
2/4

4 months ago

On 26-27 November we held the #Grobid Camp at the Centre de #Inria Paris.
The goal was to meet the major players in the French community, which ranges from government institutes to companies and large-scale projects.
1/4

4 months ago

Me: Please, fix the tests!
LLM Agent: OK

5 months ago

The Safari browser is like a car with one gear that claims it does not pollute...

7 months ago

Exactly! There is a common misconception that throwing any kind of crap into a vector will magically work. Even in the age of AI, metadata still cannot be ignored.

8 months ago

Yes. The time is now. Vaccines to treat and prevent cancer.
www.jci.org/articles/vie...

9 months ago

Your feedback will help us improve Grobid! 🌟 Feel free to share your thoughts, star us on GitHub, and let’s keep building! 💬🚀

10 months ago

Next up, we're focusing on supporting more platforms (Linux ARM), improving figures and tables extraction, enhancing CJK language support, and providing better handling for more document types like theses, reports, and more.
🔽

10 months ago

Release 0.8.2 · kermitt2/grobid What's Changed Added New model specialization/variants (flavors) mechanism #1151 Specialization/variant process for a lightweight processing that covers other types of scientific articles that...

- 🔤 Improved recognition of non-standard fonts
- 🛠️ Various bug fixes and security vulnerabilities addressed

github.com/kermitt2/...
🔽

10 months ago

Grobid 0.8.2 is out! 🚀
- 🧠 New processing "flavors" for different doc types (e.g. SDO, corrections, editorials)
- 🔗 Improved URL extraction
- ✅ Better text extraction for paragraphs around figures and tables
🧵🔽

10 months ago

I estimate that a few examples for each model would quickly improve the results to an acceptable level.
Feel free to reach out if you are interested, and we can work out a collaboration around it.

10 months ago

I'm not sure Grobid is used in any project targeting CJK languages, as other details might need to be addressed.
We started a low-priority branch (github.com/kermitt2/gro...) to improve support for all CJK languages at once, but other more urgent issues were prioritized at the time.
👇

10 months ago
