Check out our newsletter for April 2026, with updates on what we've been up to.
www.commoncrawl.org/blog/april-2...
Posts by Common Crawl Foundation
Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles.
commoncrawl.org/blog/announc...
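For readers who want to check the arithmetic, here is a minimal sketch of the unit conversion. It assumes a nibble is 4 bits (half a byte) and that a "tebibble" means 2^40 nibbles, by analogy with the tebibyte; the post itself does not define the unit, and the function names are illustrative.

```python
# Assumption: 1 nibble = 4 bits, so 2 nibbles per byte.
# Assumption: 1 "tebibble" = 2**40 nibbles, mirroring 1 TiB = 2**40 bytes.
NIBBLES_PER_BYTE = 2


def tib_to_tebibbles(tib: float) -> float:
    """Convert a size in tebibytes to (hypothetical) tebibbles."""
    return tib * NIBBLES_PER_BYTE


def tebibbles_to_tib(tebibbles: float) -> float:
    """Convert a size in (hypothetical) tebibbles back to tebibytes."""
    return tebibbles / NIBBLES_PER_BYTE


if __name__ == "__main__":
    # 689 tebibbles under these assumptions
    print(tebibbles_to_tib(689))
    # The March 2026 crawl size quoted elsewhere in this feed, in tebibbles
    print(round(tib_to_tebibbles(344.64), 2))
```

Under these assumptions, the 344.64 TiB reported for the March 2026 crawl comes out just above 689 tebibbles, which matches the figure in the post.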
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026.
commoncrawl.org/blog/host--a...
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.
commoncrawl.org/blog/march-2...
Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record
www.eff.org/deeplinks/20...
@eff.org
We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.
commoncrawl.org/blog/ipv6-ad...
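As a rough illustration of what such a probe involves, the sketch below checks whether a host name resolves to at least one IPv6 address and reports the share of hosts that do. This is not the methodology of the study: "fully reachable" presumably requires more than a DNS answer, and the function names and the injectable `resolver` parameter are assumptions made here so the logic can be tested offline.

```python
import socket
from typing import Callable, Iterable


def has_ipv6_address(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `host` resolves to at least one IPv6 address.

    Only DNS resolution is checked; no connection is attempted.
    """
    try:
        results = resolver(host, 443, family=socket.AF_INET6,
                           type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed: no IPv6 address is known for this host.
        return False
    return len(results) > 0


def ipv6_share(hosts: Iterable[str],
               resolver: Callable = socket.getaddrinfo) -> float:
    """Percentage of `hosts` with at least one IPv6 address."""
    host_list = list(hosts)
    if not host_list:
        return 0.0
    supported = sum(has_ipv6_address(h, resolver) for h in host_list)
    return 100.0 * supported / len(host_list)


if __name__ == "__main__":
    # Requires network access; results depend on the local resolver.
    for name in ["commoncrawl.org", "example.com"]:
        print(name, has_ipv6_address(name))
```

A fuller probe would also attempt a TCP or HTTPS connection over IPv6, since a published address does not guarantee that the host answers on it.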
Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR filtering, and merged degree plots.
commoncrawl.org/blog/web-gra...
@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below.
Introducing the second installment in our Whirlwind Tour series, covering crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data workflows.
commoncrawl.org/blog/announc...
We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.
www.commoncrawl.org/blog/host--a...
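To put those node and edge counts in perspective, a small sketch can derive the average number of edges per node for each graph. The figures are the rounded ones from the post, so the results are approximate, and the helper name is illustrative.

```python
def average_degree(nodes: float, edges: float) -> float:
    """Average number of edges per node in a directed graph."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return edges / nodes


# Rounded figures quoted in the post.
HOST_NODES, HOST_EDGES = 288.6e6, 12.4e9
DOMAIN_NODES, DOMAIN_EDGES = 134.2e6, 5.4e9

if __name__ == "__main__":
    print(f"host level:   {average_degree(HOST_NODES, HOST_EDGES):.1f} edges per node")
    print(f"domain level: {average_degree(DOMAIN_NODES, DOMAIN_EDGES):.1f} edges per node")
```

On these rounded inputs, both graphs come out at a little over 40 edges per node.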
We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.
blog.commoncrawl.org/blog/introdu...
We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.
blog.commoncrawl.org/blog/februar...
Preserving The Web Is Not The Problem. Losing It Is.
Mark Graham, Director of the Wayback Machine at @archive.org, walks us through the importance of preserving the Web in this recent post:
www.techdirt.com/2026/02/17/p...
Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference and the 10,000 at the main event couldn't be bigger.
commoncrawl.org/blog/ai-plum...
We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.
commoncrawl.org/blog/cc-cita...
Announcing our latest paper: CommonLID
In collaboration with @commoncrawl.bsky.social, @mlcommons.org, and @jhu.edu, we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.
arxiv.org/abs/2601.18026
Read the blogpost: commoncrawl.org/blog/commonl...
Dataset: huggingface.co/datasets/com...
Preprint: www.arxiv.org/abs/2601.18026
CommonLID can help us to create the next generation of open-source LangID models, which can in turn help create larger multilingual datasets. We would like to thank members of Masakhane and @seacrowd.bsky.social for their support in this effort.
CommonLID proves to be the most challenging dataset in our evaluation of existing LangID systems, as can be seen in the last column of the table above.
A table showing the results of our evaluation of existing LangID systems across 6 different datasets. Full text of the table is available on the paper linked below.
Current benchmarks over-estimate LangID performance on web data. In our evaluations, we show top existing models have < 80% F1, even when limiting to languages the models explicitly support.
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.
Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org, @eleutherai.bsky.social, @jhu.edu, and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!
The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.
www.commoncrawl.org/blog/host--a...
We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.
www.commoncrawl.org/blog/january...
Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.
www.commoncrawl.org/blog/web-arc...
As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.
commoncrawl.org/blog/how-seo...
GneissWeb Annotations Examples
A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.
commoncrawl.org/blog/gneissw...
From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.
www.commoncrawl.org/blog/common-...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.
commoncrawl.org/blog/host--a...