Posts by Common Crawl Foundation

Preview
Common Crawl - Blog - April 2026 Common Crawl Newsletter Check out our newsletter for April 2026, with updates on what we've been up to.

Check out our newsletter for April 2026, with updates on what we've been up to.

www.commoncrawl.org/blog/april-2...

2 weeks ago 1 1 0 0
Preview
Common Crawl - Blog - Announcing a Change to Common Crawl Dataset Size Reporting Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. O...

Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles.

commoncrawl.org/blog/announc...

2 weeks ago 5 1 0 0
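The figure in this post can be checked against the March 2026 crawl size quoted elsewhere in this feed (344.64 TiB). This is a minimal sketch, assuming the conventional definition of a nibble as 4 bits (half an 8-bit byte) and reading the post's "tebibbles" as units of 2**40 nibbles. The function name is hypothetical.

```python
# Assumption: a nibble is 4 bits, so one 8-bit byte holds two nibbles.
NIBBLES_PER_BYTE = 2


def tebibytes_to_tebinibbles(tib: float) -> float:
    """Convert a size in TiB to the same size counted in units of 2**40 nibbles."""
    return tib * NIBBLES_PER_BYTE


# 344.64 TiB is the uncompressed size quoted for the March 2026 crawl.
march_2026 = tebibytes_to_tebinibbles(344.64)
print(f"{march_2026:.2f}")  # prints 689.28
```

Under that reading, the March 2026 crawl comes to just over 689 of the post's units, consistent with "exceeds 689".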
Preview
Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2026 We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026. The graphs consist of 270.2 million nodes and 9 billion edg...

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026.

commoncrawl.org/blog/host--a...

3 weeks ago 2 0 0 0
Preview
Common Crawl - Blog - March 2026 Crawl Archive Now Available We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, exp...

We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.

commoncrawl.org/blog/march-2...

1 month ago 1 0 0 0
Preview
Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record Imagine a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper. That’s effectively what’s begun happening online in the last few months. The Internet Archive—th...

Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record

www.eff.org/deeplinks/20...

@eff.org

1 month ago 18 12 0 0
Preview
Common Crawl - Blog - IPv6 Adoption Across the Top 100K Web Hosts We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the ...

We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.

commoncrawl.org/blog/ipv6-ad...

1 month ago 2 1 0 0
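A rough idea of what such a probe involves can be sketched as follows. This is not the methodology of the blog post: it only checks whether a host name resolves to at least one IPv6 address, which is a weaker condition than being "fully reachable over IPv6". The function names and the injectable `resolve` and `check` parameters are assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable


def has_ipv6_address(host: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Return True if the host name resolves to at least one IPv6 address.

    Assumption: a successful IPv6 lookup is used as a stand-in for IPv6
    support; a real probe would also attempt a connection.
    """
    try:
        results = resolve(host, 443, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        # No IPv6 address (or no address at all) for this name.
        return False
    return len(results) > 0


def ipv6_share(hosts: Iterable[str],
               check: Callable[[str], bool] = has_ipv6_address) -> float:
    """Return the percentage of hosts for which the check succeeds."""
    host_list = list(hosts)
    if not host_list:
        return 0.0
    supported = sum(1 for host in host_list if check(host))
    return 100.0 * supported / len(host_list)
```

Passing the resolver and the check as parameters keeps the share calculation testable without network access; with the defaults, `ipv6_share` performs live DNS lookups.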
Preview
Common Crawl - Blog - Web Graph Statistics Gets a Proper Upgrade Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR f...

Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR filtering, and merged degree plots.

commoncrawl.org/blog/web-gra...

1 month ago 3 1 0 0
@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below

1 month ago 3 1 0 0
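For readers unfamiliar with what a colour contrast audit measures, the sketch below computes the contrast ratio as defined in WCAG 2.x from two sRGB colours. It illustrates the standard formula only and is not taken from the study; the function names are hypothetical.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance of an sRGB colour, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Contrast ratio between two colours, from 1.0 up to 21.0."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# WCAG level AA asks for at least 4.5:1 for normal-size text.
passes_aa = contrast_ratio((0, 0, 0), (255, 255, 255)) >= 4.5
```

Black on white gives the maximum ratio of about 21:1, and any colour against itself gives 1:1.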
Preview
Common Crawl - Blog - Announcing the Whirlwind Tour of Common Crawl's Datasets Using Java Introducing the second installment in our Whirlwind Tour series, covering crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data w...

Introducing the second installment in our Whirlwind Tour series, covering crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data workflows.

commoncrawl.org/blog/announc...

1 month ago 2 0 0 0
Preview
Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2025 and January/February 2026 We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes a...

We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.

www.commoncrawl.org/blog/host--a...

1 month ago 3 2 0 0
Preview
Common Crawl - Blog - Introducing the New Examples & Resources Browser We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and s...

We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.

blog.commoncrawl.org/blog/introdu...

1 month ago 4 2 0 0
Preview
Common Crawl - Blog - February 2026 Crawl Archive Now Available We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million regist...

We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.

blog.commoncrawl.org/blog/februar...

1 month ago 5 1 0 0
Preview
Preserving The Web Is Not The Problem. Losing It Is. Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Ar…

Preserving The Web Is Not The Problem. Losing It Is.

Mark Graham, Director of the Wayback Machine at @archive.org, walks us through the importance of preserving the Web in this recent post:

www.techdirt.com/2026/02/17/p...

2 months ago 1 0 0 0
Preview
Common Crawl - Blog - AI Plumbers at FOSDEM’26 Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference and the 10,000 people at the main event couldn't be...

Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference and the 10,000 people at the main event couldn't be bigger.

commoncrawl.org/blog/ai-plum...

2 months ago 2 0 0 0
Preview
Common Crawl - Blog - CC-Citations: A Visualization of Research Papers Referencing Common Crawl We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.

We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.

commoncrawl.org/blog/cc-cita...

2 months ago 2 0 0 0
Preview
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

2 months ago 22 12 1 0
Preview
Common Crawl - Blog - CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data We are excited to announce the release of CommonLID, a language identification benchmark for the web, covering 109 languages. CommonLID was developed in collaboration with multiple open-source organiz...

Read the blogpost: commoncrawl.org/blog/commonl...

Dataset: huggingface.co/datasets/com...

Preprint: www.arxiv.org/abs/2601.18026

2 months ago 3 0 0 0

CommonLID can help us to create the next generation of open-source LangID models, which can in turn help create larger multilingual datasets. We would like to thank members of Masakhane and @seacrowd.bsky.social for their support in this effort.

2 months ago 3 0 1 0

CommonLID proves to be the most challenging dataset in our evaluation of existing LangID systems, as can be seen in the last column of the table above.

2 months ago 3 0 1 0
A table showing the results of our evaluation of existing LangID systems across 6 different datasets. Full text of the table is available on the paper linked below.

Current benchmarks over-estimate LangID performance on web data. In our evaluations, we show top existing models have < 80% F1, even when limiting to languages the models explicitly support.

2 months ago 3 0 1 0
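As a reminder of what the F1 figure above summarises, here is a minimal sketch of the per-language F1 score computed from true positives, false positives and false negatives. How the paper aggregates scores across languages is not stated in this thread, so only the single-class formula is shown; the function name is hypothetical.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 for one language: the harmonic mean of precision and recall.

    Equivalent to 2 * P * R / (P + R), with P = TP / (TP + FP)
    and R = TP / (TP + FN).
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        # No predictions and no gold labels for this language.
        return 0.0
    return 2 * true_pos / denominator


# Example: 8 correct labels, 2 spurious, 2 missed.
example = f1_score(true_pos=8, false_pos=2, false_neg=2)
```

In this example precision and recall are both 0.8, so the F1 score is also 0.8.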
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.

Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!

2 months ago 11 5 1 0
Preview
Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026 The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level n...

The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

www.commoncrawl.org/blog/host--a...

2 months ago 1 0 0 0
Preview
Common Crawl - Blog - January 2026 Crawl Archive Now Available We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

www.commoncrawl.org/blog/january...

2 months ago 3 0 0 0
Preview
Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

www.commoncrawl.org/blog/web-arc...

2 months ago 1 0 0 0
Preview
Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

commoncrawl.org/blog/how-seo...

3 months ago 4 0 0 0
Preview
Common Crawl - Blog - GneissWeb Annotations Examples A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

GneissWeb Annotations Examples

A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

commoncrawl.org/blog/gneissw...

3 months ago 2 1 0 0
Preview
Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025 From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

www.commoncrawl.org/blog/common-...

3 months ago 0 0 0 0
Preview
Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025 We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

commoncrawl.org/blog/host--a...

3 months ago 0 0 0 0
Preview
Common Crawl - Blog - December 2025 Crawl Archive Now Available The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

commoncrawl.org/blog/decembe...

3 months ago 0 0 0 0
Preview
Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and refere...

As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced.

commoncrawl.org/blog/a-sampl...

4 months ago 2 0 0 0