Common Crawl Foundation
1.4K posts

@CommonCrawl
Common Crawl is a non-profit foundation dedicated to the Open Web.
San Francisco, CA. Joined February 2010
1.6K Following, 7.8K Followers

March 2026 Crawl Archive Now Available
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.


Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record
eff.org/deeplinks/2026…
@eff

Common Crawl Foundation would like to share with you an updated overview of our organization as of March 2026.
Please let us know if we can be helpful.
drive.google.com/file/d/1ww2R0x…
Common Crawl Foundation retweeted

github.com/commoncrawl/cc…
Summary of changes
This PR contains substantial redesign and refactoring for the following:
- Interactive charts instead of static images
- New domain lookup tool for plotting HC and PR (and even comparison of two different domains) over time
- Combine avgoutdegree and avgindegree plots into avgdegree (closes #4, "Merge plots avgoutdegree and avgindegree into avgdegree")
- Add appropriate reference links to harmonic centrality (closes #5, "Add links to research papers in the section explaining harmonic centrality")
- Make masthead image disappear quicker via parallax scrolling so that content is reached faster
- Substantial mobile/responsive UX improvements
- Improve rank tables UX to be one unified table
- Proper links and sanitization of anchor tags with rel= and target=
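
For readers unfamiliar with the HC metric mentioned in the list above, here is a minimal sketch of the textbook definition of harmonic centrality: the score of a node is the sum of 1/d over every other node that can reach it, where d is the shortest-path length in links. This is an illustration only, not the project's pipeline; the function name `harmonic_centrality` and the toy domains are assumptions made up for the example, and exact breadth-first search like this would not scale to a full web graph.

```python
from collections import deque


def harmonic_centrality(edges):
    """Return {node: sum of 1/d(u, node) over all other nodes u that can reach node}.

    `edges` is an iterable of (source, target) pairs, e.g. "source links to target".
    Hypothetical helper for illustration; not taken from any Common Crawl code.
    """
    nodes = set()
    reverse_adj = {}  # target -> set of sources, so BFS walks links backwards
    for src, dst in edges:
        nodes.update((src, dst))
        reverse_adj.setdefault(dst, set()).add(src)

    scores = {}
    for node in nodes:
        dist = {node: 0}
        queue = deque([node])
        total = 0.0
        while queue:
            cur = queue.popleft()
            for prev in reverse_adj.get(cur, ()):
                if prev not in dist:
                    # First visit in BFS order gives the shortest distance.
                    dist[prev] = dist[cur] + 1
                    total += 1.0 / dist[prev]
                    queue.append(prev)
        # Nodes that cannot reach `node` contribute 0 (1/infinity).
        scores[node] = total
    return scores


# Toy graph with made-up domains: a -> b -> c, and d -> c.
toy_edges = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("d.example", "c.example"),
]
scores = harmonic_centrality(toy_edges)
for domain in sorted(scores):
    print(domain, scores[domain])
```

In the toy graph, c.example is reached by two domains at distance 1 and one at distance 2, so it gets the highest score, while domains with no inbound links score zero.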

Introducing the New Examples & Resources Browser
We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.

Common Crawl Foundation retweeted

Announcing our latest paper: CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
In collaboration with @CommonCrawl, @MLCommons, and @JohnsHopkins, we worked with 80+ native speaker annotators to build a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.