Common Crawl Foundation

1.4K posts

@CommonCrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA · Joined February 2010

1.6K Following · 7.8K Followers

Common Crawl Foundation@CommonCrawl·2d

commoncrawl.org/blog/may-2026-…

169

Common Crawl Foundation@CommonCrawl·2d

May 2026 Crawl Archive Now Available: We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.

482

Common Crawl Foundation@CommonCrawl·6d

commoncrawl.org/blog/april-202…

240

Common Crawl Foundation@CommonCrawl·6d

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.

1.3K

Common Crawl Foundation@CommonCrawl·7 May

commoncrawl.org/blog/you-can-n…

215

Common Crawl Foundation@CommonCrawl·7 May

You can now build directly on Common Crawl from the browser: Browsers can now fetch Common Crawl data directly, with no backend needed. Build SQL explorers, snapshot viewers, and diff tools as static pages.

335
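
The browser-access post above does not show what such a fetch looks like. A minimal sketch, assuming data files are served over HTTP from `https://data.commoncrawl.org/` and that a record's file name, byte offset, and length are already known (for example from an index lookup): the hypothetical helper `build_range_request` prepares a ranged GET for a single record. The file path used below is a placeholder, not a real archive path. It is written in Python for brevity; a browser client would send the same `Range` header.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumption: public HTTP endpoint for Common Crawl data files.
DATA_BASE_URL = "https://data.commoncrawl.org/"


@dataclass(frozen=True)
class RecordLocation:
    """Where one archived record lives (hypothetical helper type)."""
    filename: str  # file path relative to the data endpoint
    offset: int    # byte offset of the record inside the file
    length: int    # length of the stored record in bytes


def build_range_request(loc: RecordLocation) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for an HTTP GET that fetches exactly one record."""
    if loc.offset < 0 or loc.length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    last_byte = loc.offset + loc.length - 1
    url = DATA_BASE_URL + loc.filename.lstrip("/")
    return url, {"Range": f"bytes={loc.offset}-{last_byte}"}


# Placeholder location for illustration only (not a real file path).
example = RecordLocation(
    filename="crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    offset=1000,
    length=500,
)
url, headers = build_range_request(example)
```

The returned URL and headers can be passed to any HTTP client. Requesting only the needed byte range is what makes a static page workable here, since whole archive files are far too large to download in a browser.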

Common Crawl Foundation reposted

Tristan Rhodes@tristanbob·3 May

Have you ever seen a user agent named "CCBOT"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10+ petabyte open-source dataset. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has access to the data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times, once for each research project, saving large amounts of bandwidth and resources.

551

Common Crawl Foundation@CommonCrawl·30 Apr

Sorry, now with the actual links. commoncrawl.org/blog/april-202… blog.commoncrawl.org/blog/host--and… commoncrawl.github.io/cc-crawl-stati… commoncrawl.github.io/cc-webgraph-st…

130

Common Crawl Foundation@CommonCrawl·30 Apr

April 2026 Crawl Announcement · April 2026 Web Graph Announcement · Crawl Statistics · Web Graph Statistics. Live long and prosper!

204

Common Crawl Foundation@CommonCrawl·30 Apr

Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.

429

Common Crawl Foundation@CommonCrawl·30 Apr

April 2026 Crawl Announcement · April 2026 Web Graph Announcement · Crawl Statistics · Web Graph Statistics

Common Crawl Foundation@CommonCrawl·6 Apr

commoncrawl.org/blog/april-202…

285

Common Crawl Foundation@CommonCrawl·6 Apr

April 2026 Common Crawl Newsletter

391

Common Crawl Foundation reposted

Financial Times@FT·20 Mar

Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion

119

93K

Common Crawl Foundation@CommonCrawl·20 Mar

commoncrawl.org/blog/march-202…

229

Common Crawl Foundation@CommonCrawl·20 Mar

March 2026 Crawl Archive Now Available: We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.

714

Common Crawl Foundation@CommonCrawl·17 Mar

Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record eff.org/deeplinks/2026… @eff

270

Common Crawl Foundation@CommonCrawl·16 Mar

commoncrawl.org/blog/ipv6-adop…

263

Common Crawl Foundation@CommonCrawl·16 Mar

IPv6 Adoption Across the Top 100K Web Hosts: We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.

420
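
The IPv6 post above does not describe its probing method. As a rough sketch of one ingredient such a probe could use, the code below checks whether a host name resolves to any IPv6 address and computes the share of hosts that do. This is an assumption for illustration, not the procedure behind the published figures: DNS resolution alone does not show that a host is "fully reachable", which would also need a successful connection. The function names and the injectable `resolve` parameter are hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def ipv6_addresses(host: str, port: int = 443,
                   resolve: Callable = socket.getaddrinfo) -> List[str]:
    """Return the distinct IPv6 addresses a host resolves to (empty if none)."""
    try:
        infos = resolve(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed, e.g. the name has no AAAA record.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})


def adoption_share(hosts: Iterable[str], port: int = 443,
                   resolve: Callable = socket.getaddrinfo) -> float:
    """Fraction of hosts that resolve to at least one IPv6 address."""
    host_list = list(hosts)
    if not host_list:
        return 0.0
    with_v6 = sum(1 for h in host_list if ipv6_addresses(h, port, resolve))
    return with_v6 / len(host_list)
```

Passing the resolver as a parameter keeps the logic testable without network access. A fuller probe would follow a successful lookup with a connection attempt over IPv6.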
