Common Crawl Foundation

1.4K posts

@CommonCrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA · Joined February 2010
1.6K Following · 7.8K Followers
Common Crawl Foundation (@CommonCrawl):
May 2026 Crawl Archive Now Available
We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.
Common Crawl Foundation (@CommonCrawl):
As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.
Common Crawl Foundation (@CommonCrawl):
You can now build directly on Common Crawl from the browser
Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.
Common Crawl Foundation retweeted
Tristan Rhodes (@tristanbob):
Have you ever seen a user agent named "CCBOT"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10+ petabytes open-source dataset. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has the access to data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times for each research, saving large amounts of bandwidth and resources.
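For readers who want to check their own server logs for such visits, here is a minimal sketch. It assumes the crawler announces itself with a `CCBot` token in the `User-Agent` header (for example `CCBot/2.0 (https://commoncrawl.org/faq/)`). The helper name `is_ccbot` is hypothetical, and a matching string is not proof of origin, since user agents can be spoofed.

```python
import re

# Assumption: the crawler's User-Agent contains a "CCBot" token,
# e.g. "CCBot/2.0 (https://commoncrawl.org/faq/)". Matching is
# case-insensitive because logs and posts spell it inconsistently.
_CCBOT_TOKEN = re.compile(r"\bccbot\b", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a CCBot token.

    Hypothetical helper for illustration only; it inspects the string
    and does not verify that the request really came from Common Crawl.
    """
    return bool(_CCBOT_TOKEN.search(user_agent or ""))


if __name__ == "__main__":
    samples = [
        "CCBot/2.0 (https://commoncrawl.org/faq/)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    for ua in samples:
        print(ua, "->", is_ccbot(ua))
```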
Common Crawl Foundation (@CommonCrawl):
April 2026 Crawl Announcement; April 2026 Web Graph Announcement; Crawl Statistics; Web Graph Statistics. Live long and prosper!
Common Crawl Foundation (@CommonCrawl):
Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
Common Crawl Foundation retweeted
Financial Times (@FT):
Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion
Common Crawl Foundation (@CommonCrawl):
March 2026 Crawl Archive Now Available
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.
Common Crawl Foundation (@CommonCrawl):
IPv6 Adoption Across the Top 100K Web Hosts
We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.
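The post does not describe how the probe was run, so the following is only a rough sketch of one simple proxy: checking whether a host name resolves to any IPv6 address and computing the share of hosts that do. The function names `ipv6_addresses` and `ipv6_share` are hypothetical. Resolving to an IPv6 address is weaker than being "fully reachable" over IPv6, so this would not reproduce the 36.9% figure.

```python
import socket
from typing import Callable, Iterable, List


def ipv6_addresses(host: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the IPv6 addresses reported for host (empty list if none).

    Assumption: an address lookup restricted to AF_INET6 is used as a
    stand-in for IPv6 support; no connection is actually attempted.
    """
    try:
        infos = resolver(host, 443, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed or no IPv6 address is published for this name.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv6 the sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})


def ipv6_share(hosts: Iterable[str], resolver: Callable = socket.getaddrinfo) -> float:
    """Return the fraction of hosts with at least one IPv6 address."""
    host_list = list(hosts)
    if not host_list:
        return 0.0
    with_v6 = sum(1 for h in host_list if ipv6_addresses(h, resolver))
    return with_v6 / len(host_list)
```

The `resolver` parameter exists so the logic can be exercised offline with a stub; called without it, the functions perform real DNS lookups through the system resolver.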