Common Crawl Foundation

1.4K posts

@CommonCrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA · Joined February 2010
1.6K Following · 7.8K Followers
Common Crawl Foundation @CommonCrawl
You can now build directly on Common Crawl from the browser
Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.
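The post does not say how a browser reads the data. One common way to read a single capture out of a large archive file is an HTTP range request, so the sketch below (in Python for brevity) shows only the byte-range arithmetic. The `IndexRecord` type, the `range_request` helper, the example file path and the use of `https://data.commoncrawl.org/` as the base URL are assumptions for illustration, not something the post confirms.

```python
from dataclasses import dataclass

# Assumption: archive files are served under this base URL.
DATA_BASE_URL = "https://data.commoncrawl.org/"


@dataclass(frozen=True)
class IndexRecord:
    """Hypothetical index entry locating one capture inside an archive file."""
    filename: str  # path of the archive file relative to the base URL
    offset: int    # byte position where the capture starts
    length: int    # number of bytes the capture occupies


def range_request(record: IndexRecord) -> tuple[str, dict[str, str]]:
    """Build the URL and Range header that select exactly one capture."""
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    last_byte = record.offset + record.length - 1
    url = DATA_BASE_URL + record.filename
    headers = {"Range": f"bytes={record.offset}-{last_byte}"}
    return url, headers


# Made-up record, used only to show the arithmetic.
example = IndexRecord(filename="crawl-data/EXAMPLE/example.warc.gz", offset=1000, length=500)
url, headers = range_request(example)
print(url)
print(headers["Range"])
```

The same `Range` header value could be passed to `fetch` in a static page. Whether a given endpoint accepts such requests from a browser is not stated in the post.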
Common Crawl Foundation reposted
Tristan Rhodes @tristanbob
Have you ever seen a user agent named "CCBOT"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10+ petabyte open-source dataset. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has access to the data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times, once for each research project, which saves large amounts of bandwidth and resources.
Common Crawl Foundation @CommonCrawl
Links: April 2026 Crawl Announcement, April 2026 Web Graph Announcement, Crawl Statistics, Web Graph Statistics. Live long and prosper!
Common Crawl Foundation @CommonCrawl
Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
Common Crawl Foundation reposted
Financial Times @FT
Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion
Common Crawl Foundation @CommonCrawl
March 2026 Crawl Archive Now Available
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by Happy Eyeballs being enabled in the OkHttp library.
Common Crawl Foundation @CommonCrawl
IPv6 Adoption Across the Top 100K Web Hosts
We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.
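As a rough illustration of what such a probe might involve, the sketch below checks whether a host name resolves to at least one IPv6 address and computes an adoption percentage from a set of results. It is a simplification under stated assumptions: the post does not describe the probing method, resolving an address is a weaker condition than being "fully reachable", and `resolves_over_ipv6` and `adoption_percent` are hypothetical helper names.

```python
import socket


def resolves_over_ipv6(host: str, port: int = 443) -> bool:
    """Return True if name resolution yields at least one IPv6 address.

    Assumption: this is only a DNS-level check. It does not open a
    connection, so it cannot confirm that the host is reachable.
    """
    try:
        infos = socket.getaddrinfo(
            host, port, family=socket.AF_INET6, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Resolution failed or no IPv6 address is published for this name.
        return False
    return len(infos) > 0


def adoption_percent(results: dict[str, bool]) -> float:
    """Percentage of probed hosts whose result is True."""
    if not results:
        return 0.0
    supported = sum(1 for ok in results.values() if ok)
    return 100.0 * supported / len(results)


# Made-up results for four placeholder hosts; no network access is needed here.
sample = {"a.example": True, "b.example": False, "c.example": False, "d.example": True}
print(f"{adoption_percent(sample):.1f}% of sample hosts resolve over IPv6")
```

To probe real hosts, one could build the results dictionary with `{host: resolves_over_ipv6(host) for host in hosts}`, where `hosts` is whatever ranked list is available.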
Common Crawl Foundation @CommonCrawl
We've never had an entire city ask to be deleted before...
Common Crawl Foundation reposted
Wayne Yamamoto @kazabyte
I've been dabbling with Claude Code. Using English as a programming language, I wrote a C compiler. medium.com/@kazabyte/engl…