Common Crawl Foundation
1.4K posts

@CommonCrawl
Common Crawl is a non-profit foundation dedicated to the Open Web.
San Francisco, CA · Joined February 2010
1.6K Following · 7.8K Followers
Common Crawl Foundation reposted

Have you ever seen a user agent named "CCBot"?
If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes an open dataset of more than 10 petabytes.
I think it's beautiful that humanity shares this data.
It means that anyone with minimal resources has access to the data required to build their own AI models.
It also means we don't have to crawl the entire internet thousands of times, once for each research project, which saves large amounts of bandwidth and resources.

Sorry, now with the actual links.
commoncrawl.org/blog/april-202…
blog.commoncrawl.org/blog/host--and…
commoncrawl.github.io/cc-crawl-stati…
commoncrawl.github.io/cc-webgraph-st…

Our April 2026 Crawl Archive and corresponding Web Graph are now available.
The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
Common Crawl Foundation reposted

Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion

March 2026 Crawl Archive Now Available
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.


Blocking the Internet Archive Won’t Stop AI, But It Will Erase the Web’s Historical Record
eff.org/deeplinks/2026…
@eff

The Common Crawl Foundation would like to share an updated overview of our organization as of March 2026.
Please let us know if we can be helpful.
drive.google.com/file/d/1ww2R0x…


