Common Crawl Foundation

1.4K posts

@CommonCrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA · Joined February 2010
1.6K Following · 7.8K Followers
Common Crawl Foundation retweeted
Financial Times
Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion
Common Crawl Foundation @CommonCrawl
March 2026 Crawl Archive Now Available
We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.
Common Crawl Foundation @CommonCrawl
IPv6 Adoption Across the Top 100K Web Hosts
We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.
Common Crawl Foundation retweeted
Wayne Yamamoto @kazabyte
I've been dabbling with Claude Code. Using English as a programming language, I wrote a C compiler. medium.com/@kazabyte/engl…
Common Crawl Foundation @CommonCrawl
github.com/commoncrawl/cc…
Summary of changes
This PR contains a substantial redesign and refactoring, covering the following:
- Interactive charts instead of static images
- New domain lookup tool for plotting HC and PR over time (including comparison of two different domains)
- Combine the avgoutdegree and avgindegree plots into avgdegree (closes #4, "Merge plots avgoutdegree and avgindegree into avgdegree")
- Add appropriate reference links for harmonic centrality (closes #5, "Add links to research papers in the section explaining harmonic centrality")
- Make the masthead image disappear more quickly via parallax scrolling so that content is reached faster
- Substantial mobile/responsive UX improvements
- Improve the rank tables UX by unifying them into one table
- Proper links and sanitization of anchor tags with rel= and target=
Common Crawl Foundation @CommonCrawl
February 2026 Crawl Archive Now Available
We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.
Common Crawl Foundation @CommonCrawl
Introducing the New Examples & Resources Browser
We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.
Common Crawl Foundation @CommonCrawl
AI Plumbers at FOSDEM’26
Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference and the 10,000 people at the main event couldn't be bigger.
Common Crawl Foundation retweeted
EleutherAI @AiEleuther
Announcing our latest paper:
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
In collaboration with @CommonCrawl, @MLCommons, and @JohnsHopkins, we worked with 80+ native speaker annotators to build a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.