Ben Wills

328 posts

@benwills

Downloading the internet, so you don't have to.

Boulder, CO · Joined March 2008

185 Following · 2.4K Followers

Pinned Tweet

Ben Wills@benwills·3 Feb

After 9 days with the Yandex code, here's what I've found that's relevant for SEOs. First, huge credit to @iPullRank and @RyanJones. Those are a couple of smart MFers. Seeing our different strengths and perspectives and pushing each other through the code has been great.

Ben Wills@benwills·1d

I spent the last 3 weeks running what might be the most comprehensive LLM ranking factors analysis to date.

29,562 unique domains tracked and scored across 145 industries, 1,595 buyer personas, and 105k+ ChatGPT prompts. Over 500TB of data, and 12 external signals correlated against rank-weighted LLM recommendation scores.

This is a large-scale correlation study: what external signals actually predict whether a brand gets recommended by ChatGPT, across 145 industries and 1,595 buyer personas.

-- Research Process

145 industries from 500 candidates. 11 personas each (10 targeted + 1 neutral). 25 runs per persona, rank-weighted scoring. 29,562 unique domains tracked.

Data collected:
- Common Crawl: 1.15B pages, domain mentions + phrase co-occurrences
- Reddit: 5B+ posts and comments scanned
- Google Search: 15,697 queries, top 100 results; 1.5M+ results captured
- SERP HTML: parsed for outbound links and phrase presence
- Wikimedia: 300M+ Wikidata entities + Wikipedia citations
- Backlinks (Common Crawl Web Graph): PageRank + Harmonic Centrality; 4B+
- Top Site Homepages: parsed for persona-specific phrases

-- Analysis Process

13 signals per domain. Spearman ρ vs. LLM recommendation score, per-industry and globally. R² shows variance explained. Lift measures over-representation in the top 10% most-recommended domains. Tiered: Dominant (ρ ≥ 0.30) down to Baseline (< 0.05).

-- Key Findings

SERP appearances, SERP rank, and outbound links from search results pages are the three strongest signals. Traditional SEO is the dominant measurable influence on LLM recommendations. Backlink authority (PageRank, Harmonic Centrality) follows. Combined, these point to one thing: established search authority drives LLM visibility.

Signal hierarchies vary by industry. Wikidata dominates in established categories (hotels, ERP, furniture). Reddit drives community-driven ones (enterprise AI, live entertainment). No universal strategy.

80–85% of recommendation variance is inside the model. All external signals combined explain under 20%. You cannot infer LLM visibility from search rankings; you have to test it directly.

-- The Two Conclusions That Matter

1. SEO is the foundation. OpenAI is using search data today and building their own index. As that matures, the connection between search authority and LLM visibility deepens. Traditional SEO principles are not obsolete, they're the starting point for LLM visibility too.

2. Persona is the measurement unit. The #1 airline for a frequent flyer is a different site from the #1 for a student flying abroad. Same model, same industry, different person, different result. You don't have one LLM rank, you have a rank per buyer segment. Monitor by persona or the number is meaningless.

-- Full Report and data for all 145 industries and 1,595 personas available here: oppalerts.com/LLM-Ranking-Fa…
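
For readers who want to see what the two headline statistics in this post measure, here is a minimal Python sketch of Spearman ρ and a top-decile lift. It is an illustration under assumptions, not the author's pipeline: the function names are hypothetical, the lift formula (signal rate among the top 10% divided by the overall rate) is one plausible reading of "over-representation", and squaring ρ is only a rough single-signal stand-in for the R² figures quoted above.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman_rho(signal, score):
    """Spearman rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(signal), average_ranks(score))


def top_decile_lift(has_signal, score):
    """ASSUMED definition: signal rate in the top 10% by score / overall rate."""
    n = len(score)
    k = max(1, n // 10)
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    top_rate = sum(has_signal[i] for i in top) / k
    base_rate = sum(has_signal) / n
    return top_rate / base_rate


# Toy data (made up): one signal value and one recommendation score per domain.
serp_appearances = [12, 3, 7, 0, 9]
llm_score = [0.9, 0.2, 0.5, 0.1, 0.4]
rho = spearman_rho(serp_appearances, llm_score)
rho_squared = rho ** 2  # rough share of rank variance tied to this one signal
```

A lift above 1 under this definition would mean the signal is more common among the most-recommended domains than in the full set.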

Ben Wills@benwills·2d

@scott_stouffer Yeah...it can be a bit obsessive at times, lol.

Scott Stouffer@scott_stouffer·4d

@benwills Hi, my name is Scott and I've been a binge coder for 30 years. I don't know how to help you and I don't know if there is a cure

Ben Wills@benwills·6d

don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off...

Ben Wills@benwills·2d

Sneak peek at tomorrow's LLM Ranking Factors report... 145 industries, 1,595 personas, 1.1B+ web pages, 5B+ Reddit posts, 15k+ Google searches, 300M+ Wikimedia entities, 4B+ backlinks...

Ben Wills@benwills·4d

Uber blew their 2026 AI budget in 4 months. I wonder how much of a competitive advantage frugality will become as AI costs increase.

Ben Wills@benwills·5d

For scraping, what are you doing for proxies?

Ben Wills@benwills·5d

A hard lesson coming very soon for folks creating their own tools: "In the beginning you always want results. In the end all you want is control." - @EskilSteenberg - How I Program C: youtube.com/watch?v=443UNe…

Ben Wills@benwills·6d

What are the Ranking Factors for ChatGPT recommendations?

To find out, I analyzed 104,756 ChatGPT prompts across 145 industries and 1,595 personas. I then analyzed 1.1B+ web pages, 5B+ Reddit posts, 15k+ Google searches, 300M+ Wikimedia entities, and 4B+ backlinks...

Full report is coming on Tuesday, May 12.

---- Overall Stats

Industries Analyzed:
- Industries Considered: 500
- Industries Chosen: 145
- Personas Per Industry: 11
- Industry+Persona Combos: 1,595

ChatGPT Prompts/Recommendations:
- Total LLM Prompts: 104,756
- Unique Site Recommendations: 97,620
- Unique Sites Recommended: 29,562

Google Searches:
- Total Searches: 15,697
- Unique URLs in Results: 833,458
- Unique Hosts in Results: 250,899

Common Crawl Web Pages:
- Web Pages Analyzed: 1,100,000,000+

Reddit:
- Submissions: 627,615,021
- Comments: 4,865,389,844

Wikimedia:
- Wikidata Entities: 120,100,660
- Wikipedia External Links: 187,811,877
- Wikipedia Pages: 25,513,338

Common Crawl Web Graph Link Data:
- Web Graph Domains: 120,000,000+
- Web Graph Domain Links: 4,400,000,000+

Ben Wills@benwills·6 May

In case you're unaware of the scale of the non-social internet, from Jan 2025 through Mar 2026, Common Crawl found:
- 3,211,344,604 RSS URLs
- 829,951,007 Atom URLs
- 111,426,718 XML URLs

(I checked)

Ben Wills@benwills·4 May

I downloaded 730,014 URLs from the results of 15k+ Google searches. 12,559 (~1.7%) were 404, 5xx, or another error. Remember to fix your site errors. #SEO #AIO #AEO #GEO
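
The percentage in this tweet is simple to reproduce, and the same tally works for your own crawl logs. The sketch below is an assumption about how one might count errors, not the author's code: `is_error` and `error_rate` are hypothetical names, a failed request is represented as `None`, and "another error" is read as any status of 400 or above.

```python
def is_error(status):
    """ASSUMED rule: a failed fetch (None) or any HTTP status >= 400."""
    return status is None or status >= 400


def error_rate(statuses):
    """Fraction of fetched URLs that ended in an error."""
    errors = sum(1 for s in statuses if is_error(s))
    return errors / len(statuses)


# Reproduce the figure from the tweet: 12,559 errors out of 730,014 URLs.
tweet_rate_percent = round(12_559 / 730_014 * 100, 1)

# Toy crawl log (made up): status codes, with None for a connection failure.
sample = [200, 404, 503, None, 301, 200, 200, 200, 200, 200]
sample_rate = error_rate(sample)
```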

Ben Wills@benwills·1 May

The problem with writing a lot more code in C++ again is that I want to do everything in it, including rewriting my web application frontend. And I’m not saying it’s a bad idea. But I can’t say it’s a good idea…

Ben Wills@benwills·1 May

@CouchGuyPat Thanks. The HTML is from the Common Crawl dataset, so it’s just parsing and not making a ton of HTTP requests. I forget exactly, but I think CC data is just the raw response.

Patrick Schofield@CouchGuyPat·1 May

@benwills Are you parsing the post-JS rendered DOM or the raw response only? Super interesting work you’ve been doing.

Ben Wills@benwills·1 May

Just passed 500 million HTML docs parsed with my newest C++ parser, with no errors, running on 5 servers, in less than 24 hours. That's a nice milestone. Will be crossing 1 billion tonight. Really looking forward to all this data and what it can tell us about how ChatGPT rankings are influenced.

Ben Wills@benwills·1 May

Gemini 3.1 Pro in Antigravity is _way_ too eager to delete files. That's uncomfortable...

Ben Wills@benwills·1 May

@RyanJones If you had labeled commodity and non-commodity content for a topic, I’d imagine you might be able to… - convert them all to vectors - drop into a vector db that uses HNSW - convert new/test content to vectors - identify new/test vectors furthest from the commodity vectors.
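
The steps in this reply can be sketched in a few lines of Python. This is a toy illustration under assumptions: the embedding step is taken as already done (the 3-dimensional vectors are made up), a brute-force nearest-neighbor scan stands in for the HNSW index a vector database would provide, and the function names are hypothetical. "Furthest from the commodity vectors" is read here as the largest cosine distance to the nearest commodity vector.

```python
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def novelty_scores(test_vectors, commodity_vectors):
    """Distance from each test vector to its NEAREST commodity vector.

    Brute force over all pairs; an HNSW index would approximate the same
    nearest-neighbor lookup much faster on large collections.
    """
    return {
        name: min(cosine_distance(vec, c) for c in commodity_vectors)
        for name, vec in test_vectors.items()
    }


def rank_by_novelty(test_vectors, commodity_vectors):
    """Test items ordered from furthest to closest to commodity content."""
    scores = novelty_scores(test_vectors, commodity_vectors)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up embeddings: two labeled commodity documents and two test documents.
commodity = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
test_docs = {"rehash": [1.0, 0.05, 0.0], "original": [0.0, 0.0, 1.0]}
ranking = rank_by_novelty(test_docs, commodity)
```

Scoring against the nearest commodity vector rather than the average keeps a test document from looking novel just because the commodity set is spread out.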

Ryan Jones@RyanJones·30 Apr

great now I have to think of a way to algorithmically detect commodity and non-commodity content in python. I'm thinking a mix of Jaccard similarity and shannon entropy. thoughts?
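
The two measures named in this tweet are both short to write in Python. The sketch below only shows what each one computes; how to combine them into a commodity detector is left open, as it is in the tweet. Word bigram shingles and whitespace tokenization are assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def shingles(text, k=2):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shannon_entropy(text):
    """Entropy in bits of the word-frequency distribution of the text."""
    words = text.lower().split()
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Toy comparison (made-up sentences): high overlap suggests near-duplicate text.
doc_a = "the best running shoes for beginners"
doc_b = "the best running shoes for marathon training"
overlap = jaccard(shingles(doc_a), shingles(doc_b))
```

Jaccard compares a document against other documents, while entropy describes a single document on its own, so the two answer different questions.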

Ben Wills@benwills·30 Apr

1.1 billion web pages now being analyzed for relevance, along with Reddit, Wikipedia, SERPs, & backlink profiles for thousands of sites. Really curious to see the data as it all relates to reverse-engineering ChatGPT suggestions across 145 industries. Results next week.

Ben Wills@benwills·30 Apr

I can honestly say I didn't expect this: Google's Q1 report just released. Search revenue grew 19%, faster than the YoY this time last year. Looks like AI hasn't cut into ad revenue yet... abc.xyz/investor/news/…
