Ben Wills

328 posts

@benwills

Downloading the internet, so you don't have to.

Boulder, CO · Joined March 2008
185 Following · 2.4K Followers

Pinned Tweet
Ben Wills @benwills
After 9 days with the Yandex code, here's what I've found that's relevant for SEOs. First, huge credit to @iPullRank and @RyanJones. Those are a couple of smart MFers. Seeing our different strengths and perspectives and pushing each other through the code has been great.

Ben Wills @benwills
I spent the last 3 weeks running what might be the most comprehensive LLM ranking factors analysis to date. 29,562 unique domains tracked and scored across 145 industries, 1,595 buyer personas, and 105k+ ChatGPT prompts. Over 500TB of data, and 12 external signals correlated against rank-weighted LLM recommendation scores.

This is a large-scale correlation study: what external signals actually predict whether a brand gets recommended by ChatGPT, across 145 industries and 1,595 buyer personas.

-- Research Process

145 industries from 500 candidates. 11 personas each (10 targeted + 1 neutral). 25 runs per persona, rank-weighted scoring. 29,562 unique domains tracked.

Data collected:
- Common Crawl: 1.15B pages, domain mentions + phrase co-occurrences
- Reddit: 5B+ posts and comments scanned
- Google Search: 15,697 queries, top 100 results; 1.5M+ results captured
- SERP HTML: parsed for outbound links and phrase presence
- Wikimedia: 300M+ Wikidata entities + Wikipedia citations
- Backlinks (Common Crawl Web Graph): PageRank + Harmonic Centrality; 4B+
- Top Site Homepages: parsed for persona-specific phrases

-- Analysis Process

13 signals per domain. Spearman ρ vs. LLM recommendation score, per-industry and globally. R² shows variance explained. Lift measures over-representation in the top 10% most-recommended domains. Tiered: Dominant (ρ ≥ 0.30) down to Baseline (< 0.05).

-- Key Findings

SERP appearances, SERP rank, and outbound links from search results pages are the three strongest signals. Traditional SEO is the dominant measurable influence on LLM recommendations. Backlink authority (PageRank, Harmonic Centrality) follows. Combined, these point to one thing: established search authority drives LLM visibility.

Signal hierarchies vary by industry. Wikidata dominates in established categories (hotels, ERP, furniture). Reddit drives community-driven ones (enterprise AI, live entertainment). No universal strategy.

80–85% of recommendation variance is inside the model. All external signals combined explain under 20%. You cannot infer LLM visibility from search rankings; you have to test it directly.

-- The Two Conclusions That Matter

1. SEO is the foundation. OpenAI is using search data today and building their own index. As that matures, the connection between search authority and LLM visibility deepens. Traditional SEO principles are not obsolete, they're the starting point for LLM visibility too.
2. Persona is the measurement unit. The #1 airline for a frequent flyer is a different site from the #1 for a student flying abroad. Same model, same industry, different person, different result. You don't have one LLM rank, you have a rank per buyer segment. Monitor by persona or the number is meaningless.

-- Full Report and data for all 145 industries and 1,595 personas available here: oppalerts.com/LLM-Ranking-Fa…
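
A minimal sketch of the three statistics named under Analysis Process, using only the Python standard library. Several things here are assumptions and not confirmed by the post: that R² is taken as the square of Spearman ρ, that lift is computed for a yes/no signal as its rate in the top 10% of domains divided by its rate overall, and all function names and toy numbers, which are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(signal, score):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(signal), average_ranks(score))


def top_decile_lift(has_signal, score):
    """Rate of a 0/1 signal among the top 10% by score, divided by its overall rate.

    This definition of lift is an assumption; the post does not spell it out.
    """
    n = len(score)
    top_n = max(1, n // 10)
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:top_n]
    top_rate = sum(has_signal[i] for i in top) / top_n
    base_rate = sum(has_signal) / n
    return top_rate / base_rate


# Hypothetical toy data: one signal value and one LLM score per domain.
serp_appearances = [12, 0, 5, 30, 2, 8, 0, 1, 19, 3]
llm_score = [0.8, 0.1, 0.3, 0.9, 0.2, 0.5, 0.05, 0.15, 0.7, 0.25]

rho = spearman_rho(serp_appearances, llm_score)
r_squared = rho ** 2  # assumption: variance explained taken as rho squared
lift = top_decile_lift([1, 0, 1, 1, 0, 1, 0, 0, 1, 0], llm_score)
```

Under this reading, the Dominant tier from the post would be any signal where `rho >= 0.30`; only that threshold and the Baseline one (< 0.05) are given in the post.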

Scott Stouffer @scott_stouffer
@benwills Hi, my name is Scott and I've been a binge coder for 30 years. I don't know how to help you and I don't know if there is a cure

Ben Wills @benwills
don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off... don't write a c++ web frontend on my day off...

Ben Wills @benwills
Sneak peek at tomorrow's LLM Ranking Factors report... 145 industries, 1,595 personas, 1.1B+ web pages, 5B+ Reddit posts, 15k+ Google searches, 300M+ Wikimedia entities, 4B+ backlinks...

Ben Wills @benwills
Uber blew their 2026 AI budget in 4 months. I wonder how much of a competitive advantage frugality will become as AI costs increase.

Ben Wills @benwills
For scraping, what are you doing for proxies?

Ben Wills @benwills
What are the Ranking Factors for ChatGPT recommendations? To find out, I analyzed 104,756 ChatGPT prompts across 145 industries and 1,595 personas. I then analyzed 1.1B+ web pages, 5B+ Reddit posts, 15k+ Google searches, 300M+ Wikimedia entities, and 4B+ backlinks... Full report is coming on Tuesday, May 12.

----

Overall Stats

Industries Analyzed:
- Industries Considered: 500
- Industries Chosen: 145
- Personas Per Industry: 11
- Industry+Persona Combos: 1,595

ChatGPT Prompts/Recommendations:
- Total LLM Prompts: 104,756
- Unique Site Recommendations: 97,620
- Unique Sites Recommended: 29,562

Google Searches:
- Total Searches: 15,697
- Unique URLs in Results: 833,458
- Unique Hosts in Results: 250,899

Common Crawl Web Pages:
- Web Pages Analyzed: 1,100,000,000+

Reddit:
- Submissions: 627,615,021
- Comments: 4,865,389,844

Wikimedia:
- Wikidata Entities: 120,100,660
- Wikipedia External Links: 187,811,877
- Wikipedia Pages: 25,513,338

Common Crawl Web Graph Link Data:
- Web Graph Domains: 120,000,000+
- Web Graph Domain Links: 4,400,000,000+

Ben Wills @benwills
In case you're unaware of the scale of the non-social internet, from Jan 2025 through Mar 2026, Common Crawl found:
3,211,344,604 RSS URLs
829,951,007 Atom URLs
111,426,718 XML URLs
(I checked)

Ben Wills @benwills
I downloaded 730,014 URLs, all taken from the results of 15k+ Google searches. 12,559 (~1.7%) were 404, 5xx, or another error. Remember to fix your site errors. #SEO #AIO #AEO #GEO
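
A quick check of the arithmetic in the post, plus a small sketch of how a fetch result might be counted as an error. The two totals come from the post; the helper `is_error_status` and its rule (any 4xx or 5xx status, or no response at all) are assumptions, since the post only says "404, 5xx, or another error".

```python
def is_error_status(status):
    """Hypothetical rule: no response, or any 4xx/5xx status, counts as an error."""
    return status is None or status >= 400


total_urls = 730_014   # URLs downloaded, from the post
error_urls = 12_559    # URLs that returned an error, from the post

error_rate = error_urls / total_urls
print(f"{error_rate:.2%}")  # 1.72%
```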

Ben Wills @benwills
The problem with writing a lot more code in C++ again is that I want to do everything in it, including rewriting my web application frontend. And I’m not saying it’s a bad idea. But I can’t say it’s a good idea…

Ben Wills @benwills
@CouchGuyPat Thanks. The HTML is from the Common Crawl dataset, so it’s just parsing and not making a ton of HTTP requests. I forget exactly, but I think CC data is just the raw response.

Patrick Schofield @CouchGuyPat
@benwills Are you parsing the post-JS rendered DOM or the raw response only? Super interesting work you’ve been doing.

Ben Wills @benwills
Just passed 500 million HTML docs parsed with my newest C++ parser, with no errors, running on 5 servers, in less than 24 hours. That's a nice milestone. Will be crossing 1 billion tonight. Really looking forward to all this data and what it can tell us about how ChatGPT rankings are influenced.

Ben Wills @benwills
Gemini 3.1 Pro in Antigravity is _way_ too eager to delete files. That's uncomfortable...

Ben Wills @benwills
@RyanJones If you had labeled commodity and non-commodity content for a topic, I’d imagine you might be able to…
- convert them all to vectors
- drop into a vector db that uses HNSW
- convert new/test content to vectors
- identify new/test vectors furthest from the commodity vectors.
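
The steps in this reply can be sketched roughly as follows. This is a toy version under loud assumptions: a bag-of-words count vector stands in for a real embedding model, a brute-force scan stands in for an HNSW index (which would only matter at scale), and "furthest from the commodity vectors" is read as "largest cosine distance to the nearest commodity vector". All names and sample texts are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def distance_to_commodity(text, commodity_vectors):
    """Distance from a text to its nearest commodity vector (brute force, not HNSW)."""
    vec = embed(text)
    return min(cosine_distance(vec, c) for c in commodity_vectors)


def rank_by_novelty(candidates, commodity_texts):
    """Sort candidate texts so the ones furthest from commodity content come first."""
    commodity_vectors = [embed(t) for t in commodity_texts]
    scored = [(distance_to_commodity(t, commodity_vectors), t) for t in candidates]
    return sorted(scored, reverse=True)


# Hypothetical labeled commodity content and new/test content.
commodity = [
    "best running shoes for beginners",
    "top running shoes for beginners this year",
]
new_content = [
    "best running shoes for beginners",
    "we measured midsole foam decay over 800 km",
]
ranked = rank_by_novelty(new_content, commodity)
```

With a real embedding model and an approximate-nearest-neighbour index, only `embed` and the `min(...)` scan would change; the ranking step stays the same.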

Ryan Jones @RyanJones
great now I have to think of a way to algorithmically detect commodity and non-commodity content in python. I'm thinking a mix of Jaccard similarity and shannon entropy. thoughts?
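
For reference, a minimal sketch of the two measures mentioned here, on whitespace-split word tokens. How they would be mixed into a single commodity score is not stated in the thread, so the sketch only computes each one; the tokenization and function names are assumptions.

```python
from collections import Counter
from math import log2


def tokens(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def shannon_entropy(text):
    """Entropy in bits of the word-frequency distribution of a text."""
    counts = Counter(tokens(text))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


overlap = jaccard_similarity("a b c", "b c d")   # 2 shared words out of 4 distinct
spread = shannon_entropy("a b c d")              # four equally likely words
```

High Jaccard similarity to known commodity pages suggests overlapping vocabulary, while low entropy suggests repetitive wording; whether either tracks "commodity" content well is an open question in the thread.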

Ben Wills @benwills
1.1 billion web pages now being analyzed for relevance, along with Reddit, Wikipedia, SERPs, & backlink profiles for thousands of sites. Really curious to see the data as it all relates to reverse-engineering ChatGPT suggestions across 145 industries. Results next week.

Ben Wills @benwills
I can honestly say I didn't expect this: Google's Q1 report just released. Search revenue grew 19%, faster than the YoY this time last year. Looks like AI hasn't cut into ad revenue yet... abc.xyz/investor/news/…