Chao Wang

8K posts

@excel_wang

Associate Professor in health and social care statistics at Kingston University. PhD in econometrics.

London, England · Joined May 2014

405 Following · 1.6K Followers

Pinned Tweet

Chao Wang@excel_wang·Nov 25

Great that the data & code for the Bangladesh mask RCT have been released: gitlab.com/emily-crawford… I tried to run their code, and it seems there are only very small differences from what was reported in the paper.

187

Chao Wang@excel_wang·11h

@EpochAIResearch The top four models likely don’t have any statistically significant difference, given the substantial overlap in their confidence intervals.

Epoch AI@EpochAIResearch·4d

GPT-5.5 Pro achieves a new high score of 159 on the Epoch Capabilities Index! ECI is our statistical tool that combines multiple benchmarks into a unified scale.

784

145K

Chao Wang@excel_wang·13h

On the other hand, @EpochAIResearch's "capability" seems more promising. Here is the technical paper: arxiv.org/abs/2512.00193. I haven't fully read the paper yet, but it says it uses a method that is similar to an IRT model.
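For readers unfamiliar with IRT (item response theory), here is a minimal sketch of the textbook two-parameter logistic (2PL) model. It is an illustration only: the post says the paper's method is similar to IRT, so the function name `p_correct_2pl` and the parameters below are assumptions, not the Epoch AI implementation.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Textbook 2PL IRT: probability that a test-taker with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values: two "models" of different ability on one benchmark item.
weaker = p_correct_2pl(theta=0.0, a=1.0, b=0.5)
stronger = p_correct_2pl(theta=1.5, a=1.0, b=0.5)
print(f"weaker: {weaker:.2f}, stronger: {stronger:.2f}")
```

In this kind of model, item difficulties and abilities are estimated from the data rather than fixed in advance, which is the contrast the thread draws with predetermined category weights.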

Chao Wang@excel_wang·13h

I see some people citing the index from @ArtificialAnlys. The problem with the AA index is that it assigns arbitrary weights to different benchmark categories (artificialanalysis.ai/methodology/in…). Why so much weight on coding and agents? It’s a complete distortion of reality!

Séb Krier@sebkrier

DeepSeek V4’s capability lags behind leading U.S. models by about 8 months. nist.gov/news-events/ne…

Chao Wang retweeted

Lisan al Gaib@scaling01·18h

x.com/i/article/2050…

215

37.8K

Chao Wang@excel_wang·17h

@thdxr Unlike the IRT model, the weights used in calculating the AA Intelligence Index are quite arbitrary (4 big categories each 25%; sub categories given predetermined ratios). I know which one to trust more.

283

dax@thdxr·20h

here's a chart showing them being a few months behind and catching up. modern day is amazing, you can have whatever narrative you want!

Lisan al Gaib@scaling01

chinese models are ~8 months behind and are falling further behind

1.2K

83K

Chao Wang retweeted

max tempers@maxtempers·1d

The largest supermarket in Britain, that operates on razor-thin margins, is about to be crushed for the crime of paying different jobs different salaries, while our legislature shrugs. How dare they suggest that “so-called market rates” can exist in Soviet Britain.

Financial Times@FT

Tesco argues equal pay claim disregards ‘economic reality’ ft.trib.al/k5n8n5E

187

631

5.8K

939.2K

Chao Wang@excel_wang·23h

It’s interesting that IRT was adopted to estimate model capabilities.

Séb Krier@sebkrier

DeepSeek V4’s capability lags behind leading U.S. models by about 8 months. nist.gov/news-events/ne…

Chao Wang@excel_wang·1d

@uwunetes Did you even see the figure??? It is more intelligent but cheaper than DeepSeek Pro. Why would I use DeepSeek over Grok? You have a supercomputer and want to run this big model locally?

addison@uwunetes·2d

xai is the most unserious US lab lmao why would u ever release this? its a closed source model worse than open source models like why would i use this over deepseek or kimi

Artificial Analysis@ArtificialAnlys

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20.
The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.
Key Takeaways:
➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level.
➤ Large increase in real-world agentic task performance: the largest single benchmark improvement is on GDPval-AA, where Grok 4.3 scores an Elo of 1500, up 321 points from Grok 4.20 0309 v2’s score of 1179, surpassing Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5. Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17% against GPT-5.5 (xhigh) under the standard Elo formula.
➤ Grok 4.3 performs strongly on instruction following and agentic customer support tasks. It gains 5 points on 𝜏²-Bench Telecom to reach 98%, in line with GLM-5.1. Grok 4.3 maintains an 81% IFBench score from Grok 4.20 0309 v2.
➤ Gains 8 points on AA-Omniscience Accuracy, but at the cost of an 8-point lower AA-Omniscience Non-Hallucination Rate, so Grok 4.20 0309 v2 still leads AA-Omniscience Non-Hallucination Rate, followed by MiMo-V2.5-Pro, in line with Grok 4.3.
Congratulations to @xAI and @elonmusk on the impressive release!
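The ~17% expected win rate quoted in the post can be checked with the standard Elo expected-score formula. This is a minimal sketch: the function name `elo_expected_score` is an assumption for illustration, and only the 276-point gap comes from the post.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# The post states a 276-point deficit; only the rating difference matters,
# so the lower-rated side is placed at 0 and the higher-rated side at 276.
gap = 276
win_rate = elo_expected_score(0.0, float(gap))
print(f"expected win rate: {win_rate:.1%}")
```

With a 276-point deficit this comes out at roughly 17%, consistent with the figure in the post.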

238

30.8K

Chao Wang@excel_wang·1d

GPT 5.5 now available on Microsoft 365 Copilot.

Chao Wang@excel_wang·2d

@MatthewBerman What’s the practical benefit of an “open” model (just open weight, as a neural network model is a black box) for most people? Run the model locally? Run the distillation process yourself?

Matthew Berman@MatthewBerman·3d

Demis says he wants to see a Western open source AI stack and that we’re losing to China. He also says Google doesn’t have enough compute to build two frontier (open and closed) models, which is why Gemma is a smaller family of models. Watch this incredible clip. Shout out @ycombinator and @garrytan for the fantastic interview.

Matthew Berman@MatthewBerman

American open source AI is in trouble. China is eating our lunch. This is a bigger problem than people realize.

146

1.4K

293.1K

Chao Wang retweeted

Christopher Snowdon@cjsnowdon·3d

Rent controls are indeed common in Europe.

The Green Party@TheGreenParty

Labour are refusing to bring in rent controls which are common across Europe. End the affordability crisis. Introduce rent controls. Vote Green on 7th May 💚

315

2.2K

119.5K

Chao Wang@excel_wang·3d

@spicey_lemonade "yet no one uses it" 😬 x.com/EpochAIResearc…

Epoch AI@EpochAIResearch

Among people in $100,000+ households, ChatGPT, Gemini, and Copilot all have more users than Claude. So while Claude’s user base tends to be high-income, the smaller number of Claude users overall means these users are still more likely to use services other than Claude.

spicylemonade@spicey_lemonade·4d

Gemini 3.1 is in the top 3 of almost every main benchmark, yet no one uses it. I think vibecodebench, swe atlas, and AA agent index are well calibrated.

499

66.2K

Chao Wang@excel_wang·3d

@EndWokeness Rumour has it they became more open to the idea of a king after hearing about what happened to Charles I following his clash with Parliament.

End Wokeness@EndWokeness·4d

"NO KINGS" crowd greets King Charles with a standing ovation

6.1K

27.9K

166K

4.8M

Chao Wang@excel_wang·4d

@SenAshleyMoody They clapped after hearing Charles promise he would not rule America.

Senator Ashley Moody@SenAshleyMoody·4d

Why did I just watch every Democrat in Congress stand and clap for an actual King? 🤔

2.5K

2.4K

11K

392.9K

Chao Wang retweeted

Acyn@Acyn·4d

Standing ovation for this line from King Charles: The U.S. Supreme Court Historical Society has calculated that Magna Carta is cited in at least 160 Supreme Court cases since 1789, not least as the foundation of the principle that executive power is subject to checks and balances.

395

4.7K

22.2K

1.5M

Chao Wang@excel_wang·4d

@jenzhuscott Token pricing is meaningless for cross-model comparison as different models use different numbers of tokens for the same task. GPT5.5 for example uses ~40% fewer output tokens than GPT5.4. x.com/ArtificialAnly…

Artificial Analysis@ArtificialAnlys

GPT-5.5 (xhigh) uses ~40% fewer output tokens to run our Index than its predecessor
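The point about token pricing can be made concrete with a small cost calculation. This is a sketch under stated assumptions: the prices and the baseline token count are hypothetical and do not correspond to any real model; only the "~40% fewer output tokens" ratio comes from the posts above.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of a task's output tokens at a given per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd


# Hypothetical model A: baseline token usage and price.
tokens_a = 1_000_000
price_a = 10.0  # USD per million output tokens (made-up figure)

# Hypothetical model B: ~40% fewer output tokens for the same task,
# but a 50% higher per-token price (made-up figure).
tokens_b = tokens_a * 60 // 100
price_b = 15.0

cost_a = output_cost_usd(tokens_a, price_a)
cost_b = output_cost_usd(tokens_b, price_b)
print(f"model A: ${cost_a:.2f}, model B: ${cost_b:.2f}")
```

In this made-up scenario the model with the higher list price is still cheaper per task, which is why per-task cost is a better basis for comparison than per-token price.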

Jen Zhu@jenzhuscott·4d

5. Massive price disadvantages compared to 🇨🇳 competitors
6. Elon (xAI and lawsuit)
7. Microsoft stops rev sharing + indigenous efforts and platform hedging (note MSFT’s recent $5bn investment in Anthropic)
8. Disruptive startups pursuing orthogonal approaches like Ilya’s SSI
9. Compute shortfalls (if data center buildout gets delayed by input bottlenecks, regulatory hurdles, public backlash, etc.)
10. Massive burn rate + increasing competition
What did I miss? The question is not if, it’s when.

Jen Zhu@jenzhuscott·4d

A few tough facts OpenAI is facing:
1. Anthropic leads in critical coding capabilities.
2. Anthropic’s overall strengths in enterprise
3. Gemini’s consumer growth at expense of ChatGPT
4. Threat from high quality Open Source models from China 🧵

130

15.9K

Chao Wang retweeted

Yuan Yi Zhu@yuanyi_z·4d

Some people think that things will return to normal when the Democrats are in power. But this man was in Obama's State Department and he speaks of the UK with the same contempt as the foulest MAGA bro.

Richard Stengel@stengel

I've got nothing against King Charles personally—in fact, he seems like a decent and thoughtful man—but why the heck are we inviting the 77-year-old monarch of a medium-sized nation that committed something close to national suicide with Brexit to address a joint session of Congress? "Of more worth is one honest man to society," wrote Thomas Paine, "than all the crowned ruffians that ever lived." We are the nation that threw off a crowned ruffian and ended hereditary privilege to create a republic where the people rule. Happy 250th Birthday America.

141

1.3K

86.5K

Chao Wang retweeted

Rod Mason@Rod__Mason·5d

When @trussliz was PM, crisis level was apparently 4%. 🤷‍♂️

Josh Hunt@iAmJoshHunt

UK 10y gilt yields back above 5%. At what point do we hit a crisis level?

322

1.6K

45K

Chao Wang retweeted

Terence Shen@Terenceshen·5d

Is Mark Zuckerberg the most desperate tech tycoon in the world? Learned Mandarin, jogged through Tiananmen Square, read Xi Jinping's book, even asked Xi to name his baby, got rejected. Hosted Chinese officials at Facebook, tried to re-enter China, got rejected. Built China-friendly censorship tools, tested a China-only app, got rejected. Now trying to buy a Singapore AI firm founded by some Chinese… still getting rejected.

368

47.6K

Chao Wang retweeted

Denise Wu@denisewu·5d

I feel sorry for Chinese engineers who realize their intellectual property belongs to the state, not to them, after the Manus order. 🥲

171

393

23.7K
