Hanze Dong

332 posts

@hendrydong

Research @MSFTResearch. RL Science | Generative Models | Autonomous Systems

Joined June 2011

517 Following, 708 Followers

Pinned Tweet

Hanze Dong@hendrydong·26 Feb

SFT curates responses. RL curates sampling. RL improves by curating what the model experiences: condition; distribution; weighting of what gets learned from. Better signal curation shifts the performance-compute curve upward. Full write-up below 👇 hendrydong.github.io/blogs/pages/rl…

Hanze Dong retweeted

Jiayi Weng@Trinkle23897·8 May

Codex grew programmatic policies with no neural nets: max score on Breakout, and SOTA-level scores on MuJoCo. Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm. trinkle23897.github.io/learning-beyon…

Hanze Dong retweeted

Jiang Bian@jbian22·15 Apr

Environment Scaling Is Real. Counting Environments Is Not. open.substack.com/pub/jiangbian/…

Hanze Dong@hendrydong·15 Apr

🤔

Leeham@Liam06972452

GPT-5.4 Pro solves Erdős Problem #1196! Very pleased with this result; definitely my favourite thus far! This problem has been thought about for some time which makes this reasonably impressive and meaningful (see Lichtman's comments below). Formalisation is underway!

Hanze Dong retweeted

Shibo Hao@Ber18791531·14 Apr

🍫 CocoaBench v1.0 is out! CocoaBench is a benchmark for unified digital agents, built around open-world tasks that require composing 💻 coding, 👀 vision, 🌐 search. Since our first research preview last December, we have expanded the benchmark substantially with community contributed tasks, and spent months testing and refining the tasks, evaluations, and agent runs. Some takeaways: • Even the best agent system reaches only 45.1% on CocoaBench v1.0. • Coding agents like Codex are already surprisingly strong on general tasks beyond software engineering. • Stronger agents tend to push more of the work into code. • Open source models still lag behind leading frontier models on these general tasks. 👇More on the website and in the paper #AI #Agents #LLM #Benchmark #CocoaBench

Shibo Hao@Ber18791531

🍫 CocoaBench is calling for contributions from the community! Join us and help shape how next-generation agents are evaluated and built🚀✨ #LLM #AI #Agent #CocoaBench More details in the threads 👇

Hanze Dong@hendrydong·10 Apr

🧐

clem 🤗@ClementDelangue

"But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug." aisle.com/blog/ai-cybers…

Hanze Dong@hendrydong·6 Apr

@kellypeilinchan When the test samples are new, this should be called in-domain generalization.

mrkelly@kellypeilinchan·6 Apr

@hendrydong I learned this the hard way. Contamination resistance doesn't prove understanding, it just forces more sophisticated overfitting.

Hanze Dong@hendrydong·5 Apr

Between theorem recognition and theorem proving lies theorem understanding. We introduce LiveMathematicianBench: a live, contamination-resistant testbed for research-level mathematical reasoning, built from post-cutoff arXiv theorems. It probes a capability that existing benchmarks rarely isolate: whether models can understand theorem statements, track delicate assumptions, reason over logical structure, and leverage proof-level guidance. livemathematicianbench.github.io

Hanze Dong@hendrydong·6 Apr

@analisereal We will test more models

Análise Real@analisereal·5 Apr

@hendrydong Will you test Gemini Deep Think, ChatGPT 5.4 Pro, Grok Expert, Grok Heavy and Claude Opus 4.6?

Hanze Dong@hendrydong·5 Apr

Huge thanks to my collaborators, @LinyangNeuroAI Qiyao, @baohao_liao, Xinxing, @micahgoldblum, @jbian22, @NimaMesgarani

Hanze Dong retweeted

Dinghuai Zhang 张鼎怀@zdhnarsil·3 Apr

Fourth time of the SPIGM workshop, what a wonderful record!

Jiajun He@JiajunHe614

🚨SPIGM is back at ICML 2026 — Call for Papers 🚨 SPIGM: Structured Probabilistic Inference & Generative Modeling — beyond scaling & benchmarks 📍Seoul 🇰🇷 🗓️Submit by April 24 (AoE) 👇Submission link below.

Hanze Dong retweeted

Claude@claudeai·3 Apr

Microsoft 365 connectors are now available on every Claude plan. Connect Outlook, OneDrive, and SharePoint to bring your email, docs, and files into the conversation. Get started here: claude.ai/customize/conn…

Hanze Dong retweeted

Yuhui Xu@xyh6666·2 Apr

Happy to have made a small contribution to this

Omar Sanseviero@osanseviero

Gemma 4 is here! 🧠 31B and 26B A4B for models with impressive intelligence per parameter 🤏E2B and E4B for mobile and IoT 🤗Apache 2.0 🤖Base and IT checkpoints available Available in AI Studio, Hugging Face, Ollama, Android, and your favorite OS tools 🚀Download it today!

Hanze Dong retweeted

Satya Nadella@satyanadella·2 Apr

We’re bringing our growing MAI model family to every developer in Foundry, including … · MAI-Transcribe-1, most accurate transcription model in world across 25 languages · MAI-Voice-1, natural, expressive speech generation · MAI-Image-2, our most capable image model yet Start building: microsoft.ai/news/today-wer…

Hanze Dong retweeted

Julien Bek@JulienBek·6 Mar

x.com/i/article/2029…

Hanze Dong@hendrydong·2 Apr

@xyh6666 👍nb

Yuhui Xu@xyh6666·2 Apr

gemma

Demis Hassabis@demishassabis

💎💎💎💎

Hanze Dong retweeted

Satya Nadella@satyanadella·30 Mar

Introducing Critique, a new multi-model deep research system in M365 Copilot. You can use multiple models together to generate optimal responses and reports.

Hanze Dong retweeted

ICML Conference@icmlconf·30 Mar

Google's Paper Assistant Tool was extremely popular, giving AI feedback on ~4500 submissions prior to the #ICML2026 deadline. Results were positive! 92% of participants said they'd use it again, and 73% rated the feedback as helpful. Read the full blog post for more details:

Hanze Dong@hendrydong·29 Mar

Strong point. In the near term, the winners are not just the models with the best raw capability, but the products that can embed token costs into high-value workflows where latency and reliability directly drive retention

Mustafa Suleyman@mustafasuleyman

For the next couple years at least, the entire AI industry is going to be defined by this fact: demand is going to wildly outstrip supply, and so what matters is which companies / products have margin to pay for tokens. Those products will then rapidly improve because latency drives retention, and retention creates data to spin flywheels that improve the product and drive more adoption.

Hanze Dong@hendrydong·27 Mar

@LinyangNeuroAI xianmunin

Linyang He@LinyangNeuroAI·27 Mar

@hendrydong xianmu work anywhere

Hanze Dong@hendrydong·26 Mar

Burned some after-work hours spinning Copilot CLI into WeChat. Full multi-session support. WeChat ←→ copilot-wechat ←→ Copilot CLI (ACP) ←→ GPT / Claude / Gemini That makes me work anywhere with my phone. We're moving past static apps. The new meta is Dynamic Personal Interfaces. Spin up a bespoke workspace in minutes, throw it into your favorite chat app, and ship from anywhere. New vibes are incoming! github.com/hendrydong/cop…
