Hanze Dong

332 posts

@hendrydong

Research @MSFTResearch. RL Science | Generative Models | Autonomous Systems

Joined June 2011
517 Following · 708 Followers
Pinned Tweet
Hanze Dong @hendrydong
SFT curates responses. RL curates sampling.
RL improves by curating what the model experiences: condition; distribution; weighting of what gets learned from.
Better signal curation shifts the performance-compute curve upward.
Full write-up below 👇 hendrydong.github.io/blogs/pages/rl…
3 replies · 25 reposts · 220 likes · 30K views
Hanze Dong retweeted
Jiayi Weng @Trinkle23897
Codex grew programmatic policies with no neural nets: max score on Breakout, and SOTA-level scores on MuJoCo. Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm. trinkle23897.github.io/learning-beyon…
61 replies · 230 reposts · 1.4K likes · 2.6M views
Hanze Dong retweeted
Shibo Hao @Ber18791531
🍫 CocoaBench v1.0 is out! CocoaBench is a benchmark for unified digital agents, built around open-world tasks that require composing 💻 coding, 👀 vision, 🌐 search. Since our first research preview last December, we have expanded the benchmark substantially with community contributed tasks, and spent months testing and refining the tasks, evaluations, and agent runs.
Some takeaways:
• Even the best agent system reaches only 45.1% on CocoaBench v1.0.
• Coding agents like Codex are already surprisingly strong on general tasks beyond software engineering.
• Stronger agents tend to push more of the work into code.
• Open source models still lag behind leading frontier models on these general tasks.
👇More on the website and in the paper #AI #Agents #LLM #Benchmark #CocoaBench
Shibo Hao @Ber18791531

🍫 CocoaBench is calling for contributions from the community! Join us and help shape how next-generation agents are evaluated and built🚀✨ #LLM #AI #Agent #CocoaBench More details in the threads 👇

2 replies · 34 reposts · 78 likes · 10.9K views
Hanze Dong @hendrydong
@kellypeilinchan When the test samples are new, this should be called in-domain generalization.
0 replies · 0 reposts · 0 likes · 222 views
mrkelly @kellypeilinchan
@hendrydong I learned this the hard way. Contamination resistance doesn't prove understanding, it just forces more sophisticated overfitting.
1 reply · 0 reposts · 0 likes · 339 views
Hanze Dong @hendrydong
Between theorem recognition and theorem proving lies theorem understanding. We introduce LiveMathematicianBench: a live, contamination-resistant testbed for research-level mathematical reasoning, built from post-cutoff arXiv theorems. It probes a capability that existing benchmarks rarely isolate: whether models can understand theorem statements, track delicate assumptions, reason over logical structure, and leverage proof-level guidance. livemathematicianbench.github.io
10 replies · 35 reposts · 176 likes · 17.2K views
Análise Real @analisereal
@hendrydong Will you test Gemini Deep Think, ChatGPT 5.4 Pro, Grok Expert, Grok Heavy and Claude Opus 4.6?
1 reply · 0 reposts · 4 likes · 833 views
Hanze Dong retweeted
Claude @claudeai
Microsoft 365 connectors are now available on every Claude plan. Connect Outlook, OneDrive, and SharePoint to bring your email, docs, and files into the conversation. Get started here: claude.ai/customize/conn…
832 replies · 1.4K reposts · 16.7K likes · 4.1M views
Hanze Dong retweeted
Satya Nadella @satyanadella
We’re bringing our growing MAI model family to every developer in Foundry, including …
· MAI-Transcribe-1, most accurate transcription model in world across 25 languages
· MAI-Voice-1, natural, expressive speech generation
· MAI-Image-2, our most capable image model yet
Start building: microsoft.ai/news/today-wer…
219 replies · 285 reposts · 1.8K likes · 307.3K views
Hanze Dong retweeted
Satya Nadella @satyanadella
Introducing Critique, a new multi-model deep research system in M365 Copilot. You can use multiple models together to generate optimal responses and reports.
426 replies · 513 reposts · 4.2K likes · 1.4M views
Hanze Dong retweeted
ICML Conference @icmlconf
Google's Paper Assistant Tool was extremely popular, giving AI feedback on ~4500 submissions prior to the #ICML2026 deadline. Results were positive! 92% of participants said they'd use it again, and 73% rated the feedback as helpful. Read the full blog post for more details:
5 replies · 21 reposts · 173 likes · 21.9K views
Hanze Dong @hendrydong
Burned some after-work hours spinning Copilot CLI into WeChat. Full multi-session support.
WeChat ←→ copilot-wechat ←→ Copilot CLI (ACP) ←→ GPT / Claude / Gemini
That lets me work from anywhere on my phone.
We're moving past static apps. The new meta is Dynamic Personal Interfaces. Spin up a bespoke workspace in minutes, throw it into your favorite chat app, and ship from anywhere. New vibes are incoming! github.com/hendrydong/cop…
1 reply · 2 reposts · 10 likes · 432 views