Brace(Hanyang) Zhao

81 posts

Brace(Hanyang) Zhao

@OptionsGod_lgd

PhD student at @Columbia working on post training generative models | ex intern at @Netflix @CapitalOne

Manhattan, NY Katılım Nisan 2021

881 Takip Edilen123 Takipçiler

Brace(Hanyang) Zhao@OptionsGod_lgd·6d

@yule_gan Interesting work! Curious that will the conclusion hold for models beyond Qwen 2.5, e.g. Qwen3? I remember there's supurious rewards paper last year whose observations also only hold for Qwen 2.5. Could this be the Qwen 2.5's magic?

English

252

Yulu Gan@yule_gan·13 Mar

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu

English

430

670.6K

Brace(Hanyang) Zhao@OptionsGod_lgd·7 Mar

Did not think before that my favorite two topics - RL and Diffusion - can meet in a such interesting way :)

Aviral Kumar@aviral_kumar2

🚨🚨 New paper on flow-matching value functions Last year, we showed training RL value functions with a flow-matching loss achieved SOTA results. But why does it work? And what could it possibly tell us about other things that have nothing to do with VFs or even RL? Short answer: iterative compute used correctly can address feature plasticity in continual learning! 🧵⬇️

English

206

Brace(Hanyang) Zhao@OptionsGod_lgd·26 Şub

@xidulu Believe me that getting team matched twice and then both rejected could feel worse 😭

English

Xidulu@xidulu·25 Şub

Every intern application season, I had more papers, more citations, more internship experiences, and offers from more prestigious places than last year; However one thing that doesn't change is that Google student researcher program NEVER gives me any sort of interview LOL

English

223

18K

Brace(Hanyang) Zhao@OptionsGod_lgd·6 Şub

@hamishivi self-reinforcement learning

English

305

Hamish Ivison@hamishivi·6 Şub

4.6 writes a plan and then says 'Excellent plan' to itself lmao

English

2.3K

90.3K

Brace(Hanyang) Zhao@OptionsGod_lgd·5 Şub

@QPHutu I guess it is much easier to choose a proper clip ratio to start with like 0.2 in common practice than this absolute difference delta? Any idea of the range of delta and is delta sensitive?

English

582

Penghui Qi@QPHutu·5 Şub

This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in LLM regime due to the large, long-tailed vocabulary.💔 Checkout our paper for more details👇

English

542

45.3K

Brace(Hanyang) Zhao@OptionsGod_lgd·4 Şub

Great work! We did something quite similarly for image diffusion almost one year ago 🤣 In Rich Preference Optimization (arxiv.org/pdf/2503.11720), we train diffusion models by generating better images conditional on feedbacks (like your second turn answer), yet we use dpo instead of online rl.

English

756

Yuda Song@yus167·3 Şub

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

English

101

602

104.5K

Brace(Hanyang) Zhao@OptionsGod_lgd·14 Oca

Now I am really curious: Why Anthropic is still hiring engineers ...

Claude@claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

English

266

Brace(Hanyang) Zhao@OptionsGod_lgd·5 Oca

Cool work! We had an earlier quite similar exploration in utilizing VLMs to improve Diffusion Models as well in RPO (arxiv.org/pdf/2503.11720), but instead we create synthetic preference pairs by revising upon original image from critiques and revising suggestions provided by VLMs!

English

Kyle Sargent@KyleSargentAI·19 Ara

Vision-language models are getting better every day. Can we use them to improve image compression? Yes! For my internship, working w/ @GoogleDeepMind, @GoogleResearch, we designed VLIC, a diffusion autoencoder post-trained with VLM preferences. Our preprint is out today! A🧵:

English

314

43.7K

Brace(Hanyang) Zhao@OptionsGod_lgd·5 Oca

Really miss the good old times of RL and Diffusion Models: Both domains are with interesting math involved, and cool experiments that can be done like RL gym envs or image generation tasks to verify the ideas. Recent research and papers are fundemenatlly different flavors....

English

Brace(Hanyang) Zhao@OptionsGod_lgd·4 Oca

Happy moments of the past 2025: received a grant from @thinkymachines 🥹 Hope to build something cool in the new year!

English

124

Brace(Hanyang) Zhao retweetledi

Yang Song@DrYangSong·29 Eki

Applications change, but the principles are enduring. After a year's hard work led by @JCJesseLai, we are really excited to share this deep, systematic dive into the mathematical principles of diffusion models. This is a monograph we always wished we had.

Chieh-Hsin (Jesse) Lai@JCJesseLai

Tired to go back to the original papers again and again? Our monograph: a systematic and fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core ideas that shaped diffusion modeling and explains how today’s models work, why they work, and where they’re heading. 🧵You’ll find the link and a few highlights in the thread. We’d love to hear your thoughts and join some discussions! ⚡ Stay tuned for our markdown version, where you can drop your comments!

English

447

57.8K

Brace(Hanyang) Zhao retweetledi

Thinking Machines@thinkymachines·1 Eki

Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker

English

245

789

5.9K

4.2M

Brace(Hanyang) Zhao retweetledi

Thinking Machines@thinkymachines·29 Eyl

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

English

557

3.5K

1.4M

Brace(Hanyang) Zhao@OptionsGod_lgd·10 Eyl

@aviral_kumar2 Sounds interesting! We also did some investigation in utilizing diffusion models structure for value function design in arxiv.org/pdf/2502.01819, not sure whether some parameterizations can be correlated!

English

401

Aviral Kumar@aviral_kumar2·9 Eyl

🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️

English

703

70.1K

Brace(Hanyang) Zhao@OptionsGod_lgd·13 Ağu

This paper clearly reminds myself of our ICLR 2025 RainbowPO (arxiv.org/pdf/2410.04203) paper last year, in which we did similar things for DPO variants. Enjoy reading (and doing) such works which could help elucidating the design space.

Andrew Zhao@_AndrewZhao

Nice empirical paper investigating all your bag of tricks in reasoning LLMs arxiv.org/abs/2508.08221

English

326

Brace(Hanyang) Zhao retweetledi

机器之心 JIQIZHIXIN@jiqizhixin·1 Ağu

ByteDance is exploring diffusion LLMs too! 👀 Seed Diffusion Preview: a blazing-fast LLM for code, built on discrete-state diffusion. With 2,146 tokens/sec inference on H20 GPUs, it outpaces Mercury & Gemini Diffusion, while matching their performance on standard code benchmarks. New SOTA on the speed–quality Pareto frontier. 🚀

English

533

46.4K

Brace(Hanyang) Zhao retweetledi

Google DeepMind@GoogleDeepMind·21 Tem

An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵

English

152

705

4.3K

1.1M

Brace(Hanyang) Zhao@OptionsGod_lgd·14 Tem

Will be presenting this paper on Wednesday (July 16) 4:30 p.m.-7:00 PDT p.m. at West Exhibition Hall B2-B3 W-706! Love to chat about Diffusion Models, RL, LLMs or anything else!

Brace(Hanyang) Zhao@OptionsGod_lgd

Maybe a bit late, but I am thrilled to announce that our Scores as Actions (arxiv.org/abs/2502.01819) paper is accepted by #icml2025 🎉 We propose a continuous-time RL method for diffusion models RLHF (which are naturally continuous time) and perform better than discrete-time RL!

English

203

Keşfet

@yule_gan @xidulu @hamishivi @QPHutu @GoogleDeepMind @GoogleResearch @thinkymachines @JCJesseLai