Kianté Brantley

2.6K posts

@xkianteb

Assistant Professor at Harvard University @KempnerInst and SEAS | Fitness enthusiast | (He/Him/His)

Joined May 2009
1.1K Following · 1.9K Followers
Pinned Tweet
Kianté Brantley @xkianteb ·
Does LLM RL post-training need to be on-policy?
Kianté Brantley retweeted
Rosinality @rosinality ·
Prefix RL for multi-turn agentic tasks. Additionally, the difficulty of each turn is measured and intermediate rewards are assigned (potentially using a model).
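
The tweet only gestures at the scheme, so here is a minimal, hypothetical sketch of difficulty-weighted credit assignment: a `difficulty_fn` (which could be a learned model, as the tweet suggests) scores each turn, and the trajectory reward is split across turns in proportion to difficulty. The function names and the proportional split are illustrative assumptions, not the paper's actual method.

```python
from typing import Callable, List

def assign_intermediate_rewards(
    turns: List[str],
    final_reward: float,
    difficulty_fn: Callable[[str], float],
) -> List[float]:
    # Score each turn's difficulty (difficulty_fn could be a model),
    # then split the trajectory reward proportionally across turns.
    difficulties = [max(difficulty_fn(turn), 0.0) for turn in turns]
    total = sum(difficulties)
    if total == 0.0:
        # Degenerate case: spread the reward uniformly.
        return [final_reward / len(turns)] * len(turns)
    return [final_reward * d / total for d in difficulties]
```
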
Kianté Brantley retweeted
Ching-An Cheng @chinganc_rl ·
LLMs have been struggling to solve search and optimization at scale when feedback is stochastic. We propose a simple solution, POLCA, using text embeddings with a “provable” guarantee. Excited to see the first theoretically correct work on LLM optimization. Kudos to @XuanfeiRen
Xuanfei Ren @XuanfeiRen

🚀 How can we make LLM-based optimization stable and scalable when the feedback signal is stochastic? Introducing POLCA: a framework for robust, scalable stochastic generative optimization.
Paper: arxiv.org/abs/2603.14769
Code: github.com/rlx-lab/POLCA
🧵👇 1/

Kianté Brantley retweeted
Hanlin Zhang @_hanlin_zhang_ ·
Learning from feedback is instrumental, but human preference data can be expensive. How much reward supervision could we get from raw web text instead, without human labels?

Our latest work, built on a year of incredible effort by @fjxdaisy, advances pure RLHF training across multiple models and tasks.

Unsupervised Reward Modeling: split web docs into (prefix, true continuation); treat mismatched continuations in-batch as negatives; train w/ BT loss + score-centering.

Findings:
📝 steady gains on RewardBench v1/v2 using just 11M tokens of math web text
📝 transfers across backbones (Llama-3.2 1B/3B, Qwen2.5 3B/7B Instruct)
📝 improves Best-of-N selection (math + safety) and provides a usable reward for GRPO policy optimization
📝 acts as a mid-training procedure that helps further RLHF

📑 Paper: arxiv.org/abs/2603.02225
🌐 Project: jingxuanf0214.github.io/reward-scaling

Joint work with @lisali126, @ZhentingQi, @zdhnarsil, @xkianteb, @ShamKakade6
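
A minimal sketch of the objective as the tweet describes it: diagonal (prefix, true continuation) pairs are scored against in-batch mismatched negatives with a Bradley-Terry loss, plus a score-centering term. The `scores[i, j]` layout, the exact centering form, and its 0.01 weight are assumptions here; only the in-batch-negative BT recipe comes from the tweet.

```python
import torch
import torch.nn.functional as F

def unsupervised_rm_loss(scores: torch.Tensor) -> torch.Tensor:
    # scores: [B, B] matrix where scores[i, j] is the reward model's score
    # for (prefix_i, continuation_j). The diagonal holds true continuations;
    # off-diagonal entries are the in-batch mismatched negatives.
    B = scores.size(0)
    pos = scores.diagonal().unsqueeze(1)                      # [B, 1]
    mask = ~torch.eye(B, dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(B, B - 1)                         # [B, B-1]
    # Bradley-Terry: -log sigmoid(s_pos - s_neg), averaged over all pairs.
    bt = F.softplus(neg - pos).mean()
    # Score centering: keep the batch mean score near zero (assumed form).
    center = scores.mean().pow(2)
    return bt + 0.01 * center
```
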
Kianté Brantley retweeted
Dan Goldstein @dggoldst ·
Attention NYC undergrads: Applications are open for our 13th annual Data Science Summer School at Microsoft Research NYC! Apply here by April 14th: bit.ly/3pCQENh
Kianté Brantley retweeted
Tim Vieira @xtimv ·
I built an interactive JavaScript thingy to study the two faces of KL divergence. timvieira.github.io/blog/interacti… I have wanted this since 2009. Thank you, Claude Code, for helping me get there!
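
For readers without the interactive page handy: the "two faces" are forward KL(p‖q) and reverse KL(q‖p). A tiny NumPy illustration (the distributions are made up for the example):

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    # KL(p || q) for discrete distributions on the same support.
    return float(np.sum(p * np.log(p / q)))

# A bimodal "target" p and a unimodal approximation q.
p = np.array([0.49, 0.02, 0.49])
q = np.array([0.10, 0.80, 0.10])

# Forward KL(p||q) punishes q for missing mass where p has it (mode-covering face).
print(kl(p, q))
# Reverse KL(q||p) punishes q for putting mass where p has little (mode-seeking face).
print(kl(q, p))
```
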
Kianté Brantley retweeted
Kyunghyun Cho @kchonyc ·
students continue to be eager to learn, absorb all i can teach them and improve every week. this semester is one of the most rewarding semesters so far. some of us here feel unnecessarily, incorrectly and prematurely so defeated that we send out wrong messages of defeatism. stop! it’s just you who feel defeated and depressed. don’t impose it on others who are up and coming, and seriously much better than us.
Kianté Brantley retweeted
Wen Sun @WenSun1 ·
Going to do a more technical deep dive on our enterprise knowledge agents and how we train them with RL. Overall we found that simple, yet principled off-policy RL works at scale for complex agentic tasks with hundreds of steps of tool use and context management. Here are the key takeaways from our 80-page technical report.

(1) RL does not just sharpen the base model's distribution. We see test-time scaling improve consistently over the iterations of RL training. Skills learned during RL transfer to unseen prompts, and the agent learns to solve prompts where the base model has zero accuracy under pass@16.

(2) Multi-task RL generalizes really well. Simple mixing of training data from multiple tasks works well and allows multi-task RL to scale beyond your in-distribution training tasks. We found that multi-task RL just works better than multi-expert distillation.

(3) End-to-end RL for tools and context management works best. We skipped mid-training and directly trained everything end-to-end using RL at scale (2M tokens per gradient computation). Models learned to use vector database tools and context compression at the same time.
Jonathan Frankle @jefrankle

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

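
The report's exact objective isn't given in the tweet; as a generic illustration of "simple, yet principled off-policy RL", here is a standard clipped importance-sampling policy-gradient loss (PPO-style), which corrects for the gap between the data-collecting policy and the current one:

```python
import torch

def clipped_offpolicy_pg_loss(
    logp_new: torch.Tensor,       # log pi_theta(a|s) under the current policy
    logp_behavior: torch.Tensor,  # log pi_b(a|s) under the behavior policy
    advantages: torch.Tensor,     # advantage estimates
    eps: float = 0.2,
) -> torch.Tensor:
    # Importance weight between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_behavior)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (clipped) surrogate, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```
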
Kianté Brantley retweeted
Rosie Zhao @rosieyzh ·
(1/7) RL-finetuned VLMs report steady gains on visual reasoning benchmarks, but whether those improvements are robust in practice is still unclear, especially given ongoing grounding failures and hallucinations. Our preprint shows that simple controlled perturbations can cause substantial accuracy drops, and even when the final answer is right, the chain-of-thought is often wrong or inconsistent, even in the presence of grounding signals. Work done during my internship at Apple last summer!
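
The preprint's perturbations aren't specified in the tweet; as an illustrative sketch, a "controlled perturbation" check might add small pixel noise and measure the accuracy gap. The `model(image, question) -> answer` interface below is hypothetical.

```python
import numpy as np

def perturb_image(img: np.ndarray, noise_std: float = 0.02, seed: int = 0) -> np.ndarray:
    # A small, semantics-preserving perturbation: Gaussian pixel noise on an
    # image in [0, 1]. (The preprint's actual perturbations may differ.)
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, noise_std, img.shape), 0.0, 1.0)

def robustness_gap(model, images, questions, answers) -> float:
    # Accuracy drop between clean and perturbed inputs for a VLM `model`
    # (hypothetical interface: model(image, question) -> answer string).
    clean = np.mean([model(i, q) == a for i, q, a in zip(images, questions, answers)])
    pert = np.mean([model(perturb_image(i), q) == a
                    for i, q, a in zip(images, questions, answers)])
    return float(clean - pert)
```
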
Kianté Brantley retweeted
Jonathan Frankle @jefrankle ·
Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵
Kianté Brantley retweeted
Abhishek Gupta @abhishekunique7 ·
Specifying rewards has always been a pain ☹️ Perhaps not such a pain going forward? Robometer proposes a methodology for training general-purpose reward models that work out of the box across new robots/tasks/environments! The key: find a good way to use negative data :)

Once you do this, you can enable fun downstream applications like efficient real-world RL, world model steering, offline RL, failure detection, and more. We release code and models for you to play around with yourself!

Website: robometer.github.io
Paper: arxiv.org/abs/2603.02115
Code: github.com/robometer/robo…

Special shout-out to our incredible postdoc @Jesse_Y_Zhang for corralling a bunch of awesome collaborators to make this happen!
Jesse Zhang @Jesse_Y_Zhang

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

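
The tweet says the key is "a good way to use negative data" without specifying the objective; one generic option, shown purely as a sketch, is a logistic loss that pushes a reward model's scores on successful trajectories above its scores on failures:

```python
import torch
import torch.nn.functional as F

def reward_loss_with_negatives(
    scores_pos: torch.Tensor,  # reward-model scores on successful trajectories
    scores_neg: torch.Tensor,  # reward-model scores on failure (negative) data
) -> torch.Tensor:
    # Logistic loss: label successes 1, failures 0.
    # -log sigmoid(s_pos) = softplus(-s_pos); -log(1 - sigmoid(s_neg)) = softplus(s_neg).
    pos_loss = F.softplus(-scores_pos).mean()
    neg_loss = F.softplus(scores_neg).mean()
    return pos_loss + neg_loss
```
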