Kianté Brantley

2.6K posts

@xkianteb

Assistant Professor at Harvard University @KempnerInst and SEAS | Fitness enthusiast | (He/Him/His)

Joined May 2009
1.1K Following · 1.9K Followers
Pinned Tweet
Kianté Brantley @xkianteb ·
Does LLM RL post-training need to be on-policy?
Kianté Brantley retweeted
Rosinality @rosinality ·
Prefix RL for multi-turn agentic tasks. Additionally, the difficulty of each turn is measured and intermediate rewards are assigned (potentially using a model).
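
The tweet only gestures at the scheme, so here is a minimal, hypothetical sketch of difficulty-weighted credit assignment: a `difficulty_fn` (which could be a learned model, as the tweet suggests) scores each turn, and the trajectory reward is split across turns in proportion to difficulty. The function names and the proportional split are illustrative assumptions, not the paper's actual method.

```python
from typing import Callable, List

def assign_intermediate_rewards(
    turns: List[str],
    final_reward: float,
    difficulty_fn: Callable[[str], float],
) -> List[float]:
    # Score each turn's difficulty (difficulty_fn could be a model),
    # then split the trajectory reward proportionally across turns.
    difficulties = [max(difficulty_fn(turn), 0.0) for turn in turns]
    total = sum(difficulties)
    if total == 0.0:
        # Degenerate case: spread the reward uniformly.
        return [final_reward / len(turns)] * len(turns)
    return [final_reward * d / total for d in difficulties]
```
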
Kianté Brantley retweeted
Ching-An Cheng @chinganc_rl ·
LLMs have been struggling to solve search and optimization at scale when feedback is stochastic. We propose a simple solution, POLCA, using text embeddings with a “provable” guarantee. Excited to see the first theoretically correct work on LLM optimization. Kudos to @XuanfeiRen
Xuanfei Ren @XuanfeiRen

🚀 How can we make LLM-based optimization stable and scalable when the feedback signal is stochastic? Introducing POLCA: a framework for robust, scalable stochastic generative optimization.
Paper: arxiv.org/abs/2603.14769
Code: github.com/rlx-lab/POLCA
🧵👇 1/

Kianté Brantley retweeted
Hanlin Zhang @_hanlin_zhang_ ·
Learning from feedback is instrumental, but human preference data can be expensive. How much reward supervision could we get from raw web text instead, without human labels?

Our latest work, built on a year of incredible effort by @fjxdaisy, advances pure RLHF training across multiple models and tasks.

Unsupervised Reward Modeling: split web docs into (prefix, true continuation); treat mismatched continuations in-batch as negatives; train w/ BT loss + score-centering.

Findings:
📝 steady gains on RewardBench v1/v2 using just 11M tokens of math web text
📝 transfers across backbones (Llama-3.2 1B/3B, Qwen2.5 3B/7B Instruct)
📝 improves Best-of-N selection (math + safety) and provides a usable reward for GRPO policy optimization
📝 acts as a mid-training procedure that helps further RLHF

📑 Paper: arxiv.org/abs/2603.02225
🌐 Project: jingxuanf0214.github.io/reward-scaling

Joint work with @lisali126, @ZhentingQi, @zdhnarsil, @xkianteb, @ShamKakade6
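
A minimal sketch of the objective as the tweet describes it: diagonal (prefix, true continuation) pairs are scored against in-batch mismatched negatives with a Bradley-Terry loss, plus a score-centering term. The `scores[i, j]` layout, the exact centering form, and its 0.01 weight are assumptions here; only the in-batch-negative BT recipe comes from the tweet.

```python
import torch
import torch.nn.functional as F

def unsupervised_rm_loss(scores: torch.Tensor) -> torch.Tensor:
    # scores: [B, B] matrix where scores[i, j] is the reward model's score
    # for (prefix_i, continuation_j). The diagonal holds true continuations;
    # off-diagonal entries are the in-batch mismatched negatives.
    B = scores.size(0)
    pos = scores.diagonal().unsqueeze(1)                      # [B, 1]
    mask = ~torch.eye(B, dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(B, B - 1)                         # [B, B-1]
    # Bradley-Terry: -log sigmoid(s_pos - s_neg), averaged over all pairs.
    bt = F.softplus(neg - pos).mean()
    # Score centering: keep the batch mean score near zero (assumed form).
    center = scores.mean().pow(2)
    return bt + 0.01 * center
```
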
Kianté Brantley retweeted
Dan Goldstein @dggoldst ·
Attention NYC undergrads: Applications are open for our 13th annual Data Science Summer School at Microsoft Research NYC! Apply here by April 14th: bit.ly/3pCQENh
Kianté Brantley retweeted
Tim Vieira @xtimv ·
I built an interactive JavaScript thingy to study the two faces of KL divergence. timvieira.github.io/blog/interacti… I have wanted this since 2009. Thank you, Claude Code, for helping me get there!
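
For readers without the interactive page handy: the "two faces" are forward KL(p‖q) and reverse KL(q‖p). A tiny NumPy illustration (the distributions are made up for the example):

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    # KL(p || q) for discrete distributions on the same support.
    return float(np.sum(p * np.log(p / q)))

# A bimodal "target" p and a unimodal approximation q.
p = np.array([0.49, 0.02, 0.49])
q = np.array([0.10, 0.80, 0.10])

# Forward KL(p||q) punishes q for missing mass where p has it (mode-covering face).
print(kl(p, q))
# Reverse KL(q||p) punishes q for putting mass where p has little (mode-seeking face).
print(kl(q, p))
```
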
Kianté Brantley retweeted
Kyunghyun Cho @kchonyc ·
students continue to be eager to learn, absorb all i can teach them and improve every week. this semester is one of the most rewarding semesters so far. some of us here feel unnecessarily, incorrectly and prematurely so defeated that we send out wrong messages of defeatism. stop! it’s just you who feel defeated and depressed. don’t impose it on others who are up and coming, and seriously much better than us.
Kianté Brantley retweeted
Wen Sun @WenSun1 ·
Going to do a more technical deep dive on our enterprise knowledge agents and how we train them with RL. Overall we found that simple, yet principled off-policy RL works at scale for complex agentic tasks with hundreds of steps of tool use and context management. Here are the key takeaways from our 80-page technical report.

(1) RL does not just sharpen the base model's distribution. We see test-time scaling improve consistently over the iterations of RL training. Skills learned during RL transfer to unseen prompts, and the agent learns to solve prompts where the base model has zero accuracy under pass@16.

(2) Multi-task RL generalizes really well. Simple mixing of training data from multiple tasks works well and allows multi-task RL to scale beyond your in-distribution training tasks. We found that multi-task RL just works better than multi-expert distillation.

(3) End-to-end RL for tools and context management works best. We skipped mid-training and directly trained everything end-to-end using RL at scale (2M tokens per gradient computation). Models learned to use vector database tools and context compression at the same time.
Jonathan Frankle @jefrankle

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

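
The report's exact objective isn't given in the tweet; as a generic illustration of "simple, yet principled off-policy RL", here is a standard clipped importance-sampling policy-gradient loss (PPO-style), which corrects for the gap between the data-collecting policy and the current one:

```python
import torch

def clipped_offpolicy_pg_loss(
    logp_new: torch.Tensor,       # log pi_theta(a|s) under the current policy
    logp_behavior: torch.Tensor,  # log pi_b(a|s) under the behavior policy
    advantages: torch.Tensor,     # advantage estimates
    eps: float = 0.2,
) -> torch.Tensor:
    # Importance weight between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_behavior)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (clipped) surrogate, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```
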
Kianté Brantley retweeted
Rosie Zhao @rosieyzh ·
(1/7) RL-finetuned VLMs report steady gains on visual reasoning benchmarks, but whether those improvements are robust in practice is still unclear, especially given ongoing grounding failures and hallucinations. Our preprint shows that simple controlled perturbations can cause substantial accuracy drops, and even when the final answer is right, the chain-of-thought is often wrong or inconsistent, even in the presence of grounding signals. Work done during my internship at Apple last summer!
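
The preprint's perturbations aren't specified in the tweet; as an illustrative sketch, a "controlled perturbation" check might add small pixel noise and measure the accuracy gap. The `model(image, question) -> answer` interface below is hypothetical.

```python
import numpy as np

def perturb_image(img: np.ndarray, noise_std: float = 0.02, seed: int = 0) -> np.ndarray:
    # A small, semantics-preserving perturbation: Gaussian pixel noise on an
    # image in [0, 1]. (The preprint's actual perturbations may differ.)
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, noise_std, img.shape), 0.0, 1.0)

def robustness_gap(model, images, questions, answers) -> float:
    # Accuracy drop between clean and perturbed inputs for a VLM `model`
    # (hypothetical interface: model(image, question) -> answer string).
    clean = np.mean([model(i, q) == a for i, q, a in zip(images, questions, answers)])
    pert = np.mean([model(perturb_image(i), q) == a
                    for i, q, a in zip(images, questions, answers)])
    return float(clean - pert)
```
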
Kianté Brantley retweeted
Jonathan Frankle @jefrankle ·
Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵
Kianté Brantley retweeted
Abhishek Gupta @abhishekunique7 ·
Specifying rewards has always been a pain ☹️ Perhaps not such a pain going forward? Robometer proposes a methodology for training general-purpose reward models that work out of the box across new robots/tasks/environments! The key: find a good way to use negative data :)

Once you do this, you can enable fun downstream applications like efficient real-world RL, world model steering, offline RL, failure detection, and more. We release code and models for you to play around with yourself!

Website: robometer.github.io
Paper: arxiv.org/abs/2603.02115
Code: github.com/robometer/robo…

Special shout-out to our incredible postdoc @Jesse_Y_Zhang for corralling a bunch of awesome collaborators to make this happen!
Jesse Zhang @Jesse_Y_Zhang

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

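
The tweet says the key is "a good way to use negative data" without specifying the objective; one generic option, shown purely as a sketch, is a logistic loss that pushes a reward model's scores on successful trajectories above its scores on failures:

```python
import torch
import torch.nn.functional as F

def reward_loss_with_negatives(
    scores_pos: torch.Tensor,  # reward-model scores on successful trajectories
    scores_neg: torch.Tensor,  # reward-model scores on failure (negative) data
) -> torch.Tensor:
    # Logistic loss: label successes 1, failures 0.
    # -log sigmoid(s_pos) = softplus(-s_pos); -log(1 - sigmoid(s_neg)) = softplus(s_neg).
    pos_loss = F.softplus(-scores_pos).mean()
    neg_loss = F.softplus(scores_neg).mean()
    return pos_loss + neg_loss
```
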