sid bharthulwar
@bharthulwar

Here are some research directions I enjoyed at #NeurIPS (will compile some more soon!)

Bootstrapping long-horizon reasoning: Recent work [1, 2] shows we can train LLMs on short-step problems and curriculum-train them into much longer chains. By composing simple problems into multi-step tasks and using outcome-only rewards, models learned to solve much harder problems. This suggests an efficient path to scaling deep reasoning; I'd love to see it extend beyond verifiable domains.

Reward shaping and PRMs: To get better reasoning, we need rewards beyond basic task completion. Posterior-GRPO uses process-based rewards in code generation, outperforming ORM-based RL [3]; RL-Tango co-trains an LLM PRM with the generator to achieve SOTA on maths benchmarks [4]; ToolRL focuses on PRMs for tool usage [5].

RL on non-verifiable tasks: I saw a really nice transition from verifiable tasks (maths/code) to more open-ended objectives (dialogue, automation, etc.). One interesting trend here is using offline RL for non-verifiable rewards and online RL for verifiable rewards [6]. I would have loved to see more work on online RL for non-verifiable rewards [7].

Science behind RL: There are a lot of interesting questions about what capabilities RL is eliciting in LLMs. [8] questions whether RL adds any reasoning capacity beyond the base model; [9] examines mechanisms to actively elicit meta-cognition to overcome these limitations. I'd love to see more critical examination of the science behind RL.
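The ORM-vs-PRM distinction above can be made concrete with a toy GRPO-style advantage computation. This is purely an illustrative sketch, not the method from [3] or [4]: the function name, the `beta` blend, and the inputs are all my own assumptions.

```python
import numpy as np

def grpo_advantages(outcome_rewards, process_scores, beta=0.5):
    """Toy GRPO-style advantages for a group of sampled completions,
    blending a single outcome reward (ORM view) with the mean of
    per-step process scores (PRM view). Illustrative only."""
    outcome = np.asarray(outcome_rewards, dtype=float)
    process = np.array([np.mean(s) for s in process_scores])
    blended = (1.0 - beta) * outcome + beta * process
    # Group-relative normalization: each completion is compared to the
    # group mean instead of a learned value baseline.
    return (blended - blended.mean()) / (blended.std() + 1e-8)
```

With outcome-only rewards (`beta=0`), two completions that both fail get identical advantages; a process signal can still separate them by how sound their intermediate steps were.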
[1] H1 by @sumeetrm, @philiptorr, @riashatislam, @sytelus, @casdewitt, @CharlieLondon02
[2] Reasoning Curriculum by @bo_pang0, @silviocinguetta, @CaimingXiong, @yingbozhou_ai
[3] Posterior-GRPO by @MouxiangC, @Zhongxin_Liu
[4] RL-Tango by @KaiwenZha, @ZhengqiGao, @maohaos2, @ZhangWeiHong9, @dina_katabi
[5] ToolRL by @emrecanacikgoz, @qiancheng1231, @dilekhakkanitur, @tur_gokhan, @hengjinlp
[6] Writing Zero (not in NeurIPS) by @YunyiYang2
[7] JEPO by @robinphysics, @sidawxyz, @louvishh
[8] Does RL Incentivize Reasoning by @YangYue_THU, @RayLu_THU, @_AndrewZhao
[9] ReMA by @raywzy1, @MarkSchmidtUBC, @seawan, @linyi_yang

GPU-parallelized envs have accelerated RL, but most implementations exhibit critical instability when running on-policy RL with short rollouts. We present Staggered Environment Resets. A few lines of code are all you need! Presenting today, 4:30PM, poster 310 #NeurIPS2025 🧵(1/8)
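The poster has the actual method; as a minimal sketch of the general idea, assuming a hypothetical gym-style batched-env API (`DummyEnv` and the wrapper are my own illustrative names): give each env in the batch a random offset into its episode, so time-limit resets spread out across rollout steps instead of all firing on the same step.

```python
import numpy as np

class DummyEnv:
    """Minimal stand-in for a gym-style env (hypothetical API)."""
    def reset(self):
        return 0  # observation
    def step(self, action):
        return 0, 0.0, False, {}  # obs, reward, done, info

class StaggeredResetWrapper:
    """Wrap a batch of envs and give each a random remaining-step
    budget, so resets are desynchronized across the batch (a sketch
    of the idea, not the paper's implementation)."""
    def __init__(self, envs, max_episode_steps, seed=0):
        self.envs = envs
        self.max_episode_steps = max_episode_steps
        rng = np.random.default_rng(seed)
        # Random initial budgets stagger the first wave of resets.
        self.steps_left = rng.integers(1, max_episode_steps + 1, size=len(envs))

    def reset_all(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs = []
        for i, (env, a) in enumerate(zip(self.envs, actions)):
            o, _, done, _ = env.step(a)
            self.steps_left[i] -= 1
            if done or self.steps_left[i] <= 0:
                # Reset this env alone; the others keep running, so
                # the batch never resets in lockstep.
                o = env.reset()
                self.steps_left[i] = self.max_episode_steps
            obs.append(o)
        return obs
```

Without the stagger, short on-policy rollouts from a freshly reset batch all sample the same early-episode states, which is one plausible source of the instability the thread describes.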



















