sid bharthulwar

14 posts

@bharthulwar

@jumptrading prev @harvard @twosigma @bridgewater

nyc · Joined October 2017
1.2K Following · 373 Followers
Brian Huang@brianryhuang·
@rronak_ you should read the 1000-layer CRL paper from eysenbach's lab (won best paper). it's my favorite paper of this year's neurips
2
0
17
1.3K
Ronak Malde@rronak_·
Best post you’ll bookmark all week
Michael Elabd@MichaelElabd

Here are some research directions I enjoyed at #neurips (will compile some more soon!)

Bootstrapping long-horizon reasoning: Recent work [1, 2] shows we can train LLMs on short-step problems and curriculum them into much longer chains. By composing simple problems into multi-step tasks and using outcome-only rewards, models learned to solve much harder problems. This suggests an efficient path to scaling deep reasoning; would love to see this scale to non-verifiable domains.

Reward shaping and PRMs: To get better reasoning, we need rewards beyond basic task completion. Posterior-GRPO uses process-based rewards in code generation, outperforming ORM-based RL [3]; RL-Tango co-trains an LLM PRM with the generator to achieve SOTA on maths benchmarks [4]; ToolRL focuses on PRMs for tool usage [5].

RL on non-verifiable tasks: I saw a really nice transition from verifiable tasks (maths/code) to more open-ended objectives (dialogue, automation, etc.). One interesting trend here is using offline RL for non-verifiable rewards and online RL for verifiable rewards [6]. Would have loved to see more work on online RL for non-verifiable rewards [7].

Science behind RL: There are a lot of interesting questions about what capabilities RL is eliciting in LLMs. [8] questions whether RL adds any reasoning capacity beyond the base model; [9] examines mechanisms to actively elicit meta-cognition to overcome these limitations. Would love to see more critical examination of the science behind RL.
[1] H1 by @sumeetrm, @philiptorr, @riashatislam, @sytelus, @casdewitt, @CharlieLondon02
[2] Reasoning Curriculum by @bo_pang0, @silviocinguetta, @CaimingXiong, @yingbozhou_ai
[3] Posterior-GRPO by @MouxiangC, @Zhongxin_Liu
[4] RL-Tango by @KaiwenZha, @ZhengqiGao, @maohaos2, @ZhangWeiHong9, @dina_katabi
[5] ToolRL by @emrecanacikgoz, @qiancheng1231, @dilekhakkanitur, @tur_gokhan, @hengjinlp
[6] Writing Zero (not in NeurIPS) by @YunyiYang2
[7] JEPO by @robinphysics, @sidawxyz, @louvishh
[8] Does RL incentivize reasoning by @YangYue_THU, @RayLu_THU, @_AndrewZhao
[9] ReMA by @raywzy1, @MarkSchmidtUBC, @seawan, @linyi_yang
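The composition idea behind [1, 2] can be shown with a toy arithmetic task. This is my own sketch, not code from those papers: single verifiable steps are chained into a longer task, only the final outcome is rewarded, and the chain depth grows as a curriculum.

```python
import random

def make_step():
    """A single verifiable step: add a small random integer."""
    delta = random.randint(1, 9)
    return (lambda x, d=delta: x + d), delta

def make_chain_task(depth):
    """Compose `depth` single-step problems into one long-horizon task."""
    steps = [make_step() for _ in range(depth)]
    start = random.randint(0, 9)
    target = start
    for fn, _ in steps:
        target = fn(target)
    return start, [d for _, d in steps], target

def outcome_reward(prediction, target):
    """Outcome-only reward: 1 iff the final answer is correct,
    with no credit for intermediate steps."""
    return 1.0 if prediction == target else 0.0

# Curriculum: grow chain depth as shorter chains are mastered.
for depth in (1, 2, 4, 8):
    start, deltas, target = make_chain_task(depth)
    prediction = start + sum(deltas)  # stand-in for a model's rollout
    assert outcome_reward(prediction, target) == 1.0
```

The stand-in "rollout" here is just the ground-truth sum; in the actual papers the model produces the chain and the binary outcome reward is the only training signal.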

3
0
39
20.9K
sid bharthulwar@bharthulwar·
next-token prediction may be a sufficient objective (no RL post-training) for general intelligence if neural networks didn't bias toward locality over globality (arxiv.org/abs/2412.11521). at some threshold, a better compressor is more about understanding the underlying reality that led to the next token. poor compression in today's LLMs may be because pretraining is currently done on i.i.d. shuffled samples of data (with multiple passes over it), rather than some natural MDL-based curriculum (similar to how humans learn).
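The contrast between i.i.d. shuffling and an MDL-based curriculum can be made concrete with a toy sketch. Here zlib compressed size is a crude stand-in for description length; this is my own illustration, not a construction from the linked paper:

```python
import random
import zlib

def mdl_score(sample: bytes) -> int:
    """Crude MDL proxy: length of the zlib-compressed sample."""
    return len(zlib.compress(sample))

corpus = [
    b"ab" * 24,                                       # highly regular
    b"abcd" * 12,                                     # regular
    bytes(random.randrange(256) for _ in range(48)),  # near-incompressible
]

# Today's pretraining: i.i.d. shuffled passes over the corpus.
shuffled = random.sample(corpus, k=len(corpus))

# Hypothetical MDL-based curriculum: shortest descriptions first,
# loosely mirroring how humans meet simple structure before complex data.
curriculum = sorted(corpus, key=mdl_score)
```

The shuffled order ignores structure entirely; the curriculum order presents the most compressible (simplest) samples first and only then the high-entropy ones.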
1
0
12
2K
sid bharthulwar@bharthulwar·
people are sleeping on the “multi-agent” aspect of the IMO/IOI results by @polynoamial @alexwei_. my guess is that the same model proposes tasks, generates candidate solutions, and verifies them. At a given iteration, if you can conceptualize and verify a problem of hardness x + epsilon and generate valid solutions of hardness x, you can bridge this epsilon gap iteratively without external verifier information.
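The guessed loop above can be sketched as a toy simulation. Everything here is assumed: "hardness" is an abstract integer level, the success probabilities are made up, and this is purely illustrative of the epsilon-gap argument, not the actual IMO/IOI method:

```python
import random

def propose_task(skill, epsilon=1):
    """Propose a task slightly beyond current solving ability."""
    return skill + epsilon

def generate_candidates(skill, hardness, n=20):
    """Stochastic solver: reliable up to `skill`, occasionally lucky
    just beyond it -- which is what makes the epsilon gap bridgeable."""
    p = 1.0 if hardness <= skill else 0.2
    return [random.random() < p for _ in range(n)]

def self_verify(candidate_ok, hardness, skill, epsilon=1):
    """The same model can check correctness up to skill + epsilon, so
    lucky in-gap solutions are recognized without an external verifier."""
    return candidate_ok and hardness <= skill + epsilon

def bootstrap(skill=0, rounds=50):
    for _ in range(rounds):
        task = propose_task(skill)
        candidates = generate_candidates(skill, task)
        if any(self_verify(c, task, skill) for c in candidates):
            skill += 1  # verified solutions become new training signal
    return skill
```

Because verification reaches epsilon further than generation, any lucky in-gap solution gets recognized and ratcheted into the next skill level, so ability climbs iteratively with no external verifier.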
0
0
8
1.6K
sid bharthulwar@bharthulwar·
@MechanizeWork my impression is that the larger risk of RL'ing models on low-quality envs is reward hacking, rather than compute wastage. changes in a model's weights from RL'ing on a poorly specified reward propagate far deeper than SFT'ing on low-quality human data.
0
0
8
962
Mechanize@MechanizeWork·
People don't understand the basic economics of RL. AI labs are wasting valuable compute by training flagship models on cheap, outsourced RL tasks. It's like putting discount tires on a Ferrari. Invest in quality tasks, or waste your compute.
Mechanize tweet media
7
10
226
22.7K
sid bharthulwar@bharthulwar·
@jchencxh is this provably private tho? even with a sparse/vague reward from evals by a customer's employees, couldn't you construct some setting in which RL introduces sensitive information into the weights? tbf i think this would be very rare / afaik hasn't been studied enough
1
0
0
110
James Chen@jchencxh·
@bharthulwar Hillclimb this internally, just don’t train on their envs/data. Then give them the final model to do a private training run.
1
0
0
124
James Chen@jchencxh·
I don't think the "labs can't afford to 'go deep' for a given company => make an env + RL startup" argument holds logically. By accepting this, you're basically assuming that:
1) LLMs don't broadly transfer from a few "golden envs" for a while (reasonable)
2) Labs (or someone else) can't find a general way to generate rubrics/envs from customer feedback, even semi-autonomously (risky)
I feel like 2) is a very shaky assumption. By making one of these companies, you're prima facie a consulting company with a better multiplier. The moment someone figures out 2), a bit of product innovation which doesn't rely on a lot of new science, you're dead. Even without broad "golden envs" transfer, once you have enough customers you cover most bases anyway with 2). Maybe I'm missing something?
11
4
134
21.6K
sid bharthulwar@bharthulwar·
@jchencxh is (2) feasible with privacy concerns? do large customers want these generalist models being RL'ed on their internal workflows? my impression was that the rise of the RL FDE business is partially attributable to this
1
0
2
104
James Chen@jchencxh·
2) is hard so it doesn’t seem like there are any startups focused on it. But if you want exponentials in this specific market, it seems like there’s very little in terms of long term viability for anything that’s not 2) (or you make 1)) as a startup.
2
0
2
1K
sid bharthulwar@bharthulwar·
I’ve been bullish on learned tool usage for a while, especially for reasoning on images. Excited to see this in the latest OpenAI reasoning models! I’m presenting similar work on learning tool usage through evolutionary search for multimodal reasoning at #ICLR2025 in Singapore next week. Would love to chat with anyone on: diffusion models, symmetry learning / science of DL, RL, robotics, and more.
2
0
21
3.9K
Jeffrey Wang@jeffzwang·
my Harvard dorm room ML/crypto mining setup 😂 4 1080tis, 2 1070s. Many BERTs trained, many ETHs mined
Jeffrey Wang tweet media
113
54
2.2K
1.2M
sid bharthulwar@bharthulwar·
@john7rho newest generational Silicon Valley thought leader right here
2
0
2
433
john rho@john7rho·
Hey Twitter. I wrote a quick thought piece about building value over pedigree as a budding college entrepreneur. It’s a bit long but love to hear any thoughts/feedback. johnrho.medium.com/entry-0002-bui…
3
1
22
4K
Dayy@DayyThrz·
Me
Dayy tweet media
13
2
31
0