Hubert Strauss

20 posts

@hubstrauss

Research Engineer @PrincetonPLI @Princeton

Joined December 2022
450 Following · 66 Followers
Pinned Tweet
Hubert Strauss @hubstrauss
Very excited to share our latest work with @susieshang, @stanleyrwei, @prfsanjeevarora, and @noamrazin! One concrete application is in the code generation setting: unit tests give verifiable rewards, but partial credit for partially passing code is not always benign. It can help learning, or trap the model on almost-correct solutions. Still a lot to understand about what makes a good proxy reward function!
Noam Razin @noamrazin

📰 RL for LMs often relies on imperfect proxy rewards, which can lead to reward hacking. But are incorrect rewards necessarily harmful? Turns out, they can also be benign or even beneficial! This has implications for reward model evaluation and verifiable reward design. 🧵
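
To make the partial-credit point concrete, here is a minimal sketch of two unit-test rewards. It is not taken from the paper: the function names, the toy absolute-value task, and the test cases are all hypothetical. It only illustrates how an almost-correct program can earn substantial proxy reward while its true (all-tests-pass) reward is zero.

```python
from typing import Callable, List, Sequence, Tuple

def binary_reward(passed: Sequence[bool]) -> float:
    """Verifiable reward: 1.0 only if every unit test passes, else 0.0."""
    return 1.0 if passed and all(passed) else 0.0

def partial_credit_reward(passed: Sequence[bool]) -> float:
    """Proxy reward: fraction of unit tests passed."""
    return sum(passed) / len(passed) if passed else 0.0

def run_tests(candidate: Callable, tests: List[Tuple[tuple, object]]) -> List[bool]:
    """Run each (args, expected) case; an exception counts as a failure."""
    results = []
    for args, expected in tests:
        try:
            results.append(candidate(*args) == expected)
        except Exception:
            results.append(False)
    return results

# Hypothetical task: implement absolute value.
tests = [((3,), 3), ((0,), 0), ((-2,), 2), ((-7,), 7)]

def almost_correct(x):
    return x  # wrong for negative inputs

def correct(x):
    return x if x >= 0 else -x

for name, fn in [("almost_correct", almost_correct), ("correct", correct)]:
    passed = run_tests(fn, tests)
    print(name, binary_reward(passed), partial_credit_reward(passed))
```

In this toy case the almost-correct program gets 0.5 under partial credit and 0.0 under the binary reward. Whether that gap helps learning or traps the model on near-misses is the question the thread raises.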

0
0
5
413
Hubert Strauss retweeted
Chengshuai Shi @chengshuai_shi
Excited to see MLS-Bench out! I think “auto research” is becoming the next major direction beyond coding agents. What I especially like about MLS-Bench is the emphasis on science: not just tuning systems for a fixed task, but asking whether agents can discover improvements that genuinely generalize and scale. Excited to have been part of this amazing team effort!
Wenhao Chai @wenhaocha1

Introducing MLS-Bench for machine learning science. Auto research built on coding agents is undoubtedly another major market beyond SWE coding. It is harder and more challenging.

However, we believe there are two different categories here. Auto research from @karpathy, MLE-Bench, and PostTrainBench are one type of attempt: engineering. Agents are asked to optimize a specific engineering objective, but we do not require them to produce transferable, generalizable behavior.

MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve a specific component of an ML system or algorithm, and to demonstrate that the improvement generalizes and scales under controlled settings. We find that current agents are still far from consistently outperforming human-designed methods, and that engineering-style tuning is much easier for them than genuine method invention.

@Lyubh22 is the lead of this project. I was deeply impressed by the way he used Discord to organize agent trajectories and share them with the team.

Paper: arxiv.org/abs/2605.08678
Code: github.com/Imbernoulli/ML…
Website: mls-bench.com

0
2
17
1.3K
Hubert Strauss retweeted
Chengshuai Shi @chengshuai_shi
🔥 Excited to share our new paper: 🚀 Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning 🎮 We study how to make RL stable and effective for training VLM agents in long-horizon, visually grounded environments — using the video game Super Mario Land as a testbed. 📜 Paper: arxiv.org/abs/2605.00347 🔗 Project page (w/ video demos): odysseus-project.github.io
5
48
248
97.7K
Hubert Strauss retweeted
Kilian Lieret @KLieret
Introducing ProgramBench: 200 whole-repo generation tasks rigorously evaluated in cleanroom settings (no internet, no decompilation, no leaked source, no systracing, ...). Best score is **0**.
John Yang @jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

5
6
50
6.2K
Hubert Strauss retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/
7
131
780
228.7K
Hubert Strauss @hubstrauss
I’ll be at #NeurIPS2025 from Tuesday to Sunday! If you’re planning to attend, let’s meet up and chat about RL for LLMs, efficient inference, or just grab ☕️
0
0
2
141
Hubert Strauss retweeted
John Yang @jyangballin
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
31
99
416
101.9K
Hubert Strauss retweeted
Yong Lin @Yong18850571
(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: Solves 64 problems with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%.
* 8B > 671B: Our 8B model matches DeepSeek-671B on MiniF2F.
📚 Leading on MathOlympiadBench (IMO-level problems)
* Solves 73 vs 50 over 671B DeepSeek Prover
🔓 Website: blog.goedel-prover.com
🔓 Model 32B: huggingface.co/Goedel-LM/Goed…
🔓 Model 8B: huggingface.co/Goedel-LM/Goed…
🔓 Data and training pipeline will be released soon.
Amazing Collaborators: @sangertang1999 @Lyubh22 @__zrrr__ @juihuichung @thomaszhao1998 @pero733858111 @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML
9
91
260
95.3K
Hubert Strauss retweeted
Noam Razin @noamrazin
Reward models (RMs) are key to language model post-training and inference pipelines. But, little is known about the relative pros and cons of different RM types. 📰 We investigate why RMs implicitly defined by language models (LMs) often generalize worse than explicit RMs 🧵 1/6
3
18
164
11.5K
Hubert Strauss retweeted
Adithya Bhaskar @AdithyaNLP
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
2
32
227
31.9K
Hubert Strauss retweeted
Zixuan Wang @zzZixuanWang
LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)
3
47
265
55.3K
Hubert Strauss @hubstrauss
Great project where we rethink attention for inference: Grouped-Tied Attn (GTA) ties the KV, and Grouped Latent Attn (GLA) shards low-rank latents across GPUs. Results: high arithmetic intensity, high-quality models, and a parallel-friendly design. Kudos to the team!!
Ted Zadouri @tedzadouri

"Pre-training was hard, inference easy; now everything is hard."-Jensen Huang. Inference drives AI progress b/c of test-time compute. Introducing inference aware attn: parallel-friendly, high arithmetic intensity – Grouped-Tied Attn & Grouped Latent Attn

0
1
7
1K
Hubert Strauss retweeted
Noam Razin @noamrazin
I’m seeing reward variance popping up in papers on RL for language model post-training, so thought it is worth connecting the dots in our line of work that shows how reward variance is related to the RL objective landscape. 🧵 1/7
4
9
99
11.7K
Hubert Strauss retweeted
Xindi Wu @cindy_x_wu
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
6
47
158
55.3K
Hubert Strauss retweeted
Abhishek Panigrahi @Abhishek_034
🎉 Excited to present 2 papers at #ICLR2025 in Singapore!
🧠 Progressive distillation induces an implicit curriculum
📢 Oral: Sat, 4:30–4:42pm @ Garnet 216–218
🖼️ Poster: Sat, 10:00am–12:30pm (#632)
⚙️ Efficient stagewise pretraining via progressive subnetworks
🖼️ Poster: Thurs, 3:00–5:30pm (#584)
Happy to chat about distillation, curricula, and efficient pretraining!
2
10
61
9.6K
Hubert Strauss retweeted
Stanley Wei @stanleyrwei
New unlearning work at #ICLR2025! We give guarantees for unlearning a simple class of language models (topic models), and we further show it's easier to unlearn pretraining data during fine-tuning, without even modifying the base model. Paper: arxiv.org/abs/2411.12600 🧵:
2
15
66
6.9K
Hubert Strauss retweeted
Kilian Lieret @KLieret
Evaluating SWE-agent on SWE-bench lite was once an overnight job. With SWE-ReX parallelizing our execution it now takes half an hour! SWE-ReX spins up docker containers with a @FastAPI server that uses pexpect to interface with shell sessions. MIT licensed, lightweight & hackable
1
3
21
1.7K
Hubert Strauss retweeted
Yong Lin @Yong18850571
We are excited to announce the release of Goedel-Pset (huggingface.co/datasets/Goede…), the largest Lean statement dataset, which contains 1.73 million samples. Goedel-Pset is 10 times larger than Lean Workbook. We hope this resource will facilitate further research within the community.
1
20
80
10.5K
Hubert Strauss @hubstrauss
An interesting project that shows, through theory and experiments, that an accurate reward model does not necessarily make a good teacher for RLHF: the reward variance it induces also comes into play! Kudos to the team!!
Noam Razin @noamrazin

The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality? 📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers! 🧵

0
0
4
757