Hubert Strauss

20 posts

@hubstrauss

Research Engineer @PrincetonPLI @Princeton

Joined December 2022

450 Following · 66 Followers

Pinned Tweet

Hubert Strauss@hubstrauss·7 May

Very excited to share our latest work with @susieshang, @stanleyrwei, @prfsanjeevarora, and @noamrazin! One concrete application is in the code generation setting: unit tests give verifiable rewards, but partial credit for partially passing code is not always benign. It can help learning, or trap the model on almost-correct solutions. Still a lot to understand about what makes a good proxy reward function!

Noam Razin@noamrazin

📰 RL for LMs often relies on imperfect proxy rewards, which can lead to reward hacking. But are incorrect rewards necessarily harmful? Turns out, they can also be benign or even beneficial! This has implications for reward model evaluation and verifiable reward design. 🧵

Hubert Strauss retweeted

Chengshuai Shi@chengshuai_shi·12 May

Excited to see MLS-Bench out! I think “auto research” is becoming the next major direction beyond coding agents. What I especially like about MLS-Bench is the emphasis on science: not just tuning systems for a fixed task, but asking whether agents can discover improvements that genuinely generalize and scale. Excited to have been part of this amazing team effort!

Wenhao Chai@wenhaocha1

Introducing MLS-Bench for machine learning science. Auto research built on coding agents is undoubtedly another major market beyond SWE coding. It is harder and more challenging.
However, we believe there are two different categories here. Auto research from @karpathy, MLE-Bench, and PostTrainBench are one type of attempt: engineering. Agents are asked to optimize a specific engineering objective, but we do not require them to produce transferable, generalizable behavior.
MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve a specific component of an ML system or algorithm, and to demonstrate that the improvement generalizes and scales under controlled settings. We find that current agents are still far from consistently outperforming human-designed methods, and that engineering-style tuning is much easier for them than genuine method invention.
@Lyubh22 is the lead of this project. I was deeply impressed by the way he used Discord to organize agent trajectories and share them with the team.
Paper: arxiv.org/abs/2605.08678
Code: github.com/Imbernoulli/ML…
Website: mls-bench.com

Hubert Strauss retweeted

Chengshuai Shi@chengshuai_shi·4 May

🔥 Excited to share our new paper: 🚀 Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning 🎮 We study how to make RL stable and effective for training VLM agents in long-horizon, visually grounded environments — using the video game Super Mario Land as a testbed. 📜 Paper: arxiv.org/abs/2605.00347 🔗 Project page (w/ video demos): odysseus-project.github.io

Hubert Strauss retweeted

Kilian Lieret@KLieret·5 May

Introducing ProgramBench: 200 whole-repo generation tasks rigorously evaluated in cleanroom settings (no internet, no decompilation, no leaked source, no systracing, ...). Best score is **0**.

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

Hubert Strauss retweeted

Ted Zadouri@tedzadouri·5 Mar

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/

Hubert Strauss retweeted

Wentao Guo@WentaoGuo7·19 Dec

🚀SonicMoE🚀: a blazingly-fast MoE implementation optimized for NVIDIA Hopper GPUs. SonicMoE reduces activation memory by 45% and is 1.86x faster on H100 than previous SOTA😃 Paper: arxiv.org/abs/2512.14080 Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao

Hubert Strauss@hubstrauss·2 Dec

I’ll be at #NeurIPS2025 from Tuesday to Sunday! If you’re planning to attend, let’s meet up and chat about RL for LLMs, efficient inference, or just grab ☕️

Hubert Strauss retweeted

John Yang@jyangballin·5 Nov

New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals

Hubert Strauss retweeted

Yong Lin@Yong18850571·15 Jul

(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: Solves 64 problems with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%.
* 8B > 671B: Our 8B model matches DeepSeek-671B on MiniF2F.
📚 Leading on MathOlympiadBench (IMO-level problems):
* Solves 73 vs 50 over 671B DeepSeek Prover
🔓 Website: blog.goedel-prover.com
🔓 Model 32B: huggingface.co/Goedel-LM/Goed…
🔓 Model 8B: huggingface.co/Goedel-LM/Goed…
🔓 Data and training pipeline will be released soon.
Amazing Collaborators: @sangertang1999 @Lyubh22 @__zrrr__ @juihuichung @thomaszhao1998 @pero733858111 @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML

Hubert Strauss retweeted

Noam Razin@noamrazin·11 Jul

Reward models (RMs) are key to language model post-training and inference pipelines. But, little is known about the relative pros and cons of different RM types. 📰 We investigate why RMs implicitly defined by language models (LMs) often generalize worse than explicit RMs 🧵 1/6

Hubert Strauss retweeted

Adithya Bhaskar@AdithyaNLP·23 Jun

There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7

Hubert Strauss retweeted

Zixuan Wang@zzZixuanWang·30 May

LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)

Hubert Strauss@hubstrauss·30 May

Great project where we rethink attention for inference: Grouped-Tied Attn (GTA) ties the KV, and Grouped Latent Attn (GLA) shards low-rank latents across GPUs. Results: high arithmetic intensity, high-quality models, and a parallel-friendly design. Kudos to the team!

Ted Zadouri@tedzadouri

"Pre-training was hard, inference easy; now everything is hard."-Jensen Huang. Inference drives AI progress b/c of test-time compute. Introducing inference aware attn: parallel-friendly, high arithmetic intensity – Grouped-Tied Attn & Grouped Latent Attn

Hubert Strauss retweeted

Noam Razin@noamrazin·25 May

I’m seeing reward variance popping up in papers on RL for language model post-training, so thought it is worth connecting the dots in our line of work that shows how reward variance is related to the RL objective landscape. 🧵 1/7

Hubert Strauss retweeted

Xindi Wu@cindy_x_wu·3 May

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10

Hubert Strauss retweeted

Abhishek Panigrahi@Abhishek_034·23 Apr

🎉 Excited to present 2 papers at #ICLR2025 in Singapore!
🧠 Progressive distillation induces an implicit curriculum
📢 Oral: Sat, 4:30–4:42pm @ Garnet 216–218
🖼️ Poster: Sat, 10:00am–12:30pm (#632)
⚙️ Efficient stagewise pretraining via progressive subnetworks
🖼️ Poster: Thurs, 3:00–5:30 pm (#584)
Happy to chat about distillation, curricula, and efficient pretraining!

Hubert Strauss retweeted

Stanley Wei@stanleyrwei·22 Apr

New unlearning work at #ICLR2025! We give guarantees for unlearning a simple class of language models (topic models), and we further show it's easier to unlearn pretraining data during fine-tuning, without even modifying the base model. Paper: arxiv.org/abs/2411.12600 🧵:

Hubert Strauss retweeted

Kilian Lieret@KLieret·10 Apr

Evaluating SWE-agent on SWE-bench lite was once an overnight job. With SWE-ReX parallelizing our execution it now takes half an hour! SWE-ReX spins up docker containers with a @FastAPI server that uses pexpect to interface with shell sessions. MIT licensed, lightweight & hackable

Hubert Strauss retweeted

Yong Lin@Yong18850571·1 Apr

We are excited to announce the release of Goedel-Pset (huggingface.co/datasets/Goede…), the largest Lean statement dataset, which contains 1.73 million samples. Goedel-Pset is 10 times larger than Lean Workbook. We hope this resource will facilitate further research within the community.

Hubert Strauss@hubstrauss·21 Mar

An interesting project that shows, through theory and experiments, that an accurate reward model does not necessarily make a good teacher for RLHF: the reward variance it induces also comes into play! Kudos to the team!

Noam Razin@noamrazin

The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality? 📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers! 🧵
