Minwu Kim

55 posts

@MinwuKim3

LLM Reasoning @NYUAbuDhabi

Abu Dhabi, UAE · Joined August 2022

255 Following · 44 Followers

Pinned Tweet

Minwu Kim@MinwuKim3·29 Jan

🚀 New paper: Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning 📄 arxiv.org/pdf/2601.20829 RLVR works great, until it doesn’t. Training stalls when problems saturate. We propose a way to extend learning from these problems. Details in the 🧵

112

Minwu Kim retweeted

Maksym Andriushchenko@maksym_andr·5d

💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted? We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc.

Why forecasting? Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold.

Beyond producing accurate point-wise forecasts, having correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously, but they rather assist human decision making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer.

Why *numerical* forecasting? Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter: GDP growth, ARR numbers, election margins, infrastructure timelines are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates.

So we built a benchmark to measure this!

This is joint work with Jeremy Qin @Jjq2221.
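To make "calibrated confidence intervals" concrete, here is a minimal sketch of an empirical coverage check. The tweet does not say which metric QuantSightBench uses, so the function name `interval_coverage`, the nominal 80% level, and the toy numbers are all assumptions for illustration.

```python
def interval_coverage(intervals, outcomes):
    """Fraction of realized outcomes that fall inside the predicted interval.

    intervals: list of (low, high) pairs, one per forecast question.
    outcomes:  list of realized values, in the same order.
    """
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, outcomes))
    return hits / len(outcomes)


# Hypothetical forecasts, each stated as an 80% confidence interval.
intervals = [(1.0, 3.0), (10.0, 20.0), (0.0, 5.0), (100.0, 120.0)]
outcomes = [2.5, 25.0, 4.0, 110.0]

coverage = interval_coverage(intervals, outcomes)
nominal = 0.80  # assumed confidence level of the stated intervals
print(f"empirical coverage {coverage:.2f} vs nominal {nominal:.2f}")
```

A forecaster is well calibrated when empirical coverage stays close to the nominal level over many questions; coverage well below it suggests intervals that are too narrow.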

7.9K

Minwu Kim retweeted

clem 🤗@ClementDelangue·6d

Paper of the day! huggingface.co/papers/2605.13…

Ning Ding@stingning

We’re releasing a 30B-A3B reasoning model that reaches gold-medal level across both physics and math Olympiad evaluations: IPhO directly, and IMO/USAMO with test-time self-verification and refinement. A simple, unified scaling recipe for proof search. huggingface.co/papers/2605.13…

288

62.1K

Minwu Kim retweeted

Brian D. Earp, Ph.D.@briandavidearp·15 May

"Writing is thinking." This phrase went viral recently (from lnkd.in/gYj2c9uE), often quoted in the context of objections to use of AI in drafting academic prose. In Nature Reviews Bioengineering we respond: "Thinking is not only writing." Preview below. Shareable full access link: rdcu.be/fiuYi

156

722

115.4K

Minwu Kim retweeted

Linlu Qiu@linluqiu·12 May

Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵

130

804

134K

Minwu Kim retweeted

jenny huang@JennyHuang99·5 May

recently, i’ve been thinking about ways to design ai systems to be more compatible with slow thinking 🐌. you can check out the full blogpost here 🤗: jennyhuang19.github.io/slow-ai-ai-tha…

167

11.5K

Minwu Kim retweeted

Keshav Ramji@KeshavRamji·27 Apr

What if your language model could reason efficiently in an entirely new language? We introduce Abstract Chain-of-Thought, a new mechanism which allows language models to reason through a short sequence of reserved "abstract" tokens through reinforcement learning. It is as performant as verbalized CoT at a fraction of the cost, achieving major gains in inference-time efficiency.

133

1.1K

1.2M

Minwu Kim retweeted

Saining Xie@sainingxie·7 Nov

Introducing Cambrian-S: it’s a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶

102

687

257.2K

Minwu Kim retweeted

Anthropic@AnthropicAI·24 Apr

New Anthropic research: Project Deal. We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues’ behalf.

470

733

7.6K

2.9M

Minwu Kim retweeted

Mayee Chen@MayeeChen·13 Feb

Data mixing - determining ratios across your training datasets - matters a lot for model quality. While building Olmo 3, we learned it’s hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development. Introducing Olmix👇

269

56.5K

Minwu Kim retweeted

Russ Salakhutdinov@rsalakhu·22 Apr

Foresight will be the defining frontier on the path to AGI. I am excited to start Sooth Labs with my amazing co-founders: Yaser Sheikh @subail, Chuck Hoover @chuckjhoover, David LaRose, and Shih-En Wei. Deeply grateful to Aydin Senkut @asenkut and Feyza Haskaraman @FHaskaraman at @felicis for leading the round, alongside an exceptional group of partners. bloomberg.com/news/articles/…

275

71.9K

Minwu Kim retweeted

Pavel Izmailov@Pavel_Izmailov·21 Apr

Excited to share our new paper! As LLMs get stronger, reliable reward signals get harder to build. We study RLVR generalization under three weak supervision settings (scarce data, noisy rewards, and proxy rewards) across Qwen and Llama on math, science, and graph reasoning. Some models learn to reason. Others just memorize. We show why, and how to fix it 🧵 📄 salmanrahman.net/rlvr-weak-supe…

187

16.6K

Minwu Kim retweeted

Daniel Khashabi 🕊️@DanielKhashabi·15 Apr

LLMs are increasingly embedded in agentic systems, where they must interpret and prioritize instructions from 𝐡𝐞𝐭𝐞𝐫𝐨𝐠𝐞𝐧𝐞𝐨𝐮𝐬 sources: system messages, user queries, tool outputs, ... you name it! 𝘊𝘰𝘯𝘧𝘭𝘪𝘤𝘵𝘴 among these sources may arise naturally, e.g., when a subagent's feedback 𝘤𝘰𝘯𝘵𝘳𝘢𝘥𝘪𝘤𝘵𝘴 a system-level requirement or a tool output conflicts with user preferences.

Such conflicts can lead to dire vulnerabilities such as system prompt extraction (@daphneipp et al.) and indirect prompt injection attacks (@KGreshake et al.).

To resolve this, "𝐈𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐇𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐲" (IH; @Eric_Wallace_) formalizes how models should resolve conflicts among instructions of different trust levels. IH is therefore a key abstraction for ensuring models behave according to their designer's specifications.

In current practice, IH is typically instantiated with a fixed, small set of privilege levels determined during post-training. For example, OpenAI hardcodes a fixed set of roles (e.g., root, system, developer, user). But fixed-/few-tier IH is unlikely to suffice for real-world agents that interact with 𝘮𝘢𝘯𝘺 heterogeneous sources.

Here, we propose "𝐌𝐚𝐧𝐲-𝐓𝐢𝐞𝐫 Instruction Hierarchy" (ManyIH), which: (1) motivates scaling up the depth of IHs, and (2) decouples privilege from message role names and instead assigns each instruction its own privilege value.

We also introduce ManyIH-Bench🪜 on which the best frontier model (Gemini 3.1 Pro) achieves only 42.7% accuracy! Similarly, GPT-5.4 scores 39.4% here, despite having scored >99% on 2-tier IH evals. The takeaway is that existing models trained on fixed-tier IH do 𝐧𝐨𝐭 immediately generalize to many-tier settings. Help us scale them! 🤗
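The idea of giving each instruction its own privilege value, rather than tying privilege to a role name, can be illustrated with a toy resolver. This is only a sketch: the `Instruction` class, the `resolve` function, the "larger number means more trusted" convention, and the tie-breaking rule are assumptions for illustration, not the ManyIH formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str     # free-form label; it plays no part in resolution
    key: str        # the behavior this instruction tries to control
    value: str      # the requested setting for that behavior
    privilege: int  # assumption: larger number = more trusted


def resolve(instructions):
    """For each key, keep the instruction with the highest privilege.

    Ties keep the earlier instruction (an arbitrary choice for this sketch).
    """
    winners = {}
    for ins in instructions:
        current = winners.get(ins.key)
        if current is None or ins.privilege > current.privilege:
            winners[ins.key] = ins
    return winners


# Hypothetical messages from five sources with five distinct privilege values.
msgs = [
    Instruction("system", "language", "English", privilege=90),
    Instruction("user", "language", "French", privilege=50),
    Instruction("tool:web", "reveal_system_prompt", "yes", privilege=5),
    Instruction("developer", "reveal_system_prompt", "no", privilege=80),
    Instruction("subagent", "tone", "casual", privilege=30),
]

resolved = resolve(msgs)
for key, ins in resolved.items():
    print(f"{key} = {ins.value} (from {ins.source}, privilege {ins.privilege})")
```

Because resolution reads only the `privilege` field, adding a sixth or sixtieth source needs no new role name, which is the decoupling the thread describes.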

7.9K

Minwu Kim retweeted

AI Native Foundation@AINativeF·30 Jan

7. Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

🔑 Keywords: failure-prefix conditioning, Reinforcement Learning, informative failures, token efficiency, robustness

💡 Category: Reinforcement Learning

🌟 Research Objective:
- The study aims to enhance reinforcement learning by leveraging failure-prefix conditioning to improve exploration and robustness in saturated problems.

🛠️ Research Methods:
- The method reallocates exploration by focusing training on prefixes from rare incorrect reasoning trajectories, allowing models to encounter informative failures effectively and maintaining token efficiency.

💬 Research Conclusions:
- Failure-prefix conditioning improves performance similar to training on medium-difficulty problems and enhances robustness, though with a minor compromise on adherence to initial correct reasoning.
- An iterative approach to refresh failure prefixes during training provides further performance gains after reaching a plateau.

👉 Paper link: huggingface.co/papers/2601.20…
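The method summary above can be sketched as a data-preparation step: keep only the rare incorrect rollouts for a saturated problem, truncate each to a prefix, and use problem plus prefix as a new training prompt. This is a rough illustration, not the paper's implementation. The function `failure_prefix_prompts`, the `fraction` parameter, and the toy rollouts are hypothetical, and the summary does not say how prefix length is chosen.

```python
def failure_prefix_prompts(problem, rollouts, is_correct, fraction=0.5):
    """Build conditioned prompts from the incorrect rollouts of one problem.

    rollouts:   list of token lists sampled from the current model.
    is_correct: callable that returns True when a rollout's answer is right.
    fraction:   share of each failed rollout kept as the prefix (assumed knob).
    """
    prompts = []
    for tokens in rollouts:
        if is_correct(tokens):
            continue  # saturated problems yield mostly correct rollouts
        cut = max(1, int(len(tokens) * fraction))
        prompts.append(problem + " " + " ".join(tokens[:cut]))
    return prompts


# Toy data: three rollouts for one problem, only the last one is wrong.
problem = "What is 2+3?"
rollouts = [
    ["2+3", "=", "5"],
    ["2+3", "=", "5"],
    ["2*3", "=", "6", "answer", "6"],
]
prompts = failure_prefix_prompts(problem, rollouts, lambda t: t[-1] == "5")
print(prompts)
```

Continuing from a failed prefix puts the model in a state where it can still get the answer wrong, so rollouts carry a learning signal again. Re-running this step with fresh rollouts as the model improves corresponds to the iterative refresh mentioned in the conclusions.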

Minwu Kim@MinwuKim3·4 Feb

Glad to have contributed! Do check out the paper :D

Safal Shrestha@saffffal

📄 New paper: On the Limits of Layer Pruning for Generative Reasoning in LLMs TL;DR: You can prune entire layers and keep classification accuracy — but generative reasoning breaks, often irreversibly. arXiv: arxiv.org/abs/2602.01997 Code + models below 👇

Minwu Kim retweeted

Google DeepMind@GoogleDeepMind·28 Jan

Our breakthrough AI model AlphaGenome is helping scientists understand our DNA, predict the molecular impact of genetic changes, and drive new biological discoveries. 🧬 Find out more in @Nature ↓ goo.gle/4bXlV6y

107

728

3.4K

1.1M

Minwu Kim@MinwuKim3·29 Jan

7) 🙏 Big shoutout to my amazing co-authors @saffffal and Prof. Keith Ross! Please go to arxiv.org/pdf/2601.20829 for the full paper.

Minwu Kim@MinwuKim3·29 Jan

6) Iteration matters: Failure prefixes themselves become stale as the model improves. By iteratively refreshing failure prefixes, we get additional gains after performance plateaus. Saturated data keeps giving, if you know where to look.
