Minwu Kim

55 posts

@MinwuKim3

LLM Reasoning @NYUAbuDhabi

Abu Dhabi, UAE · Joined August 2022
255 Following · 44 Followers
Pinned Tweet
Minwu Kim @MinwuKim3
🚀 New paper: Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
📄 arxiv.org/pdf/2601.20829
RLVR works great, until it doesn’t. Training stalls when problems saturate. We propose a way to extend learning from these problems. Details in the 🧵
Replies 1 · Retweets 1 · Likes 1 · Views 112
Minwu Kim retweeted
Maksym Andriushchenko @maksym_andr
💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted? We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc.
Why forecasting? Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold.
Beyond producing accurate point-wise forecasts, having correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously; rather, they assist human decision-making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer.
Why *numerical* forecasting? Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter (GDP growth, ARR numbers, election margins, infrastructure timelines) are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates. So we built a benchmark to measure this!
This is joint work with Jeremy Qin @Jjq2221.
Replies 4 · Retweets 20 · Likes 87 · Views 7.9K
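The tweet above argues that confidence intervals matter as much as point estimates. One common way to check them is empirical coverage: the share of realized outcomes that land inside the stated interval. The sketch below is a generic illustration, not QuantSightBench's actual metric (which the tweet does not describe); the `IntervalForecast` class and all numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class IntervalForecast:
    """A hypothetical numerical forecast with a stated interval."""
    lower: float
    upper: float
    outcome: float  # realized value, known only after resolution


def empirical_coverage(forecasts):
    """Fraction of realized outcomes that fall inside their interval."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    hits = sum(f.lower <= f.outcome <= f.upper for f in forecasts)
    return hits / len(forecasts)


def mean_width(forecasts):
    """Average interval width; narrower is sharper, but only if coverage holds."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum(f.upper - f.lower for f in forecasts) / len(forecasts)


# Made-up forecasts, imagined as nominal 90% intervals.
forecasts = [
    IntervalForecast(1.0, 3.0, 2.4),     # hit
    IntervalForecast(10.0, 20.0, 25.0),  # miss: outcome above the interval
    IntervalForecast(-1.0, 1.0, 0.2),    # hit
    IntervalForecast(40.0, 60.0, 61.5),  # miss: outcome above the interval
]

coverage = empirical_coverage(forecasts)
width = mean_width(forecasts)
```

If these were meant as 90% intervals, a coverage of 0.5 would suggest overconfidence: the intervals are too narrow for the stated confidence level.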
Minwu Kim retweeted
Brian D. Earp, Ph.D. @briandavidearp
"Writing is thinking." This phrase went viral recently (from lnkd.in/gYj2c9uE), often quoted in the context of objections to use of AI in drafting academic prose. In Nature Reviews Bioengineering we respond: "Thinking is not only writing." Preview below. Shareable full access link: rdcu.be/fiuYi
Replies 21 · Retweets 156 · Likes 722 · Views 115.4K
Minwu Kim retweeted
Linlu Qiu @linluqiu
Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵
Replies 15 · Retweets 130 · Likes 804 · Views 134K
Minwu Kim retweeted
jenny huang @JennyHuang99
recently, i’ve been thinking about ways to design ai systems to be more compatible with slow thinking 🐌. you can check out the full blogpost here 🤗: jennyhuang19.github.io/slow-ai-ai-tha…
Replies 4 · Retweets 21 · Likes 167 · Views 11.5K
Minwu Kim retweeted
Keshav Ramji @KeshavRamji
What if your language model could reason efficiently in an entirely new language? We introduce Abstract Chain-of-Thought, a new mechanism which allows language models to reason through a short sequence of reserved "abstract" tokens through reinforcement learning. It is as performant as verbalized CoT at a fraction of the cost, achieving major gains in inference-time efficiency.
Replies 60 · Retweets 133 · Likes 1.1K · Views 1.2M
Minwu Kim retweeted
Saining Xie @sainingxie
Introducing Cambrian-S: it’s a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
Replies 30 · Retweets 102 · Likes 687 · Views 257.2K
Minwu Kim retweeted
Anthropic @AnthropicAI
New Anthropic research: Project Deal. We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues’ behalf.
Replies 470 · Retweets 733 · Likes 7.6K · Views 2.9M
Minwu Kim retweeted
Mayee Chen @MayeeChen
Data mixing - determining ratios across your training datasets - matters a lot for model quality. While building Olmo 3, we learned it’s hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development. Introducing Olmix👇
Replies 13 · Retweets 72 · Likes 269 · Views 56.5K
Minwu Kim retweeted
Pavel Izmailov @Pavel_Izmailov
Excited to share our new paper! As LLMs get stronger, reliable reward signals get harder to build. We study RLVR generalization under three weak supervision settings (scarce data, noisy rewards, and proxy rewards) across Qwen and Llama on math, science, and graph reasoning. Some models learn to reason. Others just memorize. We show why, and how to fix it 🧵 📄 salmanrahman.net/rlvr-weak-supe…
Replies 6 · Retweets 31 · Likes 187 · Views 16.6K
Minwu Kim retweeted
Daniel Khashabi 🕊️ @DanielKhashabi
LLMs are increasingly embedded in agentic systems, where they must interpret and prioritize instructions from 𝐡𝐞𝐭𝐞𝐫𝐨𝐠𝐞𝐧𝐞𝐨𝐮𝐬 sources: system messages, user queries, tool outputs, ... you name it! 𝘊𝘰𝘯𝘧𝘭𝘪𝘤𝘵𝘴 among these sources may arise naturally, e.g., when a subagent's feedback 𝘤𝘰𝘯𝘵𝘳𝘢𝘥𝘪𝘤𝘵𝘴 a system-level requirement or a tool output conflicts with user preferences. Such conflicts can lead to dire vulnerabilities such as system prompt extraction ( @daphneipp et al.) and indirect prompt injection attacks ( @KGreshake et al.). To resolve this, "𝐈𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐇𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐲" (IH; @Eric_Wallace_) formalizes how models should resolve conflicts among instructions of different trust levels. IH is therefore a key abstraction for ensuring models behave according to their designer's specifications. In current practice, IH is typically instantiated with a fixed, small set of privilege levels determined during post-training. For example, OpenAI hardcodes a fixed set of roles (e.g., root, system, developer, user). But fixed-/few-tier IH is unlikely to suffice for real-world agents that interact with 𝘮𝘢𝘯𝘺 heterogeneous sources. Here, we propose "𝐌𝐚𝐧𝐲-𝐓𝐢𝐞𝐫 Instruction Hierarchy" (ManyIH), which: (1) motivates scaling up the depth of IHs, and (2) decouples privilege from message role names and instead assigns each instruction its own privilege value. We also introduce ManyIH-Bench🪜 on which the best frontier model (Gemini 3.1 Pro) achieves only 42.7% accuracy! Similarly, GPT-5.4 scores 39.4% here, despite having scored >99% on 2-tier IH evals. The takeaway is that existing models trained on fixed-tier IH do 𝐧𝐨𝐭 immediately generalize to many-tier settings. Help us scale them! 🤗
Replies 0 · Retweets 10 · Likes 56 · Views 7.9K
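The ManyIH tweet above describes giving each instruction its own privilege value instead of tying privilege to a fixed role name. A toy version of that idea: when two instructions set the same thing, the one with the higher privilege wins. This is only an illustration; the `Instruction` fields, the "higher number wins" convention, and the tie-breaking rule are assumptions, not details from the paper or the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """A hypothetical instruction carrying its own privilege value."""
    source: str     # e.g. "system", "user", "tool_output"; informational only
    setting: str    # what the instruction tries to control
    value: str      # what it wants that setting to be
    privilege: int  # assumed convention: a larger number means more trusted


def resolve(instructions):
    """Keep, for each setting, the value from the highest-privilege instruction.

    Ties go to the instruction that appeared first (an arbitrary choice here).
    """
    winners = {}
    for ins in instructions:
        current = winners.get(ins.setting)
        if current is None or ins.privilege > current.privilege:
            winners[ins.setting] = ins
    return {setting: ins.value for setting, ins in winners.items()}


# Made-up conflict: four sources, four distinct privilege values.
instructions = [
    Instruction("system", "language", "English", 90),
    Instruction("tool_output", "language", "French", 10),
    Instruction("user", "tone", "casual", 50),
    Instruction("subagent", "tone", "formal", 30),
]

resolved = resolve(instructions)
```

Note that privilege here is decoupled from the role name: nothing in `resolve` looks at `source`, so any number of tiers can coexist.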
Minwu Kim retweeted
AI Native Foundation @AINativeF
7. Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
🔑 Keywords: failure-prefix conditioning, Reinforcement Learning, informative failures, token efficiency, robustness
💡 Category: Reinforcement Learning
🌟 Research Objective:
- The study aims to enhance reinforcement learning by leveraging failure-prefix conditioning to improve exploration and robustness in saturated problems.
🛠️ Research Methods:
- The method reallocates exploration by focusing training on prefixes from rare incorrect reasoning trajectories, allowing models to encounter informative failures effectively and maintaining token efficiency.
💬 Research Conclusions:
- Failure-prefix conditioning improves performance similar to training on medium-difficulty problems and enhances robustness, though with a minor compromise on adherence to initial correct reasoning.
- An iterative approach to refresh failure prefixes during training provides further performance gains after reaching a plateau.
👉 Paper link: huggingface.co/papers/2601.20…
Replies 1 · Retweets 1 · Likes 0 · Views 23
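The summary above says the method trains on prefixes taken from rare incorrect reasoning trajectories. A rough sketch of the data-preparation step might look like the following. It is an assumption-laden illustration, not the paper's implementation: the `Rollout` class, the fixed `fraction` cut point, and the way the prefix is appended to the problem text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """A hypothetical sampled reasoning trajectory for one problem."""
    tokens: list   # reasoning tokens, as strings for simplicity
    correct: bool  # whether the final answer passed the verifier


def failure_prefixes(rollouts, fraction=0.5):
    """Truncate each incorrect rollout to its leading `fraction` of tokens.

    The cut point is an assumption; the source does not say how prefixes
    are chosen.
    """
    prefixes = []
    for rollout in rollouts:
        if rollout.correct:
            continue  # saturated problems are mostly correct; skip those
        cut = max(1, int(len(rollout.tokens) * fraction))
        prefixes.append(rollout.tokens[:cut])
    return prefixes


def conditioned_prompts(problem, rollouts, fraction=0.5):
    """Build new training prompts: the problem followed by a failure prefix."""
    return [
        problem + " " + " ".join(prefix)
        for prefix in failure_prefixes(rollouts, fraction)
    ]


# Made-up rollouts for a saturated problem: three correct, one rare failure.
rollouts = [
    Rollout(["2", "plus", "2", "is", "4"], True),
    Rollout(["add", "them", "get", "4"], True),
    Rollout(["2", "times", "3", "is"], False),
    Rollout(["sum", "is", "4"], True),
]

prompts = conditioned_prompts("Q: 2+2?", rollouts)
```

Starting rollouts from such a prefix puts the model in a state where failure is likely, so the problem can produce a learning signal again even though unconditioned sampling almost always succeeds.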
Minwu Kim retweeted
Google DeepMind @GoogleDeepMind
Our breakthrough AI model AlphaGenome is helping scientists understand our DNA, predict the molecular impact of genetic changes, and drive new biological discoveries. 🧬 Find out more in @Nature: goo.gle/4bXlV6y
Replies 107 · Retweets 728 · Likes 3.4K · Views 1.1M
Minwu Kim @MinwuKim3
6) Iteration matters: Failure prefixes themselves become stale as the model improves. By iteratively refreshing failure prefixes, we get additional gains after performance plateaus. Saturated data keeps giving—if you know where to look.
Replies 1 · Retweets 0 · Likes 0 · Views 29