



Grace Kim
21 posts

@_grace_kim
First-year NLP PhD student @Penn prev undergrad @UTAustin intern @EPFL













I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025.
Today (Nov 5), Hall C, 14:30-16:00, 802-Main
Blog: debugml.github.io/ares
Paper: arxiv.org/abs/2507.12948
Code: github.com/fallcat/ares





Announcing our NeurIPS paper: Once Upon an Input: Reasoning via Per-Instance Program Synthesis (PIPS)
📝: arxiv.org/abs/2510.22849
Why do LLMs (and LLM agents) still struggle on hard reasoning problems that should be solvable by writing and executing code?
We find that the biggest problem with LLM-generated “programs” for reasoning is that they don’t compute anything; they just hardcode the answer!
PIPS fixes this by
1️⃣ abstracting the input into symbols,
2️⃣ generating code that maps symbols to the answer, and
3️⃣ refining the code with structural feedback.
🧵👇


🚨 Introducing DINCO, a zero-resource calibration method for verbalized LLM confidence. We normalize over self-generated distractors to enforce coherence ➡️ better-calibrated and less saturated (more usable) confidence!
⚠️ Problem: Standard verbalized confidence is overconfident and exhibits confidence saturation (i.e., confidence scores taking on few unique values). We find that overconfidence partly stems from LLMs’ suggestibility when faced with unfamiliar topics, i.e., a model gives more credibility to a claim simply because it is in the context.
💡 Solution: Mitigate suggestibility with Distractor-Normalized Coherence (DINCO) by normalizing over related claims (validator coherence) and combining with generator confidence.
📈 Results: DINCO outperforms existing methods on open-source and closed-source models, applied to short-form (TriviaQA and SimpleQA) and long-form (FactScore) generation domains.
🧵👇
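To make the "normalize over distractors" idea concrete, here is a minimal sketch of one plausible reading, not the paper's actual formula. It assumes the claim and its distractors are mutually exclusive, so their confidences should sum to at most 1; the function name `normalize_over_distractors` and the `max(1.0, total)` rule are illustrative assumptions, and the combination with generator confidence is left out.

```python
from typing import Sequence


def normalize_over_distractors(claim_conf: float,
                               distractor_confs: Sequence[float]) -> float:
    """Rescale a claim's verbalized confidence using its distractors.

    Hypothetical helper for illustration only (not from the DINCO paper).
    Assumes the claim and distractors are mutually exclusive, so a
    coherent set of confidences sums to at most 1.
    """
    total = claim_conf + sum(distractor_confs)
    # Shrink the claim's confidence only when the set is incoherent
    # (total above 1); otherwise leave it unchanged.
    return claim_conf / max(1.0, total)


# A suggestible model rates the claim AND two conflicting distractors highly,
# so the normalized confidence drops well below the raw 0.9.
overconfident = normalize_over_distractors(0.9, [0.8, 0.7])

# A coherent model rates the distractor low, so the confidence is unchanged.
coherent = normalize_over_distractors(0.6, [0.1])
```

In this sketch, a model that assigns high confidence to several incompatible claims gets its confidence in any one of them pulled down, which is the intuition behind enforcing coherence.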


Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning that is hard to verbalize via text CoTs
📉 Humans reach 93%, but Gemini-2.5-Pro reaches only 63% and Qwen2.5-72B only 38%

How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*?
Excited to announce 🎓📖 ResearchQA: a large-scale benchmark for evaluating long-form scholarly question answering across 75 fields, using queries 💬 and rubrics 📋 that are mined from survey articles 📚!
Website: cylumn.com/ResearchQA
Paper: arxiv.org/abs/2509.00496
Dataset: huggingface.co/datasets/reall…
Code: github.com/realliyifei/Re…

