Grace Kim

21 posts

@_grace_kim

First-year NLP PhD student @Penn prev undergrad @UTAustin intern @EPFL

Philadelphia, PA · Joined November 2020

360 Following · 166 Followers

Grace Kim retweeted

lynnette ng@quarbby·12 May

❤️ New preprint! Herein I chart the direction of my next era of research: Multi-Agent Social Systems. Link: arxiv.org/pdf/2605.07069
Current agentic AI systems are designed for optimization. But agent-agent and agent-human interactions are also important: collectively, they result in emergent population-level behavior. I argue that agentic AI systems should be designed with social theory as a structural prior. Social theory's core constructs, like role differentiation and co-evolution, specify agents' collective behavior, perceptions and actions.
Formally, I define a Multi-Agent Social System (MASS) as a networked environment where heterogeneous agents exchange information and influence each other over time. A MASS has: (1) an information exchange function, (2) an influence dynamics function and (3) a networked interaction structure.
A MASS has four structural priors, each drawn directly from social theory's account of how humans interact:
1. Strategic heterogeneity - agents are different, and agents at different network positions influence the overall network differently
2. Network-Constrained Dependence - agents only observe other agents in their local network, yet their collective behavior changes the entire system
3. Co-evolution - agent behavior changes the network, and network changes affect agent behavior
4. Distributional Instability - the distribution that one studies (e.g. beliefs, narratives) changes over time because of agent-agent / agent-human interactions
We also demonstrate how these four structural priors play out in MoltBook, and provide a research agenda for modeling, evaluation and governance of MASS. Now, come join me in this new research agenda!!

6.8K

Grace Kim retweeted

Hongli Zhan@HongliZhan·29 Apr

New paper! 🏁 My final one from my PhD at UT Austin. 🦜LLMs sound empathic, but they keep saying the same thing over and over. Not just the same words, the same discourse moves, turn after turn. We found that LLMs repeat the same discourse moves at nearly 2x the rate of human supporters across a multi-turn conversation, and existing metrics don’t catch this. So we built MINT 🌿 (Multi-turn Inter-tactic Novelty Training), the first RL framework to optimize discourse move diversity in multi-turn empathic dialogue. +25% empathy, −26% repetition. w/ @jessyjli @_desmond_ong et al. 📄 arxiv.org/abs/2604.11742

9.9K

Grace Kim retweeted

Adam Stein@adamlsteinl·10 Apr

We found widespread cheating on popular agent benchmarks, affecting 28+ submissions across 9 benchmarks and thousands of agent runs. Surprisingly, the top 3 submissions on Terminal-Bench 2 are all cheating! Here's what we found 🧵

613

182.9K

Grace Kim retweeted

Wenxuan Ding@Wenxuan_Ding_·20 Feb

Agents interact with environments to gather information. But exploration can be expensive. Tool use, retrieval, and user interaction carry latency or monetary cost. Calibrate-Then-Act allows LLM agents to balance exploration with cost: 📐 Estimate uncertainty about the environment 💭 Reason about cost-uncertainty tradeoffs ⚙️ Act accordingly

119

12.3K

Grace Kim retweeted

Yao Tang@tyao923·17 Jan

Think wider. Think shorter. 🚀 Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with on-policy RL exploration. 🎥 Multiplex Thinking, a sampling-based continuous reasoning paradigm:

110

815

152.2K

Grace Kim retweeted

Bowen Jiang (Lauren)@laurenbjiang·22 Dec

🧵(1/5) Personalization is becoming one of the next huge waves in artificial super-intelligence 🌊🌊🌊 🚨 We release PersonaMem-v2, the best-quality dataset for LLM personalization, helping your AI better understand users and build a memory that grows with each user over time. 🤗 Data: huggingface.co/datasets/bowen… 📖 Paper: arxiv.org/pdf/2512.06688

1.4K

Grace Kim retweeted

Negar Foroutan@negarforoutan·15 Dec

1/ 🌍 How does mixing data from hundreds of languages affect LLM training? In our new paper "Revisiting Multilingual Data Mixtures in Language Model Pretraining" we revisit core assumptions about multilinguality using 1.1B-3B models trained on up to 400 languages. 🧵👇

106

11.3K

Grace Kim retweeted

Jiayi (Raina) Xin@RainaXin·6 Dec

Sharing our poster for “Improved Therapeutic Antibody Reformatting through Multimodal Machine Learning” 🧬✨ Excited to present this work at @NeurIPSConf workshops this Sunday! (Poster below 👇)

629

Grace Kim retweeted

Weiqiu You@WeiqiuYou·6 Dec

Presenting our "Probabilistic Soundness Guarantees in LLM Reasoning Chains" poster at these NeurIPS workshops today (Sat Dec 6):
11:30–12:30 — SPIGM (Ballroom 20C)
1:15–2:10 — MLxOR (Ballroom 26AB)
4:15–5:25 — MATH-AI (Ballroom 6A)
Come chat about reasoning and stability!

Weiqiu You@WeiqiuYou

I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025 today (Nov 5), Hall C, 14:30-16:00, 802-Main.
Blog: debugml.github.io/ares
Paper: arxiv.org/abs/2507.12948
Code: github.com/fallcat/ares

1.8K

Grace Kim retweeted

Zayne Sprague ✈️ ICLR Rio@ZayneSprague·4 Dec

RL amplifies existing behaviors. Let’s prime models w/ good behaviors for better RL! Introducing SkillFactory: ✂️Rearrange model traces on a problem to demo verification + retry ⚙️SFT on those traces 🦾RL Result: Learn robust explicit verification + retry across domains 🧵

21.1K

Grace Kim retweeted

Helen Jin 🌟@helenj1n·3 Dec

Excited to take a break from winter and be in sunny San Diego for @NeurIPSConf #NeurIPS2025 Dec 2-7! ☀️ Happy to chat anything related to AI for humanity, AI safety, interpretability!

527

Grace Kim retweeted

Greg Durrett@gregd_nlp·3 Dec

I'm at NeurIPS until Friday! This morning, catch: @LiyanTang4 presenting ChartMuseum, testing if VLMs can do visual reasoning over charts @sebajoed presenting AstroVisBench, testing if coding LLMs can work with real astro data workflows & link in thread if you want to meet!

3.7K

Grace Kim retweeted

Adam Stein@adamlsteinl·3 Dec

Excited to be at NeurIPS this week presenting my recent work with @NeelayV! Find us at 4:30pm at Exhibit Hall C,D,E poster #3717! Come by to see how LLMs struggle to use code for hard reasoning tasks, and how per-instance program synthesis (PIPS) fixes it.

Adam Stein@adamlsteinl

Announcing our NeurIPS paper: Once Upon an Input: Reasoning via Per-Instance Program Synthesis (PIPS) 📝: arxiv.org/abs/2510.22849 Why do LLMs (and LLM agents) still struggle on hard reasoning problems which should be solvable by writing and executing code? We find that the biggest problem with LLM generated “programs” for reasoning is that they don’t compute anything, they just hardcode the answer! PIPS fixes this by 1️⃣ abstracting the input into symbols, 2️⃣ generating code that maps symbols to the answer, and 3️⃣ refining the code with structural feedback. 🧵👇

987

Grace Kim retweeted

Niloofar@niloofar_mire·2 Dec

Join us @WiMLworkshop round tables, lots of fun discussions on AI agents!

114

7.5K

Grace Kim retweeted

Greg Durrett@gregd_nlp·2 Dec

📢 Postdoc position 📢 I’m recruiting a postdoc for my lab at NYU! Topics include LM reasoning, creativity, limitations of scaling, AI for science, & more! Apply by Feb 1. (Different from NYU Faculty Fellows, which are also great but less connected to my lab.) Link in 🧵

146

21.8K

Grace Kim retweeted

Victor Wang@victorwang37·2 Oct

🚨 Announcing a new LLM calibration method, DINCO, which enforces confidence coherence (that probs must sum to 1) by having the LLM verbalize its confidence independently on self-generated distractors, and normalizing by the total confidence. Major gains on long + short-form QA!

Elias Stengel-Eskin@EliasEskin

🚨 Introducing DINCO, a zero-resource calibration method for verbalized LLM confidence. We normalize over self-generated distractors to enforce coherence ➡️ better-calibrated and less saturated (more usable) confidence! ⚠️ Problem: Standard verbalized confidence is overconfident and exhibits confidence saturation (i.e. confidence scores taking on few unique values). We find that overconfidence partly stems from LLMs’ suggestibility when faced with unfamiliar topics, i.e., a model gives more credibility to a claim simply because it is in the context. 💡 Solution: Mitigate suggestibility with Distractor-Normalized Coherence (DINCO) by normalizing over related claims (validator coherence) and combining with generator confidence. 📈 Results: DINCO outperforms existing methods on open-source and closed-source models, applied to short-form (TriviaQA and SimpleQA) and long-form (FactScore) generation domains. 🧵👇

3.7K

Grace Kim retweeted

Liyan Tang@LiyanTang4·19 Sep

Our paper "ChartMuseum 🖼️" is now accepted to the #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉 chart understanding questions, especially those that involve visual reasoning 👀!

Liyan Tang@LiyanTang4

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to be verbalized via text CoTs 📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B

3.7K

Grace Kim retweeted

Marc Marone@ruyimarone·9 Sep

3T tokens, ~1800 languages, 2 models - we’re releasing mmBERT, a modern multilingual encoder model!

400

31K

Grace Kim retweeted

Allen Chang@AllenCChang·4 Sep

What if survey-derived rubrics 📋 graded ChatGPT instead of vibes? We benchmark LLMs & deep research systems across 75 research fields 🩺🧬🦾⚗️🏛️🎭💹: Perplexity deep research wins > 82% of head-to-heads vs the next best! w/ @realliyifei, @cmalaviya11, and @yatskar

Li S. Yifei@realliyifei

How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using queries 💬and rubrics📋that are mined from survey articles 📚! Website: cylumn.com/ResearchQA Paper: arxiv.org/abs/2509.00496 Dataset: huggingface.co/datasets/reall… Code: github.com/realliyifei/Re…

2.1K

Grace Kim retweeted

Li S. Yifei@realliyifei·4 Sep

9.2K
