Grace Kim

21 posts


@_grace_kim

First-year NLP PhD student @Penn; prev. undergrad @UTAustin, intern @EPFL

Philadelphia, PA · Joined November 2020
360 Following · 166 Followers
Grace Kim retweeted
lynnette ng (@quarbby)
❤️ New preprint! It charts the directions of my next era of research: Multi-Agent Social Systems. Link: arxiv.org/pdf/2605.07069

Current agentic AI systems are designed for optimization. But agent-agent and agent-human interactions matter too, because collectively they result in emergent population-level behavior. I argue that agentic AI systems should be designed with social theory as a structural prior. Social theory's core constructs, like role differentiation and co-evolution, specify agents' collective behavior, perceptions and actions.

Formally, I define a Multi-Agent Social System (MASS) as a networked environment where heterogeneous agents exchange information and influence each other over time. An MASS has: (1) an information exchange function, (2) an influence dynamics function and (3) a networked interaction structure.

An MASS has four structural priors, each drawn directly from social theory's account of how humans interact.
1. Strategic heterogeneity - agents are different, and agents at different network positions influence the overall network differently
2. Network-constrained dependence - agents only observe other agents in their local network, yet their collective behavior changes the entire system
3. Co-evolution - agent behavior changes the network, and network changes affect agent behavior
4. Distributional instability - the distribution that one studies (e.g. beliefs, narratives) changes over time because of agent-agent/agent-human interactions

We also demonstrate how these four structural priors play out in MoltBook, and provide a research agenda for modeling, evaluation and governance of MASS. Now, come join me in this new research agenda!!
2 replies · 20 retweets · 81 likes · 6.8K views
Grace Kim retweeted
Hongli Zhan (@HongliZhan)
New paper! 🏁 My final one from my PhD at UT Austin. 🦜LLMs sound empathic, but they keep saying the same thing over and over. Not just the same words, the same discourse moves, turn after turn. We found that LLMs repeat the same discourse moves at nearly 2x the rate of human supporters across a multi-turn conversation, and existing metrics don’t catch this. So we built MINT 🌿 (Multi-turn Inter-tactic Novelty Training), the first RL framework to optimize discourse move diversity in multi-turn empathic dialogue. +25% empathy, −26% repetition. w/ @jessyjli @_desmond_ong et al. 📄 arxiv.org/abs/2604.11742
1 reply · 12 retweets · 61 likes · 9.9K views
Grace Kim retweeted
Adam Stein (@adamlsteinl)
We found widespread cheating on popular agent benchmarks, affecting 28+ submissions across 9 benchmarks and thousands of agent runs. Surprisingly, the top 3 submissions on Terminal-Bench 2 are all cheating! Here's what we found 🧵
35 replies · 97 retweets · 613 likes · 182.9K views
Grace Kim retweeted
Wenxuan Ding (@Wenxuan_Ding_)
Agents interact with environments to gather information. But exploration can be expensive. Tool use, retrieval, and user interaction carry latency or monetary cost. Calibrate-Then-Act allows LLM agents to balance exploration with cost: 📐 Estimate uncertainty about the environment 💭 Reason about cost-uncertainty tradeoffs ⚙️ Act accordingly
7 replies · 32 retweets · 119 likes · 12.3K views
Grace Kim retweeted
Yao Tang (@tyao923)
Think wider. Think shorter. 🚀 Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with on-policy RL exploration. 🎥 Multiplex Thinking, a sampling-based continuous reasoning paradigm:
25 replies · 110 retweets · 815 likes · 152.2K views
Grace Kim retweeted
Bowen Jiang (Lauren) (@laurenbjiang)
🧵 (1/5) Personalization is becoming one of the next huge waves in artificial super-intelligence 🌊🌊🌊 🚨 We release PersonaMem-v2, the best-quality dataset for LLM personalization, helping your AI better understand users and build a memory that grows with each user over time. 🤗 Data: huggingface.co/datasets/bowen… 📖 Paper: arxiv.org/pdf/2512.06688
1 reply · 7 retweets · 13 likes · 1.4K views
Grace Kim retweeted
Negar Foroutan (@negarforoutan)
1/ 🌍 How does mixing data from hundreds of languages affect LLM training? In our new paper "Revisiting Multilingual Data Mixtures in Language Model Pretraining" we revisit core assumptions about multilinguality using 1.1B-3B models trained on up to 400 languages. 🧵👇
2 replies · 29 retweets · 106 likes · 11.3K views
Grace Kim retweeted
Jiayi (Raina) Xin (@RainaXin)
Sharing our poster for “Improved Therapeutic Antibody Reformatting through Multimodal Machine Learning” 🧬✨ Excited to present this work at @NeurIPSConf workshops this Sunday! (Poster below 👇)
1 reply · 1 retweet · 8 likes · 629 views
Grace Kim retweeted
Weiqiu You (@WeiqiuYou)
Presenting our "Probabilistic Soundness Guarantees in LLM Reasoning Chains" poster at these NeurIPS workshops today (Sat Dec 6):
11:30–12:30: SPIGM (Ballroom 20C)
1:15–2:10: MLxOR (Ballroom 26AB)
4:15–5:25: MATH-AI (Ballroom 6A)
Come chat about reasoning and stability!
Quoted post from Weiqiu You (@WeiqiuYou):

I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025 today (Nov 5), Hall C, 14:30-16:00, 802-Main.
Blog: debugml.github.io/ares
Paper: arxiv.org/abs/2507.12948
Code: github.com/fallcat/ares

1 reply · 2 retweets · 7 likes · 1.8K views
Grace Kim retweeted
Zayne Sprague ✈️ ICLR Rio (@ZayneSprague)
RL amplifies existing behaviors. Let’s prime models w/ good behaviors for better RL! Introducing SkillFactory: ✂️Rearrange model traces on a problem to demo verification + retry ⚙️SFT on those traces 🦾RL Result: Learn robust explicit verification + retry across domains 🧵
2 replies · 26 retweets · 69 likes · 21.1K views
Grace Kim retweeted
Helen Jin 🌟 (@helenj1n)
Excited to take a break from winter and be in sunny San Diego for @NeurIPSConf #NeurIPS2025 Dec 2-7! ☀️ Happy to chat about anything related to AI for humanity, AI safety, or interpretability!
0 replies · 1 retweet · 5 likes · 527 views
Grace Kim retweeted
Greg Durrett (@gregd_nlp)
I'm at NeurIPS until Friday! This morning, catch: @LiyanTang4 presenting ChartMuseum, testing if VLMs can do visual reasoning over charts @sebajoed presenting AstroVisBench, testing if coding LLMs can work with real astro data workflows & link in thread if you want to meet!
4 replies · 12 retweets · 60 likes · 3.7K views
Grace Kim retweeted
Adam Stein (@adamlsteinl)
Excited to be at NeurIPS this week presenting my recent work with @NeelayV! Find us at 4:30pm at Exhibit Hall C,D,E poster #3717! Come by to see how LLMs struggle to use code for hard reasoning tasks, and how per-instance program synthesis (PIPS) fixes it.
Quoted post from Adam Stein (@adamlsteinl):

Announcing our NeurIPS paper: Once Upon an Input: Reasoning via Per-Instance Program Synthesis (PIPS) 📝: arxiv.org/abs/2510.22849 Why do LLMs (and LLM agents) still struggle on hard reasoning problems that should be solvable by writing and executing code? We find that the biggest problem with LLM-generated “programs” for reasoning is that they don’t compute anything, they just hardcode the answer! PIPS fixes this by 1️⃣ abstracting the input into symbols, 2️⃣ generating code that maps symbols to the answer, and 3️⃣ refining the code with structural feedback. 🧵👇

0 replies · 3 retweets · 5 likes · 987 views
Grace Kim retweeted
Niloofar (@niloofar_mire)
Join us at the @WiMLworkshop round tables, lots of fun discussions on AI agents!
2 replies · 6 retweets · 114 likes · 7.5K views
Grace Kim retweeted
Greg Durrett (@gregd_nlp)
📢 Postdoc position 📢 I’m recruiting a postdoc for my lab at NYU! Topics include LM reasoning, creativity, limitations of scaling, AI for science, & more! Apply by Feb 1. (Different from NYU Faculty Fellows, which are also great but less connected to my lab.) Link in 🧵
4 replies · 58 retweets · 146 likes · 21.8K views
Grace Kim retweeted
Liyan Tang (@LiyanTang4)
Our paper "ChartMuseum 🖼️" is now accepted to the #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉 chart understanding questions, especially those that involve visual reasoning 👀!
Quoted post from Liyan Tang (@LiyanTang4):

Introducing ChartMuseum 🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning that is hard to verbalize via text CoTs
📉 Humans reach 93%, versus 63% for Gemini-2.5-Pro and 38% for Qwen2.5-72B

1 reply · 22 retweets · 37 likes · 3.7K views
Grace Kim retweeted
Marc Marone (@ruyimarone)
3T tokens, ~1800 languages, 2 models - we’re releasing mmBERT, a modern multilingual encoder model!
11 replies · 67 retweets · 400 likes · 31K views
Grace Kim retweeted
Li S. Yifei (@realliyifei)
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering across 75 fields, using queries 💬 and rubrics 📋 that are mined from survey articles 📚!
Website: cylumn.com/ResearchQA
Paper: arxiv.org/abs/2509.00496
Dataset: huggingface.co/datasets/reall…
Code: github.com/realliyifei/Re…
1 reply · 24 retweets · 61 likes · 9.2K views