Hangoo Kang

25 posts

@hangoo_kang

MSCS @StanfordAILab | AI research @ Scaling Intelligence Lab with @Azaliamirh | Previously @siebelschool @uiuc_focal_lab

Joined January 2024
94 Following · 103 Followers
Pinned Tweet
Hangoo Kang @hangoo_kang
Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇
4
29
155
14K
Hangoo Kang retweeted
Jehyeok Yeon @ ICML 2026 🇰🇷
AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵
11
12
106
42.5K
Hangoo Kang @hangoo_kang
@blc_16 This is exactly the problem we tackle in our paper TRACE! TRACE contrasts successful vs. failed trajectories to identify specific capabilities the agent lacks on the task and generates synthetic capability-targeted RL environments with denser signals!
Quoting Hangoo Kang @hangoo_kang:
Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇
1
1
4
700
Ben Cohen @blc_16
If you want to understand why RL struggles with long-horizon agent tasks, this is a good explanation. The core issue is that sparse rewards throw away most of the useful information in the trajectory.
GEPA tries to learn from the trajectory itself, using reflection in text space, instead of only optimizing on the final reward. It generates textual critiques of trajectories, proposes prompt edits, and then selects updates along a Pareto frontier between exploration and exploitation. Rather than collapsing everything into one reward number, it keeps more information about why a run failed and uses that to make legible changes.
It’ll be interesting to see where this goes when people combine that kind of trajectory-level reflection with RL, using RL for optimization while preserving a much richer signal about why the agent succeeded or failed.
8
20
204
19.1K
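The Pareto-frontier selection mentioned in the tweet above can be illustrated with a minimal sketch. This is not GEPA's actual implementation: the candidate names and the two score dimensions are hypothetical, and the only assumption is that each candidate prompt has a tuple of scores where higher is better on every objective.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    `candidates` maps a (hypothetical) prompt name to a tuple of scores,
    with higher meaning better on every objective.
    """
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidate prompts scored on two objectives.
candidates = {
    "prompt_a": (0.9, 0.2),
    "prompt_b": (0.5, 0.5),
    "prompt_c": (0.4, 0.4),  # worse than prompt_b on both objectives
    "prompt_d": (0.2, 0.9),
}
front = pareto_front(candidates)
```

Keeping every non-dominated candidate, rather than only the one with the best average, is what preserves prompts that are strong on some objectives and weak on others.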
Hangoo Kang retweeted
Azalia Mirhoseini @Azaliamirh
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.
Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score
You can get a verification score in a single sampling pass per candidate pair.
Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io
Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
34
114
987
115.8K
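The expected-score step described in the tweet above can be sketched as follows. This is an illustration under stated assumptions, not the released LLM-as-a-Verifier code: the function name is hypothetical, the log-probabilities are assumed to be already extracted for each rank token, and renormalizing over the rank tokens only is an assumption the tweet does not confirm.

```python
import math


def expected_score(rank_logprobs):
    """Return the probability-weighted mean rank.

    `rank_logprobs` maps each integer rank (1..k) to the log-probability
    the model assigned to that rank's token. Probabilities are renormalized
    over the rank tokens only (an assumption), because other tokens may
    hold some of the probability mass.
    """
    max_lp = max(rank_logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {rank: math.exp(lp - max_lp) for rank, lp in rank_logprobs.items()}
    total = sum(weights.values())
    return sum(rank * w for rank, w in weights.items()) / total


# Hypothetical log-probs for ranks 1..3.
score = expected_score({1: math.log(0.1), 2: math.log(0.2), 3: math.log(0.7)})
```

Compared with taking only the single most likely rank, the expected value separates candidates that would otherwise receive the same integer score.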
Hangoo Kang retweeted
Tarun Suresh @TarunSures41845
Great work with @hangoo_kang, @JonSaadFalcon, and @Azaliamirh on a new system for environment-specific LLM agent self-improvement that trains the agent on the underlying capabilities it lacks 🚀
Quoting Hangoo Kang @hangoo_kang:
Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇
1
2
15
2.8K
Hangoo Kang @hangoo_kang
We thank Debangshu Banerjee, Tanvir Bhathal, Alex Bloom, Andy Dimnaku, Simon Guo, Sid Jha, Hermann Kumbong, Jacky Kwok, Andrew Shi, and Shayan Talaei for their feedback. Thanks to @PrimeIntellect, @LambdaAPI, @GoogleResearch, and @IBMResearch for compute that made this possible! 😊
0
0
5
342
Hangoo Kang @hangoo_kang
And training on capability-targeted environments beats simply optimizing capability descriptions in the prompt. Learning the capability > describing the capability.
1
0
4
348
Hangoo Kang retweeted
Jacky Kwok @jackyk02
We release LLM-as-a-Verifier 🧠: A general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site
10
57
443
54.4K
Hangoo Kang retweeted
Jon Saad-Falcon @JonSaadFalcon
Since the initial Intelligence-per-Watt release, we've extended the open-source profiling library to measure the intelligence efficiency of agentic workloads. Most recently, we wanted to calculate how many joules it takes to solve all the queries in TerminalBenchV2. This can help us better understand how much intelligence is delivered per joule and per watt on agentic workloads. Here’s what we found 🧵
5
22
91
7.7K
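The joule accounting mentioned in the tweet above comes down to integrating power over time. The sketch below is a generic illustration, not the Intelligence-per-Watt library's API: the function names and sample values are hypothetical, and counting "intelligence" as the number of solved queries is an assumption.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds) with the trapezoid rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Area of the trapezoid between two consecutive power samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def solved_per_joule(num_solved, joules):
    """Queries solved per joule (one assumed proxy for intelligence efficiency)."""
    return num_solved / joules


# Hypothetical run: constant 100 W draw sampled over 2 seconds, 50 queries solved.
joules = energy_joules([0.0, 1.0, 2.0], [100.0, 100.0, 100.0])
efficiency = solved_per_joule(50, joules)
```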