Hangoo Kang (@hangoo_kang) - Twitter Profile

Pinned Tweet

Hangoo Kang@hangoo_kang·13 Apr

Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇

4

29

155

14K

Hangoo Kang retweeted

Jehyeok Yeon @ ICML 2026 🇰🇷@jehyeoky248·5d

AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵

11

12

106

42.5K

Hangoo Kang@hangoo_kang·11 May

@blc_16 This is exactly the problem we tackle in our paper TRACE! TRACE contrasts successful vs. failed trajectories to identify specific capabilities the agent lacks on the task and generates synthetic capability-targeted RL environments with denser signals!

Hangoo Kang@hangoo_kang

Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇

1

4

700

Ben Cohen@blc_16·11 May

If you want to understand why RL struggles with long-horizon agent tasks, this is a good explanation. The core issue is that sparse rewards throw away most of the useful information in the trajectory. GEPA tries to learn from the trajectory itself, using reflection in text space, instead of only optimizing on the final reward. GEPA generates textual critiques of trajectories, proposes prompt edits, and then selects updates along a Pareto frontier between exploration and exploitation. Instead of collapsing everything into one reward number, it keeps more information about why a run failed and uses that to make legible changes. It’ll be interesting to see where this goes when people combine that kind of trajectory-level reflection with RL, using RL for optimization while preserving a much richer signal about why the agent succeeded or failed.

8

20

204

19.1K
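
The Pareto-frontier selection mentioned in the tweet above can be illustrated with a small sketch. This is not GEPA's implementation: it assumes each candidate prompt has a list of per-task scores (higher is better), and the names `dominates`, `pareto_front`, and the sample scores are hypothetical. It only shows the general idea of keeping every candidate that no other candidate beats across the board, rather than collapsing scores into a single number.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical per-task scores for three candidate prompts (assumed data).
candidates = {
    "prompt_a": [0.9, 0.2],  # strong on task 0, weak on task 1
    "prompt_b": [0.3, 0.8],  # weak on task 0, strong on task 1
    "prompt_c": [0.2, 0.1],  # worse than both on every task
}
front = pareto_front(candidates)
print(front)
```

Here `prompt_a` and `prompt_b` both survive because each is best on a different task, while `prompt_c` is dropped. A single averaged score would have ranked them on one axis and lost that distinction.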

Hangoo Kang retweeted

ikka@Shahules786·22 Apr

(5/n) Employing techniques like TRACE and RIFT will help us scale up our tool-use environments like Tau2-Infinity to be more robust, larger synthetic post-training datasets. arxiv.org/pdf/2604.05336 arxiv.org/pdf/2604.01375 TRACE authors: @hangoo_kang @TarunSures41845 @JonSaadFalcon @Azaliamirh RIFT authors: @qi_zhengyang @pham_derek @amanda_dsouza @ArminPCM @paroma_varma et al

0

2

8

457

Hangoo Kang retweeted

Jon Saad-Falcon@JonSaadFalcon·20 Apr

Say hi to @OpenJarvisAI 👋 If you have issues, want to make a PR, or simply chat, just @OpenJarvisAI in a tweet! This account is itself an OpenJarvis instance: running 24/7 on an NVIDIA DGX Spark, triaging issues + PRs on the repo and serving as a personal assistant for the lab! For personal AI on personal devices, check out: github.com/open-jarvis/Op… x.com/JonSaadFalcon/…

1

8

25

2K

Hangoo Kang@hangoo_kang·14 Apr

@kexun_zhang @TarunSures41845 @JonSaadFalcon @Azaliamirh We're actively adding more benchmarks!

0

1

125

Kexun Zhang@kexun_zhang·13 Apr

@hangoo_kang @TarunSures41845 @JonSaadFalcon @Azaliamirh why not terminal bench / swe bench etc

1

0

226

Hangoo Kang@hangoo_kang·14 Apr

@Shahules786 @TarunSures41845 @JonSaadFalcon @Azaliamirh Sure, can you email me and cc the other authors?

0

2

91

ikka@Shahules786·14 Apr

@hangoo_kang @TarunSures41845 @JonSaadFalcon @Azaliamirh would love to chat, DM ?

1

0

89

Hangoo Kang@hangoo_kang·14 Apr

@Shahules786 @TarunSures41845 @JonSaadFalcon @Azaliamirh Just fixed it, will be updated soon. Thanks for pointing it out!

1

0

1

66

ikka@Shahules786·14 Apr

@hangoo_kang @TarunSures41845 @JonSaadFalcon @Azaliamirh Also the paper link here scalingintelligence.stanford.edu/blogs/trace/#s… seems incorrect

1

0

154

Hangoo Kang retweeted

Azalia Mirhoseini@Azaliamirh·14 Apr

Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:

1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score

You can get a verification score in a single sampling pass per candidate pair.

Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io

Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05

34

114

987

115.8K
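
The second step in the tweet above (turning rank-token log-probs into an expected score) can be sketched roughly as follows. This is an illustrative guess, not the authors' code: it assumes you already have a log-prob for each rank token from 1 to k, and the function name `expected_rank_score` and the sample values are hypothetical. Because the rank tokens usually hold only part of the model's total probability mass, the sketch renormalizes over them before taking the expectation.

```python
import math
from typing import Dict


def expected_rank_score(rank_logprobs: Dict[int, float]) -> float:
    """Expected rank under the distribution renormalized over the rank tokens."""
    # Subtract the max log-prob before exponentiating for numerical stability.
    max_lp = max(rank_logprobs.values())
    weights = {rank: math.exp(lp - max_lp) for rank, lp in rank_logprobs.items()}
    total = sum(weights.values())
    # Probability-weighted average of the rank values.
    return sum(rank * w for rank, w in weights.items()) / total


# Hypothetical log-probs for rank tokens "1", "2", "3" (assumed data).
logprobs = {1: math.log(0.1), 2: math.log(0.2), 3: math.log(0.7)}
score = expected_rank_score(logprobs)
print(round(score, 3))
```

Unlike taking only the single most likely rank token, the expectation gives a continuous score, so two candidates that would both be labelled "3" can still be told apart by how confident the model was.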

Hangoo Kang retweeted

Tarun Suresh@TarunSures41845·13 Apr

Great work with @hangoo_kang , @JonSaadFalcon , and @Azaliamirh on a new system for environment-specific LLM agent self-improvement that trains the agent on the underlying capabilities it lacks 🚀

Hangoo Kang@hangoo_kang

Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇

1

2

15

2.8K

Hangoo Kang@hangoo_kang·13 Apr

We thank Debangshu Banerjee, Tanvir Bhathal, Alex Bloom, Andy Dimnaku, Simon Guo, Sid Jha, Hermann Kumbong, Jacky Kwok, Andrew Shi, and Shayan Talaei for their feedback. Thanks to @PrimeIntellect, @LambdaAPI, @GoogleResearch, and @IBMResearch for compute that made this possible! 😊

0

5

342

Hangoo Kang@hangoo_kang·13 Apr

And training on capability-targeted environments beats simply optimizing capability descriptions in the prompt. Learning the capability > describing the capability.

1

0

4

348

Hangoo Kang retweeted

Jacky Kwok@jackyk02·10 Apr

We release LLM-as-a-Verifier 🧠: A general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site

10

57

443

54.4K

Hangoo Kang retweeted

Jon Saad-Falcon@JonSaadFalcon·17 Mar

Since the initial Intelligence-per-Watt release, we've extended the open-source profiling library to measure the intelligence efficiency of agentic workloads. Most recently, we wanted to calculate how many joules it takes to solve all the queries in TerminalBenchV2. This can help us better understand how much intelligence is delivered per joule and per watt on agentic workloads. Here’s what we found 🧵

5

22

91

7.7K
