Adam Łucek


Viv@Vtrivedy10·2d

using a good Skill and a CLI, and seeing Codex’s in-context-learning ability, is a magical experience.
pointed it to the Harbor skills repo and the Prime Intellect CLI, gave it an objective of what we wanted to RL, and just watched it chug along, figuring out the whole setup and debugging weird niche errors.
us humans get the fun part of interpreting results, thinking through what’s happening, and deciding what to do next.
agents training agents 🔥 humans guiding the process


Adam Łucek@AdamRLucek·5d

@Abhilekh_Meda Maybe soon 🤔


Abhilekh Meda@Abhilekh_Meda·5d

@AdamRLucek Would love to read a detailed blog if you ever write one about this


Adam Łucek@AdamRLucek·5d

Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces, which capture exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here are two of the ways we use traces to build evals for production agents 👇


Adam Łucek@AdamRLucek·5d

@pothuLabs A real missed opportunity for most!


Pothu@pothuLabs·5d

@AdamRLucek The "if you know what to do with it" is the whole catch. Most teams collect traces and never open them. The value isn't the data, it's the loop: failures become evals, evals catch the next regression. Skip the loop and it's just expensive logging.


Adam Łucek@AdamRLucek·5d

@novasarc01 Ah ok so this is testing whether the proposed eval actually catches the failure mode that it’s supposed to. Nice! Need to look deeper into user session coding traces- agreed with the earlier point that standard benchmarks rarely reflect actual user experience


λux@novasarc01·5d

yeah exactly, “synthetic ground truth” is a pretty good way to describe part of it, but I’m using it more as a trajectory-level reference/counterfactual than as absolute ground truth. the idea is…from a failed trace, Trace2Eval generates an eval that encodes what should have been invariant in the trajectory. for example before the first implementation edit, the agent should have read the relevant test/source context; after the last edit, it should have run a meaningful verification; after a tool error, it should recover instead of continuing as if nothing happened. then i rerun the eval on two versions of the trajectory: the original trace and a symbolic counterfactual trace where I apply the minimal correction like inserting READ_TEST before EDIT. if the original fails and the counterfactual passes that tells me the eval is actually sensitive to the hypothesized causal decision point…not just randomly flagging the trace. so I don’t treat it as “this proves the task would have succeeded.” it’s more like causal support at the trace level…like did fixing the suspected local mistake remove the eval violation? over many traces that becomes useful for seeing which failure patterns are real/recurrent versus just noisy artifacts of agent behavior.
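The original-versus-counterfactual check described above can be sketched roughly as follows. This is a minimal illustration and not Trace2Eval's actual code: the action labels (`PLAN`, `EDIT`, `READ_TEST`, `FINISH`), the list-of-strings trace shape, and all function names are assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical normalized trace: an ordered list of action labels.
Trace = List[str]


def read_before_first_edit(trace: Trace) -> bool:
    """Eval: passes if READ_TEST occurs before the first EDIT (or there is no EDIT)."""
    if "EDIT" not in trace:
        return True
    return "READ_TEST" in trace[: trace.index("EDIT")]


def insert_read_before_edit(trace: Trace) -> Trace:
    """Symbolic counterfactual: insert READ_TEST just before the first EDIT."""
    if "EDIT" not in trace:
        return list(trace)
    i = trace.index("EDIT")
    return trace[:i] + ["READ_TEST"] + trace[i:]


def eval_is_sensitive(
    eval_fn: Callable[[Trace], bool],
    correction: Callable[[Trace], Trace],
    trace: Trace,
) -> bool:
    """True only when the original trace fails and the corrected trace passes."""
    return (not eval_fn(trace)) and eval_fn(correction(trace))


failed_trace = ["PLAN", "EDIT", "EDIT", "FINISH"]
print(eval_is_sensitive(read_before_first_edit, insert_read_before_edit, failed_trace))  # True
```

A `True` result only says the eval responds to the suspected decision point. As the tweet notes, it does not show the task would have succeeded.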


Adam Łucek@AdamRLucek·5d

@novasarc01 Interesting! So you also generate what could be called, for lack of a better term, a "synthetic ground truth" based on what should be expected/have happened/ideal state with the eval. What does rerunning the eval on that tell you? Or do you use it more as a target/reference?


λux@novasarc01·5d

@AdamRLucek also outputs are of lower quality when the rate limit is just about to hit…i read somewhere this is termed as context anxiety…the model just tries to sum up everything…i guess better harness engineering would solve this.


Adam Łucek@AdamRLucek·5d

@MikroJaxi @LangChain Benchmark from a dataset of production mistakes 🤔🤔🤔


Kailash@MikroJaxi·5d

@LangChain @AdamRLucek The real dataset for agents isn't benchmarks. It's production mistakes.


LangChain@LangChain·5d

.@AdamRLucek on how we use traces to build evals for production agents.



Adam Łucek@AdamRLucek·5d

@LeoTava8 Felt on redesigning orchestration- it’s not always a prompt change but can be a limitation of the whole harness/environment itself


Leo Tavares@LeoTava8·5d

@AdamRLucek The underrated part: traces expose the gap between your intended control flow and what the agent actually does. That divergence is the most valuable signal for redesigning orchestration, not just debugging.


Adam Łucek@AdamRLucek·5d

@novasarc01 That’s sick! Any interesting evals/findings coming from Claude/codex traces? Have been ideating around what might actually be useful to know/considered a failure when it comes to analyzing personal coding agent usage


λux@novasarc01·5d

this is super cool! currently i am working on a similar project (trace2eval)…it converts failed codex/claude agent traces into deterministic regression evals by normalizing raw logs into action timelines…detecting failure motifs like premature intervention/no verification/ignored tool errors and extracting compact causal slices…i’m bullish on trace → eval bcoz failed agent runs are too information-dense to leave as dead logs. i think turning them into evals makes the exact failure mechanism reusable.
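To make "normalizing raw logs into action timelines" and "detecting failure motifs" a bit more concrete, here is a rough sketch. The raw event shape, the labels, and both motif rules are hypothetical. Real Codex/Claude logs look different, and "no successful tool call after the last edit" is only a crude stand-in for "no verification".

```python
from typing import Dict, List

# Hypothetical raw log events; real agent logs will have a different shape.
RAW_LOG = [
    {"type": "tool_call", "name": "read_file", "error": None},
    {"type": "tool_call", "name": "run_tests", "error": "command not found"},
    {"type": "file_edit", "path": "app.py"},
    {"type": "message", "text": "Done!"},
]


def normalize(events: List[Dict]) -> List[str]:
    """Collapse raw events into a compact action timeline."""
    timeline = []
    for event in events:
        if event["type"] == "tool_call":
            timeline.append("TOOL_ERROR" if event.get("error") else "TOOL_OK")
        elif event["type"] == "file_edit":
            timeline.append("EDIT")
        else:
            timeline.append("OTHER")
    return timeline


def detect_motifs(timeline: List[str]) -> List[str]:
    """Flag two illustrative failure motifs on a normalized timeline."""
    motifs = []
    # Ignored tool error: a failed tool call immediately followed by an edit.
    if any(a == "TOOL_ERROR" and b == "EDIT" for a, b in zip(timeline, timeline[1:])):
        motifs.append("ignored_tool_error")
    # No verification (crude proxy): no successful tool call after the last edit.
    if "EDIT" in timeline:
        last_edit = len(timeline) - 1 - timeline[::-1].index("EDIT")
        if "TOOL_OK" not in timeline[last_edit + 1:]:
            motifs.append("no_verification")
    return motifs


print(detect_motifs(normalize(RAW_LOG)))  # ['ignored_tool_error', 'no_verification']
```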


Adam Łucek@AdamRLucek·5d

@Vtrivedy10 It's literally that easy 📈📈📈


Viv@Vtrivedy10·5d

Building agents (like software) is a deeply iterative process --> this is why we try to supply as much easy-to-use tooling and infra as possible so builders can start on their agent improvement journey asap.
storing and understanding traces is the fastest onramp to understand agent behavior and detect where your agents are messing up today. things like:
- what are people asking my agents?
- what actions are resulting in failed tool calls?
- what kind of work is expensive to run today for our agents? can it be cheaper?
once you mine trace data, you can "do something" to improve your agents:
- harness engineering
- making evals
- swapping models
- post-training
these are all levers we have to improve agent dev, but pretty much all of this is downstream of gathering and understanding traces at scale.
ship a v1
turn on tracing
understand the traces and errors
experiments
loop
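As a small illustration of the "which actions fail" and "what is expensive" questions, the sketch below aggregates a handful of trace records. The record fields (`tool`, `status`, `cost_usd`) and the sample values are made up for the example and do not follow any particular tracing product's schema.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical trace records, one per tool call.
TRACES = [
    {"tool": "search", "status": "ok", "cost_usd": 0.002},
    {"tool": "search", "status": "error", "cost_usd": 0.002},
    {"tool": "run_code", "status": "error", "cost_usd": 0.010},
    {"tool": "run_code", "status": "error", "cost_usd": 0.015},
]


def summarize(records: List[Dict]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (failed call count per tool, total cost per tool)."""
    failures: Counter = Counter()
    cost: Dict[str, float] = defaultdict(float)
    for record in records:
        cost[record["tool"]] += record["cost_usd"]
        if record["status"] == "error":
            failures[record["tool"]] += 1
    return dict(failures), dict(cost)


failures, cost = summarize(TRACES)
print("failed calls:", failures)
print("most expensive tool:", max(cost, key=cost.get))
```

In this toy data `run_code` is both the most failure-prone and the most expensive tool, which is the kind of signal that would point toward a harness or tool fix before anything else.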



Adam Łucek@AdamRLucek·5d

So… turns out SWE hasn't escaped testing with AI, rather it's more important than ever!


Adam Łucek@AdamRLucek·5d

Of course, combinations of these evals can cover a wide range of behaviors, scenarios, and edge cases. With both end-to-end and behavioral coverage, the eval suite can be used in some unique ways. The obvious one is traditional regression testing: making sure a change to your prompt or harness doesn't break existing behavior. But more interestingly, these evals can also serve as targets for optimization. I.e. a good suite can show strengths and weaknesses across model families, and pinpoint exactly where, say, a prompting change may let an open source model perform as well as a frontier closed model in a given scenario.
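A rough sketch of how one suite can serve both purposes: compute pass rates per (configuration, scenario) pair, then flag scenarios where a candidate does worse than a baseline. The configuration names, scenario names, and results below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical eval results: (configuration, scenario, passed).
RESULTS = [
    ("baseline", "tool_recovery", True),
    ("baseline", "tool_recovery", True),
    ("baseline", "final_answer", True),
    ("baseline", "final_answer", False),
    ("candidate", "tool_recovery", True),
    ("candidate", "tool_recovery", False),
    ("candidate", "final_answer", True),
    ("candidate", "final_answer", True),
]


def pass_rates(results: List[Tuple[str, str, bool]]) -> Dict[Tuple[str, str], float]:
    """Pass rate for each (configuration, scenario) pair."""
    totals = defaultdict(lambda: [0, 0])  # [passed, total]
    for config, scenario, passed in results:
        totals[(config, scenario)][0] += int(passed)
        totals[(config, scenario)][1] += 1
    return {key: p / n for key, (p, n) in totals.items()}


def regressions(results: List[Tuple[str, str, bool]], baseline: str, candidate: str) -> List[str]:
    """Scenarios where the candidate's pass rate is below the baseline's."""
    rates = pass_rates(results)
    scenarios = sorted({scenario for (_, scenario) in rates})
    return [
        s for s in scenarios
        if rates.get((candidate, s), 0.0) < rates.get((baseline, s), 0.0)
    ]


print(regressions(RESULTS, "baseline", "candidate"))  # ['tool_recovery']
```

Here "candidate" could be a prompt change, a harness change, or a different model. The per-scenario breakdown shows where it gains (`final_answer`) and where it regresses (`tool_recovery`).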


Adam Łucek@AdamRLucek·23 May

@palashshah @BraceSproul The sixth human sense is detecting ai slop


Palash Shah@palashshah·23 May

to everyone that's using ai to generate their tweets. we know. i know you think you're being lowkey. i know you might be switching around some words, and removing em dashes. but it's so obvious. anyone that has ever used an llm will immediately be able to tell.


Adam Łucek@AdamRLucek·22 May

@palashshah @Dhavalsingh7 Literally bar for bar


Palash Shah@palashshah·22 May

dude we're literally doing the same thing. @AdamRLucek is currently working on benchmarking one of our long running agents with deepseek flash as a subagent. and he saw basically the same thing; initially the costs weren't that different because deepseek was looping a ton. but after some iteration we've squeezed a lot of improvement out of it!


Dhaval singh@Dhavalsingh7·22 May

100% agree. I was tinkering with DeepSeek Flash for our subagents, ran our internal benchmark of 32 tasks, and it was 4x more expensive and DS did like 3x more steps. I saw the results, assumed the model is shit. Went back and looked at the traces, turns out it was bad at tool calling when the tool has optional args: it couldn't pass null for some argument in the tool and just went into a loop. Fixed that (created a simpler tool for it for the same task), added some more guidance. Same benchmark, half the cost and steps. It's not easy, but once you get the hang of it, I think it's worth it.
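The actual tool isn't shown in the thread, so the following is only a guess at what "a simpler tool for the same task" could look like: a wrapper that fixes the optional arguments so the model never has to pass null. `search_files`, `search_source`, and the file list are hypothetical.

```python
from typing import List, Optional

# Hypothetical in-memory file index standing in for a real workspace.
FILES = ["docs/readme.md", "src/app.py", "src/utils.py", "tests/test_app.py"]


def search_files(
    query: str,
    directory: Optional[str] = None,
    max_results: Optional[int] = None,
) -> List[str]:
    """Original tool: optional args the model must either set or pass as null."""
    hits = [
        path for path in FILES
        if query in path and (directory is None or path.startswith(directory))
    ]
    return hits if max_results is None else hits[:max_results]


def search_source(query: str) -> List[str]:
    """Simplified tool: one required argument, optional ones fixed by the harness."""
    return search_files(query, directory="src/", max_results=5)


print(search_source("app"))  # ['src/app.py']
```

The simplified tool gives up flexibility in exchange for a schema with a single required string, which is the trade described above.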

Palash Shah@palashshah

i feel like there's a general misunderstanding about open source models. most people use a frontier model, switch the api request to an open source model, see poor performance, and then churn off. this will never work. you have to spend the time to handhold these models in the tasks you're trying to accomplish. basically every coding agent that you use is tuned to output prompts in the format that these frontier models expect, and perform best on. if you invest the time to add custom prompting for these OSS models, you'll see the performance improve, but it'll never work out of the box.


Adam Łucek@AdamRLucek·22 May

@Vtrivedy10 @harborframework Literally the method for free on X


Viv@Vtrivedy10·22 May

On Evals - getting messages on “ok so how do I actually start learning this?”
there is no better way than by just doing, so you can copy this to Claude Code and get started today:
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right?
3. Now let’s actually run a Task using the built in Claude Code integration, it’s just a flag
4. Once that’s done let’s read the ATIF file that was produced together and help me understand what just happened. Did we pass the task? If not can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok let’s try to improve our agent by adjusting the prompt. And let’s rerun on a few tasks? Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together
…
Spend a few days reading a bunch of traces, actually running evals, understanding traces, internalizing agent failure modes, and being super in the loop of what the agent sees and does.
Have fun! Evals are super important, they don’t have to be scary. DM if I can help or just tweet out what you’re doing, someone will help I promise, we’re all learning


Adam Łucek@AdamRLucek·21 May

@bsampera97 Any tips?


bernat sampera@bsampera97·21 May

@AdamRLucek Classic context hierarchy conflict. When the orchestrator's instructions and the subagent's system prompt both claim authority, there's no scoping rule for which context wins. The fix is explicit precedence in the context stack.
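One way to read "explicit precedence in the context stack" is sketched below: each instruction is tagged with its source, and conflicts on the same key are resolved by a declared ordering. The `Instruction` type, the keys, and the choice that the hand-written system prompt outranks the orchestrator are all assumptions for illustration. The thread does not say which side should win.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instruction:
    source: str  # where the instruction came from
    key: str     # what it governs, e.g. "output_format"
    value: str


# Assumed ordering, lowest to highest priority. Swap to let the orchestrator win.
PRECEDENCE = ["orchestrator", "system_prompt"]


def resolve(instructions: List[Instruction], precedence: List[str] = PRECEDENCE) -> Dict[str, str]:
    """Apply instructions from lowest to highest priority so the highest wins."""
    rank = {source: i for i, source in enumerate(precedence)}
    resolved: Dict[str, str] = {}
    for ins in sorted(instructions, key=lambda i: rank[i.source]):
        resolved[ins.key] = ins.value  # higher-priority sources overwrite lower ones
    return resolved


stack = [
    Instruction("system_prompt", "output_format", "markdown report"),
    Instruction("orchestrator", "output_format", "single paragraph"),
    Instruction("orchestrator", "topic", "pricing research"),
]
print(resolve(stack))  # {'output_format': 'markdown report', 'topic': 'pricing research'}
```

Non-conflicting instructions such as `topic` pass through untouched, and only the contested key is decided by the ordering.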


Adam Łucek@AdamRLucek·21 May

Do agents listen to you… or themselves? While evaling subagent behavior in deep agent systems, we noticed an interesting quirk in our agents' alignment with hand-written system prompts vs. the instructions given by the orchestrator 1/4 🧵


Adam Łucek@AdamRLucek·21 May

@DylSwanepoel @hwchase17 What’s old is new and what’s new is old


Dylan Swanepoel@DylSwanepoel·21 May

@AdamRLucek @hwchase17 Agent infrastructure is quietly recreating the entire enterprise security stack from scratch, except this time the employee is software.


Adam Łucek@AdamRLucek·21 May

This is real agent security alpha 🔐

Harrison Chase@hwchase17

x.com/i/article/2057…
