Adam Łucek

250 posts

@AdamRLucek

i like making things | Applied AI @LangChain

NYC · Joined April 2017
226 Following · 543 Followers
Adam Łucek@AdamRLucek·
@Vtrivedy10 Reading the “LLMs are few shot learners” paper radicalized me
Viv@Vtrivedy10·
using a good Skill, a CLI, and seeing Codex’s in-context-learning ability is a magical experience
pointed it to the Harbor skills repo and the Prime Intellect CLI, gave it an objective of what we wanted to RL, and just watched it chug along figuring out the whole setup and debugging weird niche errors
us humans get the fun part of interpreting results, thinking through what’s happening, and deciding what to do next
agents training agents 🔥 humans guiding the process
Abhilekh Meda@Abhilekh_Meda·
@AdamRLucek Would love to read a detailed blog if you ever write one about this
Adam Łucek@AdamRLucek·
Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces capturing exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here are two of the ways we use traces to build evals for production agents 👇
Pothu@pothuLabs·
@AdamRLucek The "if you know what to do with it" is the whole catch. Most teams collect traces and never open them. The value isn't the data, it's the loop: failures become evals, evals catch the next regression. Skip the loop and it's just expensive logging.
Adam Łucek@AdamRLucek·
@novasarc01 Ah ok so this is testing whether the proposed eval actually catches the failure mode that it’s supposed to. Nice! Need to look deeper into user session coding traces. Agreed with the earlier point that standard benchmarks rarely reflect actual user experience
λux@novasarc01·
yeah exactly, “synthetic ground truth” is a pretty good way to describe part of it, but I’m using it more as a trajectory-level reference/counterfactual than as absolute ground truth.
the idea is: from a failed trace, Trace2Eval generates an eval that encodes what should have been invariant in the trajectory. for example:
- before the first implementation edit, the agent should have read the relevant test/source context
- after the last edit, it should have run a meaningful verification
- after a tool error, it should recover instead of continuing as if nothing happened
then i rerun the eval on two versions of the trajectory: the original trace and a symbolic counterfactual trace where I apply the minimal correction, like inserting READ_TEST before EDIT. if the original fails and the counterfactual passes, that tells me the eval is actually sensitive to the hypothesized causal decision point, not just randomly flagging the trace.
so I don’t treat it as “this proves the task would have succeeded.” it’s more like causal support at the trace level: did fixing the suspected local mistake remove the eval violation? over many traces that becomes useful for seeing which failure patterns are real/recurrent versus just noisy artifacts of agent behavior.
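A minimal sketch of the original-vs-counterfactual check described in the post above. It is not Trace2Eval's actual code: it assumes a trace has already been reduced to a list of action labels, and apart from READ_TEST and EDIT (which the post mentions) every label and function name here is hypothetical.

```python
from typing import Callable, List

Trace = List[str]  # a trace reduced to action labels (assumed representation)


def read_before_first_edit(trace: Trace) -> bool:
    """Invariant: a READ_TEST action must appear before the first EDIT."""
    if "EDIT" not in trace:
        return True  # nothing was edited, so the invariant holds trivially
    first_edit = trace.index("EDIT")
    return "READ_TEST" in trace[:first_edit]


def insert_read_before_edit(trace: Trace) -> Trace:
    """Minimal correction: insert READ_TEST right before the first EDIT."""
    if "EDIT" not in trace:
        return list(trace)
    i = trace.index("EDIT")
    return trace[:i] + ["READ_TEST"] + trace[i:]


def is_sensitive(
    eval_fn: Callable[[Trace], bool],
    original: Trace,
    fix: Callable[[Trace], Trace],
) -> bool:
    """True when the eval fails on the original and passes on the counterfactual."""
    return (not eval_fn(original)) and eval_fn(fix(original))


# Hypothetical failed trace: the agent edited without reading the tests first.
failed_trace = ["PLAN", "EDIT", "RUN_TESTS"]
print(is_sensitive(read_before_first_edit, failed_trace, insert_read_before_edit))
```

If `is_sensitive` returns False for a trace, the eval either did not flag the original or still flags the corrected version, so it says little about that particular decision point.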
Adam Łucek@AdamRLucek·
@novasarc01 Interesting! So you also generate what could be called, for lack of better terms, a "synthetic ground truth" based on what should be expected/have happened/ideal state with the eval. What does rerunning the eval on that tell you? Or do you use it more as a target/reference?
λux@novasarc01·
@AdamRLucek also outputs are of lower quality when the rate limit is just about to hit…i read somewhere this is termed as context anxiety…the model just tries to sum up everything…i guess better harness engineering would solve this.
Adam Łucek@AdamRLucek·
@LeoTava8 Felt on redesigning orchestration: it’s not always a prompt change but can be a limitation of the whole harness/environment itself
Leo Tavares@LeoTava8·
@AdamRLucek The underrated part: traces expose the gap between your intended control flow and what the agent actually does. That divergence is the most valuable signal for redesigning orchestration, not just debugging.
Adam Łucek@AdamRLucek·
@novasarc01 That’s sick! Any interesting evals/findings coming from Claude/codex traces? Have been ideating around what might actually be useful to know/considered a failure when it comes to analyzing personal coding agent usage
λux@novasarc01·
this is super cool! currently i am working on a similar project (trace2eval). it converts failed codex/claude agent traces into deterministic regression evals by normalizing raw logs into action timelines, detecting failure motifs like premature intervention/no verification/ignored tool errors, and extracting compact causal slices. i’m bullish on trace → eval because failed agent runs are too information-dense to leave as dead logs. i think turning them into evals makes the exact failure mechanism reusable.
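A rough sketch of the "normalize raw logs into an action timeline, then detect failure motifs" idea from the post above. The log schema (`type`, `error`), the action kinds, and the two motif rules are assumptions made for illustration, not trace2eval's real format or heuristics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Action:
    step: int
    kind: str        # e.g. "read", "edit", "tool_call", "verify" (hypothetical labels)
    ok: bool = True  # False when the raw event carried an error


def normalize(raw_events: List[Dict]) -> List[Action]:
    """Turn raw log dicts into an ordered action timeline."""
    return [
        Action(step=i, kind=e.get("type", "unknown"), ok=not e.get("error"))
        for i, e in enumerate(raw_events)
    ]


def detect_motifs(timeline: List[Action]) -> List[str]:
    """Flag two simple motifs: no verification and an ignored tool error."""
    motifs = []
    edits = [a.step for a in timeline if a.kind == "edit"]
    # No verification: no "verify" action after the last edit.
    if edits and not any(a.kind == "verify" and a.step > edits[-1] for a in timeline):
        motifs.append("no_verification")
    # Ignored tool error: a failed action immediately followed by an edit.
    for prev, nxt in zip(timeline, timeline[1:]):
        if not prev.ok and nxt.kind == "edit":
            motifs.append("ignored_tool_error")
            break
    return motifs


# Hypothetical raw log: a failing tool call followed straight away by an edit.
raw = [
    {"type": "tool_call", "error": "exit code 1"},
    {"type": "edit"},
]
print(detect_motifs(normalize(raw)))
```

Because the rules are plain functions over the timeline, the same failed trace always yields the same motifs, which is what makes the resulting checks usable as deterministic regression evals.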
Viv@Vtrivedy10·
Building agents (like software) is a deeply iterative process --> this is why we try to supply as much easy to use tooling and infra as possible so builders can start on their agent improvement journey asap
storing and understanding traces is the fastest onramp to understand agent behavior and detect where your agents are messing up today. things like:
- what are people asking my agents?
- what actions are resulting in failed tool calls?
- what kind of work is expensive to run today for our agents? can it be cheaper?
once you mine trace data, you can "do something" to improve your agents
- harness engineering
- making evals
- swapping models
- post-training
these are all levers we have to improve agent dev, but pretty much all of this is downstream of gathering and understanding traces at scale
ship a v1 --> turn on tracing --> understand the traces and errors --> experiments --> loop
Adam Łucek@AdamRLucek

Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces capturing exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here are two of the ways we use traces to build evals for production agents 👇
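A minimal sketch of the kind of trace mining described in the post above (which tools fail most, which work is most expensive). It assumes traces have already been flattened into one record per tool call. The field names (`task`, `tool`, `tool_ok`, `cost_usd`) and the sample values are hypothetical and not tied to any particular tracing product.

```python
from collections import Counter, defaultdict

# Hypothetical flattened trace records; field names and values are assumptions.
runs = [
    {"task": "summarize", "tool": "search", "tool_ok": True, "cost_usd": 0.02},
    {"task": "summarize", "tool": "search", "tool_ok": False, "cost_usd": 0.03},
    {"task": "refactor", "tool": "shell", "tool_ok": False, "cost_usd": 0.40},
    {"task": "refactor", "tool": "shell", "tool_ok": False, "cost_usd": 0.35},
]

# Which tools account for the most failed calls?
failed_calls = Counter(r["tool"] for r in runs if not r["tool_ok"])

# Which kind of work costs the most in total?
cost_by_task = defaultdict(float)
for r in runs:
    cost_by_task[r["task"]] += r["cost_usd"]

most_expensive_task = max(cost_by_task, key=cost_by_task.get)

print(failed_calls.most_common())
print(most_expensive_task)
```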

Adam Łucek@AdamRLucek·
So… turns out SWE hasn't escaped testing with AI, rather it's more important than ever!
Adam Łucek@AdamRLucek·
Of course, combinations of these evals can cover a wide range of behaviors, scenarios, and edge cases. With both end-to-end and behavioral coverage, the eval suite can be used in some unique ways. The obvious one is traditional regression testing: making sure a change to your prompt or harness doesn't break existing behavior. But more interestingly, these evals can also serve as targets for optimization. For example, a good suite can show strengths and weaknesses across model families, and pinpoint exactly where, say, a prompting change may let an open source model perform as well as a frontier closed model in a given scenario.
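A small sketch of both uses mentioned above, regression checking and comparing configurations, assuming each eval run has been reduced to a pass/fail result per eval name. The eval names and results are made up for illustration.

```python
from typing import Dict, List

Results = Dict[str, bool]  # eval name -> passed


def regressions(baseline: Results, candidate: Results) -> List[str]:
    """Evals that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        name for name, ok in baseline.items() if ok and not candidate.get(name, False)
    )


def pass_rate(results: Results) -> float:
    """Fraction of evals that passed."""
    return sum(results.values()) / len(results)


# Hypothetical results for the same agent running on two different models.
frontier = {"end_to_end": True, "uses_search_tool": True, "asks_before_delete": True}
open_model = {"end_to_end": True, "uses_search_tool": False, "asks_before_delete": True}

print(regressions(frontier, open_model))
print(pass_rate(frontier))
```

The list returned by `regressions` is the useful part for optimization: it names the specific behaviors where a prompt or harness change would need to close the gap.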
Palash Shah@palashshah·
to everyone that's using ai to generate their tweets. we know. i know you think you're being lowkey. i know you might be switching around some words, and removing em dashes. but it's so obvious. anyone that has ever used an llm will immediately be able to tell.
Palash Shah@palashshah·
dude we're literally doing the same thing. @AdamRLucek is currently working on benchmarking one of our long running agents with deepseek flash as a subagent. and he saw basically the same thing; initially the costs weren't that different because deepseek was looping a ton. but after some iteration we've squeezed a lot of improvement out of it!
Dhaval singh@Dhavalsingh7·
100% agree. I was tinkering with deep seek flash for our sub agents, ran our internal benchmark of 32 tasks and it was 4x more expensive and ds did like 3x more steps. I saw the results, assumed the model is shit. Went back and looked at the traces, turns out it was bad at tool calling when the tool has optional args: it couldn't pass null for some argument in the tool and just went into a loop. Fixed that (created a simpler tool for it for the same task), added some more guidance. Same benchmark, half the cost and steps. It's not easy, but once you get the hang of it, i think it's worth it.
Palash Shah@palashshah

i feel like there's a general misunderstanding about open source models. most people use a frontier model, switch the api request to open source model, see poor performance, and then churn off. this will never work. you have to spend the time to handhold these models in the tasks you're trying to accomplish. basically every coding agent that you use is tuned to output prompts in the format that these frontier models expect, and perform best on. if you invest the time to add custom prompting for these OSS models, you'll see the improved performance, but it'll never work out of the box.
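An illustrative sketch of the "simpler tool for the same task" fix mentioned in the reply above, assuming the problem tool looked something like a search function with nullable optional arguments. The tool names, parameters, and file list are all hypothetical. The point is only the shape of the fix: wrap the flexible tool in one with a single required argument.

```python
from typing import List, Optional


def search_files(
    query: str,
    directory: Optional[str] = None,
    max_results: Optional[int] = None,
) -> List[str]:
    """Original-style tool: optional arguments the caller may need to pass as null."""
    files = ["src/app.py", "src/utils.py", "tests/test_app.py"]  # stand-in data
    hits = [
        f for f in files
        if query in f and (directory is None or f.startswith(directory))
    ]
    return hits if max_results is None else hits[:max_results]


def search_all_files(query: str) -> List[str]:
    """Simpler tool for the same task: one required argument, no nulls to pass."""
    return search_files(query)


print(search_all_files("app"))
```

Exposing only `search_all_files` to the weaker model removes the need to emit null arguments at all, at the cost of some flexibility.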

Viv@Vtrivedy10·
On Evals - getting messages on “ok so how do I actually start learning this?”
there is no better way than by just doing, so you can copy this to Claude Code and get started today
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right
3. Now let’s actually run a Task using the built-in Claude Code integration, it’s just a flag
4. Once that’s done let’s read the ATIF file that was produced together and help me understand what just happened. Did we pass the task? If not can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok let’s try to improve our agent by adjusting the prompt. And let’s rerun on a few tasks? Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together
…
Spend a few days reading a bunch of traces, actually running evals, understanding traces, internalizing agent failure modes, and being super in the loop of what the agent sees and does
Have fun! Evals are super important, they don’t have to be scary.
DM if I can help or just tweet out what you’re doing, someone will help I promise, we’re all learning
bernat sampera@bsampera97·
@AdamRLucek Classic context hierarchy conflict. When the orchestrator's instructions and the subagent's system prompt both claim authority, there's no scoping rule for which context wins. The fix is explicit precedence in the context stack.
Adam Łucek@AdamRLucek·
Do agents listen to you… or themselves? While evaling subagent behavior in deep agent systems, we noticed an interesting quirk in our agents' alignment with hand-written system prompts vs. the instructions given by the orchestrator 1/4 🧵
Dylan Swanepoel@DylSwanepoel·
@AdamRLucek @hwchase17 Agent infrastructure is quietly recreating the entire enterprise security stack from scratch, except this time the employee is software.