Adam Łucek

257 posts

@AdamRLucek

i like making things | Applied AI @LangChain

NYC Joined April 2017
227 Following 556 Followers
Adam Łucek
Adam Łucek@AdamRLucek·
@fantopy Lowkey fear it may literally be folks asking Claude to write copy
English
0
0
0
9
Fantopy
Fantopy@fantopy·
@AdamRLucek nah you're spot on, half these ads read like they were written by someone who's never talked to a real person. just say what it does lol
English
1
0
1
4
Adam Łucek
Adam Łucek@AdamRLucek·
One personal gripe I have with current AI product advertising is that many displays/billboards seem strangely… verbose? Like lots of shoehorned text awkwardly worldbuilding niche scenarios to get their use case across. Am I just not the target audience? Does this not seem counterintuitive to traditional brand/product marketing?
Adam Łucek tweet media
English
1
0
2
77
Palash Shah
Palash Shah@palashshah·
a good job will make you feel like you know nothing internally, when you are actually considered an expert externally
English
6
0
24
786
Adel Bucetta
Adel Bucetta@adelbucetta·
@AdamRLucek the reason most people don't appreciate the complexity of our agents is that they assume the magic happens at training, not deployment
English
1
0
1
29
Adam Łucek
Adam Łucek@AdamRLucek·
@BraceSproul Actually? That’s the smoking gun. And genuinely a belt and suspenders insight
English
0
0
3
63
Brace
Brace@BraceSproul·
i see "genuinely" generated a lot by the newer models (2026->). i feel like i never used to see it, but now it comes up all the time... who sold the "genuinely" dataset lol
English
1
0
11
563
Viv
Viv@Vtrivedy10·
@AdamRLucek chadam cooked super hard on this one 🐐
English
1
0
3
748
Adam Łucek
Adam Łucek@AdamRLucek·
@Vtrivedy10 Reading the “LLMs are few shot learners” paper radicalized me
English
1
0
4
584
Viv
Viv@Vtrivedy10·
using a good Skill, a CLI, and seeing Codex’s in-context-learning ability is a magical experience
point it to Harbor skills repo, Prime Intellect CLI, gave it an objective of what we wanted to RL and just watched it chug along figuring out the whole setup and debugging weird niche errors
us humans get the fun part of interpreting results, thinking through what’s happening, and deciding what to do next
agents training agents 🔥 humans guiding the process
English
10
13
128
7.7K
Abhilekh Meda
Abhilekh Meda@Abhilekh_Meda·
@AdamRLucek Would love to read a detailed blog if you ever write one about this
English
1
0
1
91
Adam Łucek
Adam Łucek@AdamRLucek·
Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces capturing exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here's two of the ways we use traces to build evals for production agents 👇
Adam Łucek tweet media
English
12
22
154
41.7K
Pothu
Pothu@pothuLabs·
@AdamRLucek The "if you know what to do with it" is the whole catch. Most teams collect traces and never open them. The value isn't the data, it's the loop: failures become evals, evals catch the next regression. Skip the loop and it's just expensive logging.
English
1
0
1
177
Adam Łucek
Adam Łucek@AdamRLucek·
@novasarc01 Ah ok so this is testing whether the proposed eval actually catches the failure mode that it’s supposed to. Nice! Need to look deeper into user session coding traces- agreed with the earlier point that standard benchmarks rarely reflect actual user experience
English
1
0
2
77
λux
λux@novasarc01·
yeah exactly, “synthetic ground truth” is a pretty good way to describe part of it, but I’m using it more as a trajectory-level reference/counterfactual than as absolute ground truth. the idea is…from a failed trace, Trace2Eval generates an eval that encodes what should have been invariant in the trajectory. for example before the first implementation edit, the agent should have read the relevant test/source context; after the last edit, it should have run a meaningful verification; after a tool error, it should recover instead of continuing as if nothing happened. then i rerun the eval on two versions of the trajectory: the original trace and a symbolic counterfactual trace where I apply the minimal correction like inserting READ_TEST before EDIT. if the original fails and the counterfactual passes that tells me the eval is actually sensitive to the hypothesized causal decision point…not just randomly flagging the trace. so I don’t treat it as “this proves the task would have succeeded.” it’s more like causal support at the trace level…like did fixing the suspected local mistake remove the eval violation? over many traces that becomes useful for seeing which failure patterns are real/recurrent versus just noisy artifacts of agent behavior.
English
1
0
1
112
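A minimal sketch of the counterfactual check described in the post above, assuming a trace is just a list of action labels. The names `READ_TEST`, `EDIT`, `RUN_TESTS`, and every function here are hypothetical illustrations, not Trace2Eval's actual API.

```python
from typing import Callable, List

# Assumption: a trace is a flat action timeline such as ["EDIT", "RUN_TESTS"].
Trace = List[str]


def read_before_first_edit(trace: Trace) -> bool:
    """Invariant: READ_TEST must appear somewhere before the first EDIT."""
    if "EDIT" not in trace:
        return True  # nothing was edited, so the invariant holds trivially
    first_edit = trace.index("EDIT")
    return "READ_TEST" in trace[:first_edit]


def insert_read_before_first_edit(trace: Trace) -> Trace:
    """Minimal symbolic correction: insert READ_TEST right before the first EDIT."""
    if "EDIT" not in trace:
        return list(trace)
    i = trace.index("EDIT")
    return trace[:i] + ["READ_TEST"] + trace[i:]


def eval_is_sensitive(
    trace: Trace,
    invariant: Callable[[Trace], bool],
    correction: Callable[[Trace], Trace],
) -> bool:
    """True when the original trace fails the eval and the corrected one passes."""
    return (not invariant(trace)) and invariant(correction(trace))


failed_trace = ["EDIT", "RUN_TESTS"]
clean_trace = ["READ_TEST", "EDIT", "RUN_TESTS"]

print(eval_is_sensitive(failed_trace, read_before_first_edit, insert_read_before_first_edit))
print(eval_is_sensitive(clean_trace, read_before_first_edit, insert_read_before_first_edit))
```

The first call reports that the eval reacts to the suspected decision point; the second shows that a trace which already satisfies the invariant gives no such signal. As the post notes, this is support at the trace level, not proof the task would have succeeded.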
Adam Łucek
Adam Łucek@AdamRLucek·
@novasarc01 Interesting! So you also generate what could be called, for lack of better terms, a "synthetic ground truth" based on what should be expected/have happened/ideal state with the eval. What does rerunning the eval on that tell you? Or do you use it more as a target/reference?
English
1
0
1
122
λux
λux@novasarc01·
@AdamRLucek also outputs are of lower quality when the rate limit is just about to hit…i read somewhere this is termed as context anxiety…the model just tries to sum up everything…i guess better harness engineering would solve this.
English
1
0
1
128
Adam Łucek
Adam Łucek@AdamRLucek·
@LeoTava8 Felt on redesigning orchestration- it’s not always a prompt change but can be a limitation of the whole harness/environment itself
English
0
0
0
327
Leo Tavares
Leo Tavares@LeoTava8·
@AdamRLucek The underrated part: traces expose the gap between your intended control flow and what the agent actually does. That divergence is the most valuable signal for redesigning orchestration, not just debugging.
English
1
0
4
415
Adam Łucek
Adam Łucek@AdamRLucek·
@novasarc01 That’s sick! Any interesting evals/findings coming from Claude/codex traces? Have been ideating around what might actually be useful to know/considered a failure when it comes to analyzing personal coding agent usage
English
1
0
1
512
λux
λux@novasarc01·
this is super cool! currently i am working on a similar project (trace2eval)…it converts failed codex/claude agent traces into deterministic regression evals by normalizing raw logs into action timelines…detecting failure motifs like premature intervention/no verification/ignored tool errors and extracting compact causal slices…i’m bullish on trace → eval bcoz failed agent runs are too information-dense to leave as dead logs. i think turning them into evals makes the exact failure mechanism reusable.
λux tweet media
English
3
0
16
1.1K
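A rough sketch of the trace → eval idea in the post above: raw log events are normalized into an action timeline, then scanned for failure motifs. The event schema, the tool names in `TOOL_TO_ACTION`, and the exact motif definitions are assumptions made for illustration; the real project may define them differently.

```python
from typing import Dict, List

# Assumption: hypothetical tool names mapped to coarse action labels.
TOOL_TO_ACTION = {"read_file": "READ", "apply_patch": "EDIT", "run_tests": "VERIFY"}


def normalize(raw_events: List[Dict]) -> List[str]:
    """Collapse raw log events into a compact action timeline."""
    timeline = []
    for event in raw_events:
        if event.get("error"):
            timeline.append("TOOL_ERROR")
        else:
            timeline.append(TOOL_TO_ACTION.get(event["tool"], "OTHER"))
    return timeline


def detect_motifs(timeline: List[str]) -> List[str]:
    """Deterministically flag a few failure motifs in an action timeline."""
    motifs = []
    if "EDIT" in timeline:
        first_edit = timeline.index("EDIT")
        last_edit = len(timeline) - 1 - timeline[::-1].index("EDIT")
        # Editing before reading any context (assumed meaning of "premature intervention").
        if "READ" not in timeline[:first_edit]:
            motifs.append("premature_intervention")
        # No verification step after the last edit.
        if "VERIFY" not in timeline[last_edit + 1:]:
            motifs.append("no_verification")
    # A tool error followed directly by another edit, as if nothing happened.
    if any(a == "TOOL_ERROR" and b == "EDIT" for a, b in zip(timeline, timeline[1:])):
        motifs.append("ignored_tool_error")
    return motifs


failed_run = [
    {"tool": "apply_patch"},
    {"tool": "run_tests", "error": "exit code 1"},
    {"tool": "apply_patch"},
]
clean_run = [{"tool": "read_file"}, {"tool": "apply_patch"}, {"tool": "run_tests"}]

print(normalize(failed_run), detect_motifs(normalize(failed_run)))
print(normalize(clean_run), detect_motifs(normalize(clean_run)))
```

Because the checks are plain functions over a timeline, each detected motif can be saved alongside the trace and rerun as a regression eval after a prompt or harness change.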
Viv
Viv@Vtrivedy10·
Building agents (like software) is a deeply iterative process --> this is why we try to supply as much easy to use tooling and infra as possible so builders can start on their agent improvement journey asap
storing and understanding traces is the fastest onramp to understand agent behavior and detect where your agents are messing up today. things like:
- what are people asking my agents?
- what actions are resulting in failed tool calls?
- what kind of work is expensive to run today for our agents? can it be cheaper?
once you mine trace data, you can "do something" to improve your agents
- harness engineering
- making evals
- swapping models
- post-training
these are all levers we have to improve agent dev, but pretty much all of this is downstream of gathering and understanding traces at scale
ship a v1
turn on tracing
understand the traces and errors
experiments
loop
Adam Łucek@AdamRLucek

Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces capturing exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here's two of the ways we use traces to build evals for production agents 👇

English
7
9
74
9.2K
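The trace-mining questions in the post above (which actions fail, which work is expensive) can be sketched as a small aggregation. The trace schema here (`steps`, `tool`, `status`, `cost_usd`) is an assumption for illustration and not any particular tracing product's export format.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_tool_usage(traces: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate call counts, failures, and cost per tool across traces."""
    stats = defaultdict(lambda: {"calls": 0, "failed": 0, "cost_usd": 0.0})
    for trace in traces:
        for step in trace["steps"]:
            entry = stats[step["tool"]]
            entry["calls"] += 1
            entry["failed"] += int(step["status"] == "error")
            entry["cost_usd"] += step.get("cost_usd", 0.0)
    return dict(stats)


# Hypothetical trace records, made up for this example.
traces = [
    {"steps": [
        {"tool": "search", "status": "ok", "cost_usd": 0.5},
        {"tool": "sql_query", "status": "error", "cost_usd": 0.25},
    ]},
    {"steps": [
        {"tool": "sql_query", "status": "ok", "cost_usd": 0.25},
    ]},
]

summary = summarize_tool_usage(traces)
for tool, entry in sorted(summary.items()):
    failure_rate = entry["failed"] / entry["calls"]
    print(tool, entry["calls"], failure_rate, entry["cost_usd"])
```

Sorting a summary like this by failure rate or cost is one simple way to decide where a harness change, a new eval, or a model swap is most likely to pay off.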
Adam Łucek
Adam Łucek@AdamRLucek·
So… turns out SWE hasn't escaped testing with AI, rather it's more important than ever!
English
0
0
5
748
Adam Łucek
Adam Łucek@AdamRLucek·
Of course, combinations of these evals can cover a wide range of behaviors, scenarios, and edge cases. With both end-to-end and behavioral coverage, the eval suite can be used in some unique ways. The obvious one is traditional regression testing: making sure a change to your prompt or harness doesn't break existing behavior. But more interestingly, these evals can also serve as targets for optimization. I.e. a good suite can show strengths and weaknesses across model families, and pinpoint exactly where, say, a prompting change may let an open source model perform as well as a frontier closed model in a given scenario.
English
2
0
5
865
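As a sketch of the regression-testing use described in the post above: run the same eval suite against a baseline and a candidate agent, then list the evals that passed before and fail now. The agents here are stub functions and the eval names are hypothetical; a real suite would call an actual agent and use richer checks.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]        # assumption: an agent maps a prompt to a reply
Eval = Callable[[Agent], bool]      # assumption: an eval returns pass/fail for an agent


def run_suite(agent: Agent, suite: Dict[str, Eval]) -> Dict[str, bool]:
    """Run every eval against one agent and record pass/fail by name."""
    return {name: bool(check(agent)) for name, check in suite.items()}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Evals that passed on the baseline but fail on the candidate."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


suite: Dict[str, Eval] = {
    "greets_user": lambda agent: "hello" in agent("say hello").lower(),
    "empty_input_stays_empty": lambda agent: agent("") == "",
}


def baseline_agent(prompt: str) -> str:
    return prompt  # stub: echoes the prompt


def candidate_agent(prompt: str) -> str:
    return prompt.upper() if prompt else "?"  # stub: changed behavior on empty input


baseline_results = run_suite(baseline_agent, suite)
candidate_results = run_suite(candidate_agent, suite)
print(regressions(baseline_results, candidate_results))
```

The same pass/fail table, computed per model or per prompt variant, is what makes the suite usable as an optimization target: it shows which scenarios a change fixes and which it breaks.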