Adam Łucek (@AdamRLucek) - Twitter Profile

Adam Łucek@AdamRLucek·5m

@fantopy Lowkey fear it may literally be folks asking Claude to write copy

Fantopy@fantopy·18m

@AdamRLucek nah you're spot on, half these ads read like they were written by someone who's never talked to a real person. just say what it does lol

Adam Łucek@AdamRLucek·7h

One personal gripe I have with current AI product advertising is that many displays/billboards seem strangely… verbose? Like lots of shoehorned text awkwardly worldbuilding niche scenarios to get their use case across. Am I just not the target audience? Does this not seem counterintuitive to traditional brand/product marketing?

Adam Łucek@AdamRLucek·6h

@palashshah Dawg if you know nothing then what the hell do I know

Palash Shah@palashshah·6h

a good job will make you feel like you know nothing internally, when you are actually considered an expert externally

Adam Łucek@AdamRLucek·18h

@adelbucetta Magic everywhere lowkey

Adel Bucetta@adelbucetta·20h

@AdamRLucek the reason most people don't appreciate the complexity of our agents is that they assume the magic happens at training, not deployment

Adam Łucek@AdamRLucek·1d

One of the most technically impressive agents I’ve had the honor of working on 🚒

LangChain@LangChain

Stop manually triaging agent failures. Let LangSmith Engine fix it.

Adam Łucek@AdamRLucek·20h

@BraceSproul Actually? That’s the smoking gun. And genuinely a belt and suspenders insight

Brace@BraceSproul·21h

i see "genuinely" generated a lot by the newer models (2026->). i feel like i never used to see it, but now it comes up all the time... who sold the "genuinely" dataset lol

Adam Łucek@AdamRLucek·1d

@Vtrivedy10

Viv@Vtrivedy10·1d

@AdamRLucek chadam cooked super hard on this one 🐐

Adam Łucek@AdamRLucek·3d

@Vtrivedy10 Reading the “LLMs are few shot learners” paper radicalized me

Viv@Vtrivedy10·3d

using a good Skill, a CLI, and seeing Codex’s in-context-learning ability is a magical experience

pointed it to Harbor skills repo, Prime Intellect CLI, gave it an objective of what we wanted to RL and just watched it chug along figuring out the whole setup and debugging weird niche errors

us humans get the fun part of interpreting results, thinking through what’s happening, and deciding what to do next

agents training agents 🔥 humans guiding the process

Adam Łucek@AdamRLucek·6d

@Abhilekh_Meda Maybe soon 🤔

Abhilekh Meda@Abhilekh_Meda·6d

@AdamRLucek Would love to read a detailed blog if you ever write one about this

Adam Łucek@AdamRLucek·27 May

Trace data is literally worth its weight in gold these days, if you know what to do with it! As has been established, creating effective agents requires shipping early, observing behavior, and iterating quickly. At the core of this are your agent traces capturing exact inputs, outputs, steps, and metadata along the way. Analyzing traces helps surface inefficiencies and areas for improvement, but they can also be used in more sophisticated ways to set up robust evaluations. Here's two of the ways we use traces to build evals for production agents 👇

Adam Łucek@AdamRLucek·6d

@pothuLabs A real missed opportunity for most!

Pothu@pothuLabs·6d

@AdamRLucek The "if you know what to do with it" is the whole catch. Most teams collect traces and never open them. The value isn't the data, it's the loop: failures become evals, evals catch the next regression. Skip the loop and it's just expensive logging.
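
The loop described here (a failure becomes an eval, the eval catches the next regression) can be sketched in a few lines of Python. This is only an illustration under assumptions: `EvalCase`, `run_suite`, the refund scenario, and both agent functions are hypothetical and not taken from any real tracing tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    agent_input: str              # input copied from the failed trace
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> List[str]:
    """Run every case and return the names of the ones that fail."""
    return [c.name for c in cases if not c.check(agent(c.agent_input))]


# Hypothetical failure seen in a production trace: the agent answered a
# refund question without mentioning the refund window.
cases = [
    EvalCase(
        name="refund_mentions_window",
        agent_input="Can I get a refund?",
        check=lambda output: "30 days" in output,
    )
]


def old_agent(prompt: str) -> str:
    # Stand-in for the agent version that produced the failed trace.
    return "Yes, refunds are available."


def new_agent(prompt: str) -> str:
    # Stand-in for the agent after a prompt or harness fix.
    return "Yes, refunds are available within 30 days of purchase."


old_failures = run_suite(old_agent, cases)
new_failures = run_suite(new_agent, cases)
print(old_failures, new_failures)
```

Keeping the case in the suite after the fix is what turns a one-off debugging session into a regression check.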

Adam Łucek@AdamRLucek·6d

@novasarc01 Ah ok so this is testing whether the proposed eval actually catches the failure mode that it’s supposed to. Nice! Need to look deeper into user session coding traces- agreed with the earlier point that standard benchmarks rarely reflect actual user experience

λux@novasarc01·6d

yeah exactly, “synthetic ground truth” is a pretty good way to describe part of it, but I’m using it more as a trajectory-level reference/counterfactual than as absolute ground truth.

the idea is…from a failed trace, Trace2Eval generates an eval that encodes what should have been invariant in the trajectory. for example before the first implementation edit, the agent should have read the relevant test/source context; after the last edit, it should have run a meaningful verification; after a tool error, it should recover instead of continuing as if nothing happened.

then i rerun the eval on two versions of the trajectory: the original trace and a symbolic counterfactual trace where I apply the minimal correction like inserting READ_TEST before EDIT. if the original fails and the counterfactual passes that tells me the eval is actually sensitive to the hypothesized causal decision point…not just randomly flagging the trace.

so I don’t treat it as “this proves the task would have succeeded.” it’s more like causal support at the trace level…like did fixing the suspected local mistake remove the eval violation? over many traces that becomes useful for seeing which failure patterns are real/recurrent versus just noisy artifacts of agent behavior.
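
A minimal sketch of the sensitivity check described above, assuming a trace has already been normalized to a list of action labels. The labels READ_TEST and EDIT come from the post; the function names and the example trace are hypothetical and are not Trace2Eval's actual code.

```python
from typing import Callable, List

Trace = List[str]  # assumed normalized action timeline, e.g. ["PLAN", "EDIT"]


def read_before_first_edit(trace: Trace) -> bool:
    """Invariant: a READ_TEST action must appear before the first EDIT."""
    if "EDIT" not in trace:
        return True  # nothing was edited, so the invariant holds trivially
    first_edit = trace.index("EDIT")
    return "READ_TEST" in trace[:first_edit]


def insert_read_before_edit(trace: Trace) -> Trace:
    """Minimal symbolic correction: insert READ_TEST right before the first EDIT."""
    if "EDIT" not in trace:
        return list(trace)
    i = trace.index("EDIT")
    return trace[:i] + ["READ_TEST"] + trace[i:]


def is_sensitive(eval_fn: Callable[[Trace], bool],
                 original: Trace,
                 counterfactual: Trace) -> bool:
    """True when the original trace fails the eval and the corrected one passes."""
    return (not eval_fn(original)) and eval_fn(counterfactual)


failed_trace = ["PLAN", "EDIT", "RUN_TESTS"]
fixed_trace = insert_read_before_edit(failed_trace)
sensitive = is_sensitive(read_before_first_edit, failed_trace, fixed_trace)
print(fixed_trace, sensitive)
```

As the post says, a positive result only shows the eval reacts to the suspected decision point. It does not show the task would have succeeded.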

Adam Łucek@AdamRLucek·6d

@novasarc01 Interesting! So you also generate what could be called, for lack of a better term, a "synthetic ground truth" based on what should be expected/have happened/the ideal state with the eval. What does rerunning the eval on that tell you? Or do you use it more as a target/reference?

λux@novasarc01·6d

@AdamRLucek also outputs are of lower quality when the rate limit is just about to hit…i read somewhere this is termed as context anxiety…the model just tries to sum up everything…i guess better harness engineering would solve this.

Adam Łucek@AdamRLucek·6d

@MikroJaxi @LangChain Benchmark from a dataset of production mistakes 🤔🤔🤔

Kailash@MikroJaxi·6d

@LangChain @AdamRLucek The real dataset for agents isn't benchmarks. It's production mistakes.

LangChain@LangChain·27 May

.@AdamRLucek on how we use traces to build evals for production agents.

Adam Łucek@AdamRLucek·6d

@LeoTava8 Felt on redesigning orchestration- it’s not always a prompt change but can be a limitation of the whole harness/environment itself

Leo Tavares@LeoTava8·6d

@AdamRLucek The underrated part: traces expose the gap between your intended control flow and what the agent actually does. That divergence is the most valuable signal for redesigning orchestration, not just debugging.
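
One rough way to surface that gap is to diff the intended step sequence against the steps recorded in a trace. The sketch below uses Python's standard `difflib`. The step labels and both sequences are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def divergences(intended: List[str], actual: List[str]) -> List[Tuple[str, List[str], List[str]]]:
    """Return (kind, intended_steps, actual_steps) for every non-matching span."""
    matcher = SequenceMatcher(None, intended, actual)
    return [
        (tag, intended[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


intended_flow = ["PLAN", "READ", "EDIT", "VERIFY"]  # what the harness was designed to do
actual_flow = ["PLAN", "EDIT", "VERIFY"]            # what the (hypothetical) trace shows

gaps = divergences(intended_flow, actual_flow)
print(gaps)
```

Here the diff reports that the READ step was skipped. If the same gap recurs across many traces, that suggests a harness-level issue, not a one-off bug.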

Adam Łucek@AdamRLucek·6d

@novasarc01 That’s sick! Any interesting evals/findings coming from Claude/codex traces? Have been ideating around what might actually be useful to know/considered a failure when it comes to analyzing personal coding agent usage

λux@novasarc01·6d

this is super cool! currently i am working on a similar project (trace2eval)…it converts failed codex/claude agent traces into deterministic regression evals by normalizing raw logs into action timelines…detecting failure motifs like premature intervention/no verification/ignored tool errors and extracting compact causal slices…i’m bullish on trace → eval bcoz failed agent runs are too information-dense to leave as dead logs. i think turning them into evals makes the exact failure mechanism reusable.
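
As a rough illustration of the pipeline described above (raw logs, then an action timeline, then failure motifs), here is a small Python sketch. The event format, the label mapping, and the motif definitions are assumptions for the example and not trace2eval's real implementation. Only two of the named motifs are covered, and premature intervention is left out.

```python
from typing import Dict, List

# Assumed raw log format: one dict per event with "tool" and "status" keys.
RawEvent = Dict[str, str]

# Assumed mapping from tool names to coarse action labels.
ACTION_LABELS = {
    "read_file": "READ",
    "apply_patch": "EDIT",
    "run_tests": "VERIFY",
}


def normalize(events: List[RawEvent]) -> List[str]:
    """Turn raw events into an action timeline; failed calls become TOOL_ERROR."""
    timeline = []
    for event in events:
        if event.get("status") == "error":
            timeline.append("TOOL_ERROR")
        else:
            timeline.append(ACTION_LABELS.get(event["tool"], "OTHER"))
    return timeline


def detect_motifs(timeline: List[str]) -> List[str]:
    """Return the names of the failure motifs found in the timeline."""
    motifs = []
    if "EDIT" in timeline:
        # Index of the last EDIT; no VERIFY after it counts as "no verification".
        last_edit = len(timeline) - 1 - timeline[::-1].index("EDIT")
        if "VERIFY" not in timeline[last_edit + 1:]:
            motifs.append("no_verification")
    # Here an "ignored tool error" is an error immediately followed by an EDIT.
    for current, following in zip(timeline, timeline[1:]):
        if current == "TOOL_ERROR" and following == "EDIT":
            motifs.append("ignored_tool_error")
            break
    return motifs


events = [
    {"tool": "read_file", "status": "ok"},
    {"tool": "run_tests", "status": "error"},
    {"tool": "apply_patch", "status": "ok"},
]
timeline = normalize(events)
motifs = detect_motifs(timeline)
print(timeline, motifs)
```

Because both functions are deterministic, rerunning them on the same log yields the same motifs every time, which is what makes the result usable as a regression eval.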

Adam Łucek@AdamRLucek·27 May

@Vtrivedy10 It's literally that easy 📈📈📈

Viv@Vtrivedy10·27 May

Building agents (like software) is a deeply iterative process --> this is why we try to supply as much easy to use tooling and infra as possible so builders can start on their agent improvement journey asap

storing and understanding traces is the fastest onramp to understand agent behavior and detect where your agents are messing up today. things like:
- what are people asking my agents?
- what actions are resulting in failed tool calls?
- what kind of work is expensive to run today for our agents? can it be cheaper?

once you mine trace data, you can "do something" to improve your agents
- harness engineering
- making evals
- swapping models
- post-training

these are all levers we have to improve agent dev, but pretty much all of this is downstream of gathering and understanding traces at scale

ship a v1
turn on tracing
understand the traces and errors
experiments
loop
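
Two of the questions above (which actions produce failed tool calls, and which work is expensive) can be answered with simple aggregation once traces are exported. The sketch below assumes a flattened record format with hypothetical field names. Real tracing tools export richer data.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical flattened trace records, with cost tracked in integer cents.
runs = [
    {"task": "summarize", "tool": "search", "ok": True, "cost_cents": 2},
    {"task": "summarize", "tool": "fetch_url", "ok": False, "cost_cents": 1},
    {"task": "code_fix", "tool": "run_tests", "ok": False, "cost_cents": 40},
    {"task": "code_fix", "tool": "run_tests", "ok": True, "cost_cents": 35},
]


def failed_tool_calls(records: List[dict]) -> Counter:
    """Count failed calls per tool."""
    return Counter(r["tool"] for r in records if not r["ok"])


def cost_by_task(records: List[dict]) -> Dict[str, int]:
    """Sum the cost in cents for each task type."""
    totals: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["task"]] += r["cost_cents"]
    return dict(totals)


failures = failed_tool_calls(runs)
costs = cost_by_task(runs)
print(failures.most_common(), costs)
```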

Adam Łucek@AdamRLucek·27 May

So… turns out SWE hasn't escaped testing with AI, rather it's more important than ever!

Adam Łucek@AdamRLucek·27 May

Of course, combinations of these evals can cover a wide range of behaviors, scenarios, and edge cases. With both end-to-end and behavioral coverage, the eval suite can be used in some unique ways. The obvious one is traditional regression testing: making sure a change to your prompt or harness doesn't break existing behavior. But more interestingly, these evals can also serve as targets for optimization. For example, a good suite can show strengths and weaknesses across model families, and pinpoint exactly where, say, a prompting change may let an open source model perform as well as a frontier closed model in a given scenario.
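
Both uses can be sketched with per-scenario pass rates. In the Python example below, the configuration names, scenario names, and pass/fail outcomes are invented for illustration and are not measured results.

```python
from typing import Dict, List

# Hypothetical outcomes: configuration -> scenario -> pass/fail per eval case.
results: Dict[str, Dict[str, List[bool]]] = {
    "frontier_model": {
        "tool_use": [True, True, True, True],
        "long_context": [True, True, False, True],
    },
    "open_model_prompt_v1": {
        "tool_use": [True, False, True, False],
        "long_context": [True, True, False, True],
    },
    "open_model_prompt_v2": {
        "tool_use": [True, True, True, True],
        "long_context": [True, False, False, True],
    },
}


def pass_rate(outcomes: List[bool]) -> float:
    return sum(outcomes) / len(outcomes)


def regressions(before: str, after: str) -> List[str]:
    """Scenarios where the pass rate dropped between two configurations."""
    return sorted(
        s for s in results[before]
        if pass_rate(results[after][s]) < pass_rate(results[before][s])
    )


def matches_reference(candidate: str, reference: str) -> List[str]:
    """Scenarios where the candidate does at least as well as the reference."""
    return sorted(
        s for s in results[reference]
        if pass_rate(results[candidate][s]) >= pass_rate(results[reference][s])
    )


regressed = regressions("open_model_prompt_v1", "open_model_prompt_v2")
matched = matches_reference("open_model_prompt_v2", "frontier_model")
print(regressed, matched)
```

In this made-up data the prompt change closes the gap on `tool_use` but regresses `long_context`, which is the kind of tradeoff a per-scenario breakdown is meant to expose.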
