Annabell Schaefer

127 posts


@annabellschfr

thinking about agents @langfuse

Berlin, San Francisco · Joined April 2023
361 Following · 177 Followers
Pinned Tweet
Annabell Schaefer @annabellschfr
Had a great time sharing about observability and evals at AWS Summit in London today 🪢
Ghulam @ghulamio

@langfuse at AWS summit today

1 reply · 1 repost · 15 likes · 1.8K views
Annabell Schaefer @annabellschfr
@geraldrsterling And beyond that, you can also get a lot of signal from interactions with humans/the real world! Users disagreeing or ranting, suddenly raging in all caps, or silently abandoning the session are all great places to start investigating what happened
0 replies · 0 reposts · 0 likes · 12 views
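The signals in this reply lend themselves to a trivial first-pass filter over conversation transcripts. A minimal sketch, assuming a simple `{"role", "content"}` message format; the thresholds and phrase list are made-up illustrations, not a feature of any product mentioned:

```python
import re

def frustration_signals(messages):
    """Heuristic flags for user frustration in a conversation.
    Toy heuristics with arbitrary thresholds (all assumptions):
    - "shouting": a user turn whose letters are >80% upper-case
    - "disagreement": explicit pushback words in a user turn
    - "abandonment": the transcript ends on an assistant turn, i.e.
      the user never came back (worth a timing check in practice)
    Returns a list of (signal, user_turn_index) pairs."""
    signals = []
    user_turns = [m for m in messages if m["role"] == "user"]
    for i, m in enumerate(user_turns):
        text = m["content"]
        letters = [c for c in text if c.isalpha()]
        if len(letters) >= 10 and sum(c.isupper() for c in letters) / len(letters) > 0.8:
            signals.append(("shouting", i))
        if re.search(r"\b(no|wrong|that's not|incorrect)\b", text, re.IGNORECASE):
            signals.append(("disagreement", i))
    if messages and messages[-1]["role"] == "assistant":
        signals.append(("abandonment", len(user_turns) - 1))
    return signals
```

Flagged turns are starting points for investigation, not verdicts; the point is to surface traces worth a human look.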
Gerald Sterling @geraldrsterling
@annabellschfr Production traces are where agents stop being mythology. The useful signal is not average success. It is weird recoveries, skipped tool calls, and places where the model knew the next move but quietly walked into a rake.
1 reply · 0 reposts · 1 like · 16 views
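Gerald's "weird recoveries and skipped tool calls" can be turned into a review queue with a couple of shape-based rules. A sketch under an assumed minimal trace schema (`steps`, `success`, `type`, `name` are illustrative field names, not a real SDK's schema):

```python
def review_candidates(traces, expected_tools):
    """Select traces worth a human look based on shape, not average success.
    Rules (both assumptions for illustration):
    - "skipped_tool": none of the tools the task normally needs was called
    - "weird_recovery": an intermediate step errored yet the run succeeded
    Returns a list of (trace_id, reasons) pairs."""
    flagged = []
    for t in traces:
        called = {s["name"] for s in t["steps"] if s["type"] == "tool_call"}
        reasons = []
        if expected_tools and not (expected_tools & called):
            reasons.append("skipped_tool")
        if t["success"] and any(s.get("error") for s in t["steps"]):
            reasons.append("weird_recovery")
        if reasons:
            flagged.append((t["id"], reasons))
    return flagged
```

Both rules deliberately ignore the final score: a run that succeeded despite an error mid-way often reveals more than a clean failure.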
Annabell Schaefer @annabellschfr
Tracing is the foundation of any effort to level up your AI system. It's about logging the right info at the right time, so you can investigate when and how your system fails. Check out the second piece in our series 👇
Lotte @lotte_verheyden

x.com/i/article/2054…

0 replies · 0 reposts · 3 likes · 210 views
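The "right info at the right time" idea boils down to recording named spans with attributes, timing, and errors. A toy stand-in for a real tracing SDK (Langfuse, OpenTelemetry), just to show the shape of the data; none of these names are a real API:

```python
import contextlib
import time
import uuid

class Tracer:
    """Minimal trace recorder: each span captures a name, arbitrary
    attributes, wall-clock timing, and any exception that escaped."""
    def __init__(self):
        self.spans = []

    @contextlib.contextmanager
    def span(self, name, **attrs):
        record = {"id": uuid.uuid4().hex, "name": name, "attrs": attrs,
                  "start": time.time(), "error": None}
        try:
            yield record
        except Exception as e:
            record["error"] = repr(e)   # keep the failure in the trace
            raise                       # but do not swallow it
        finally:
            record["end"] = time.time()
            self.spans.append(record)
```

Wrapping each retrieval, model call, and tool call in its own span is what later makes "when and how did this fail" answerable from the trace alone.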
Prompt Assay · AI Primitives Workbench
@langfuse One thing I'd add to any loop like this: rubric drift. Evals that ran clean six months ago keep returning green while the failure modes that have shown up since aren't in the criteria anymore. Versioning the rubric as carefully as the prompt is the unsexy half.
1 reply · 0 reposts · 0 likes · 60 views
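Rubric drift can be guarded against mechanically: pin every eval result to a content hash of the rubric it was scored under, so green runs from different rubric versions are never compared directly. A minimal sketch (the function and field names are illustrative):

```python
import hashlib
import json

def rubric_version(rubric):
    """Content-hash a rubric so results are pinned to the exact
    criteria they were scored against."""
    blob = json.dumps(rubric, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def record_result(store, example_id, score, rubric):
    """Store a score together with the rubric version that produced it."""
    store.append({"example": example_id, "score": score,
                  "rubric_version": rubric_version(rubric)})
```

Any trend line over `score` should then filter on `rubric_version` first; a clean run under last quarter's criteria says nothing about failure modes added since.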
langfuse.com @langfuse
Building high-quality AI systems is hard. At Langfuse we see the best AI teams converging on a process to get complex AI systems to production. We call it the AI Engineering Loop. Check out the first piece of our series and find out more in our academy.
Annabell Schaefer @annabellschfr

x.com/i/article/2054…

1 reply · 4 reposts · 22 likes · 56.3K views
Annabell Schaefer reposted
langfuse.com @langfuse
Introducing Langfuse Academy. Our open take on the AI engineering lifecycle: tracing, monitoring, datasets, experiments, evaluation, and how the pieces connect. Link in comments.
langfuse.com tweet media
2 replies · 11 reposts · 38 likes · 352.3K views
Annabell Schaefer reposted
langfuse.com @langfuse
new langfuse.com, new brand. same mission: open source LLM engineering platform. s/o to @altalogy for the great work.
7 replies · 10 reposts · 80 likes · 22K views
Sam Altman @sama
people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)
876 replies · 289 reposts · 7.1K likes · 482.2K views
Viv @Vtrivedy10
Open Models Make Agentic Batch Processing Economically Viable

A lot of the world's work looks like "Do X for EVERY Y":
- read every trace
- respond to every email
- deep dive into every document
- enrich every lead

This is the domain of Agentic Batch Computing. Much of the world's work does not need peak frontier intelligence; it needs carefully shaped intelligence pointed at specific tasks. The holy grail here is having a tailored agent run on every single one of these data points and tasks.

The world is producing more data than ever before. To understand and process it at scale, we're going to have to point Intelligence and Compute at this. Open Models are a fantastic tool here:
- they're often an order of magnitude cheaper
- they can be finetuned (SFT & RL) to fit your exact task distribution and outperform frontier models for your task

As companies contend with rising AI costs, Open Models and specialized models become incredibly important for making sure there's good ROI on AI spend. Here to help as you let the token machine rip without busting the bank (and with better results) 🫡
Viv tweet media
10 replies · 15 reposts · 93 likes · 7.3K views
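The "Do X for EVERY Y" framing is just a loop with a cost tally, and the order-of-magnitude claim is a per-token price ratio. A sketch with made-up placeholder prices, not real model pricing:

```python
def batch_run(items, agent, price_per_1k_tokens):
    """Run an agent over every item ("Do X for EVERY Y") and tally cost.
    `agent` is any callable returning (result, tokens_used);
    `price_per_1k_tokens` is a placeholder, not a real price."""
    results, total_tokens = [], 0
    for item in items:
        result, tokens = agent(item)
        results.append(result)
        total_tokens += tokens
    cost = total_tokens / 1000 * price_per_1k_tokens
    return results, cost
```

With identical token counts, a model at a tenth of the per-token price makes the whole batch a tenth of the cost, which is the entire economic argument before finetuning even enters the picture.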
Viv @Vtrivedy10
Strong Opinions, Loosely Held on Agent + Harness Engineering:

1. You can outperform any default harness+model (including codex & claude code) on pretty much any Task by engineering the harness around it. Using the exact same model, curate prompts, tools, skills, and hooks for that Task. This harness-optimization process is becoming much more agent-driven, with humans reviewing and curating evals/rewards to hill-climb on. "Just say what you want."
2. A "general purpose" agent/harness doesn't really exist; it's a tradeoff between time spent customizing the agent and performance (cost, latency, accuracy) on a Task. I don't exactly follow what "general purpose" means, tbh. Who decides what's general and what's not?
3. But if the "general purpose" agent/harness existed, it would look like a good coding agent.
4. Building a Task-specific harness will most likely converge to good prompt & tool design (probably packaged up as a Skill) as models become smarter and better at in-context learning.
5. Evals are a moat, and thus data to produce evals is a moat. Especially true for vertical agent companies. This is because agents can fit to most eval sets today. If evals measurably encode all the good behavior your agent needs to do, then this signal can be hill-climbed to improve your agent.
6. Frontier closed models are far too expensive for the large majority of tasks the world needs to do. As teams start mapping costs to ROI, Open Model Harness Engineering will take off even more. It is almost always worth the investment to at least try for a potential 20x+ cost reduction.
7. A large chunk of design decisions around Task decomposition and context engineering exist solely because our usable context window is 50-100k. Agents that become excellent at breaking down tasks, applying compaction appropriately, and orchestrating subagents as sub-task workers will be the most delightful products for doing real work.
8. We're entering an Age of Unbundled (& Rebundled) Agents, where Subagents exposed as Tools do a ton of domain-specific work on behalf of an orchestrator agent. The Harness becomes a box that gets populated with the exact set of tools, skills, and subagents needed to solve that task or sub-task. Examples include WarpGrep (search), Chroma Context-1 (search), Nemotron 3 Omni (small multimodal), etc. Bespoke agents that rock at narrow tasks, orchestrated as tools. This also applies to software used by agents as tools via Skills, like Remotion or Blender. Different harnesses bundle together the tooling needed to complete that narrow task.

End of opinions; these may change by the time this tweet goes out, or I may double down and expand on these in an article.
51 replies · 70 reposts · 787 likes · 66.2K views
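The "harness as a box populated with the exact set of tools, skills, and subagents" idea can be made concrete as a small container assembled per task. All names, fields, and the example tool here are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """A harness as a box of task-specific packaging around one model:
    curated prompt, tools, and skills; the model itself is unchanged."""
    system_prompt: str
    tools: dict = field(default_factory=dict)   # name -> callable
    skills: list = field(default_factory=list)  # reusable instruction docs

    def with_tool(self, name, fn):
        self.tools[name] = fn
        return self

def build_harness(task):
    # Same base model everywhere; only the packaging changes per task.
    h = Harness(system_prompt=f"You solve exactly one task: {task}.")
    if task == "triage-email":
        # Hypothetical task-specific tool, purely for illustration.
        h.with_tool("label", lambda msg: "urgent" if "ASAP" in msg else "normal")
    return h
```

The tradeoff in point 2 lives in `build_harness`: every branch added there is time spent customizing, bought back as cost, latency, or accuracy on that one task.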
Annabell Schaefer @annabellschfr
@Vtrivedy10 @hwchase17 Yes, that's what I mean! Not trying to architect the whole thing at once, but adding layers/specific tools as needs arise
1 reply · 0 reposts · 1 like · 86 views
Viv @Vtrivedy10
Yes! Skill learning by using traces + evals, or good old-fashioned dogfooding, are great ways to tune a harness/agent on a task.

When you say "starting broad", did you mean starting narrow, as in a small set of harness primitives? I think that probably reflects our journey building deepagents: started with a very simple base primitive (create_agent ReAct loop) and layered on optional tooling to get capabilities the base harness lacked, like fs-ops, compaction, context offloading, memory, etc.
1 reply · 0 reposts · 0 likes · 588 views
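The journey described here (a simple base primitive plus optional layered tooling) can be sketched as capability wrappers around a single step function. Everything below is an illustrative sketch in that spirit, not the actual deepagents API:

```python
def create_agent(model_step, capabilities=()):
    """Minimal base loop: `model_step(state) -> state` runs one reasoning
    step; each capability is optional and wraps the step function."""
    step = model_step
    for wrap in capabilities:
        step = wrap(step)

    def run(state, max_steps=5):
        for _ in range(max_steps):
            state = step(state)
            if state.get("done"):
                break
        return state
    return run

def with_compaction(step, limit=3):
    """Capability layer: keep only the last `limit` messages before
    each step, a stand-in for real context compaction."""
    def wrapped(state):
        state["messages"] = state["messages"][-limit:]
        return step(state)
    return wrapped
```

Memory, fs-ops, or context offloading would slot in the same way: as wrappers over the unchanged base loop, added only when the task needs them.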
Annabell Schaefer @annabellschfr
@ashwingop Great write-up! Seeing the advanced setup, I assume you are also using skills in some form at your company. How are you handling content duplication between skills and the company brain?
0 replies · 0 reposts · 1 like · 282 views
Annabell Schaefer @annabellschfr
Traced some Langfuse agents in Tokyo during the @langfuse/@ClickHouseDB offsite. Clear issue: they were running in a loop. Burned some tokens, decent latency
Annabell Schaefer tweet media
1 reply · 0 reposts · 10 likes · 273 views
Annabell Schaefer reposted
Clemens Rawert @rawert
🇯🇵 We're launching the @langfuse Cloud Japan data region today. Happy to be committing further to this market. It's been rewarding to spend time here and partner with our amazing Japanese community (@langfusejp) and customers. Very excited to keep pushing. What else can we do for you? Still in Tokyo until Friday, DMs open!
Clemens Rawert tweet media
1 reply · 13 reposts · 25 likes · 4.9K views