Annabell Schaefer

127 posts


@annabellschfr

thinking about agents @langfuse

Berlin, San Francisco · Joined April 2023
361 Following · 177 Followers
Pinned Tweet
Annabell Schaefer @annabellschfr
Had a great time sharing about observability and evals at AWS Summit in London today 🪢
Ghulam @ghulamio

@langfuse at AWS summit today

1 reply · 1 repost · 15 likes · 1.8K views
Annabell Schaefer @annabellschfr
@geraldrsterling And beyond that, you can also get a lot of signal from interactions with humans/the real world! Users disagreeing or ranting, suddenly raging in all caps, or silently abandoning the session are all great places to start investigating what happened
0 replies · 0 reposts · 0 likes · 12 views
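The signals in this reply lend themselves to a trivial first-pass filter over conversation transcripts. A minimal sketch, assuming a simple `{"role", "content"}` message format; the thresholds and phrase list are made-up illustrations, not a feature of any product mentioned:

```python
import re

def frustration_signals(messages):
    """Heuristic flags for user frustration in a conversation.
    Toy heuristics with arbitrary thresholds (all assumptions):
    - "shouting": a user turn whose letters are >80% upper-case
    - "disagreement": explicit pushback words in a user turn
    - "abandonment": the transcript ends on an assistant turn, i.e.
      the user never came back (worth a timing check in practice)
    Returns a list of (signal, user_turn_index) pairs."""
    signals = []
    user_turns = [m for m in messages if m["role"] == "user"]
    for i, m in enumerate(user_turns):
        text = m["content"]
        letters = [c for c in text if c.isalpha()]
        if len(letters) >= 10 and sum(c.isupper() for c in letters) / len(letters) > 0.8:
            signals.append(("shouting", i))
        if re.search(r"\b(no|wrong|that's not|incorrect)\b", text, re.IGNORECASE):
            signals.append(("disagreement", i))
    if messages and messages[-1]["role"] == "assistant":
        signals.append(("abandonment", len(user_turns) - 1))
    return signals
```

Flagged turns are starting points for investigation, not verdicts; the point is to surface traces worth a human look.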
Gerald Sterling @geraldrsterling
@annabellschfr Production traces are where agents stop being mythology. The useful signal is not average success. It is weird recoveries, skipped tool calls, and places where the model knew the next move but quietly walked into a rake.
1 reply · 0 reposts · 1 like · 16 views
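Gerald's "weird recoveries and skipped tool calls" can be turned into a review queue with a couple of shape-based rules. A sketch under an assumed minimal trace schema (`steps`, `success`, `type`, `name` are illustrative field names, not a real SDK's schema):

```python
def review_candidates(traces, expected_tools):
    """Select traces worth a human look based on shape, not average success.
    Rules (both assumptions for illustration):
    - "skipped_tool": none of the tools the task normally needs was called
    - "weird_recovery": an intermediate step errored yet the run succeeded
    Returns a list of (trace_id, reasons) pairs."""
    flagged = []
    for t in traces:
        called = {s["name"] for s in t["steps"] if s["type"] == "tool_call"}
        reasons = []
        if expected_tools and not (expected_tools & called):
            reasons.append("skipped_tool")
        if t["success"] and any(s.get("error") for s in t["steps"]):
            reasons.append("weird_recovery")
        if reasons:
            flagged.append((t["id"], reasons))
    return flagged
```

Both rules deliberately ignore the final score: a run that succeeded despite an error mid-way often reveals more than a clean failure.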
Annabell Schaefer @annabellschfr
Tracing is the foundation of any effort to level up your AI system. It's about logging the right info at the right time, so you can investigate when and how your system fails. Check out the second piece in our series 👇
Lotte @lotte_verheyden

x.com/i/article/2054…

0 replies · 0 reposts · 3 likes · 210 views
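The "right info at the right time" idea boils down to recording named spans with attributes, timing, and errors. A toy stand-in for a real tracing SDK (Langfuse, OpenTelemetry), just to show the shape of the data; none of these names are a real API:

```python
import contextlib
import time
import uuid

class Tracer:
    """Minimal trace recorder: each span captures a name, arbitrary
    attributes, wall-clock timing, and any exception that escaped."""
    def __init__(self):
        self.spans = []

    @contextlib.contextmanager
    def span(self, name, **attrs):
        record = {"id": uuid.uuid4().hex, "name": name, "attrs": attrs,
                  "start": time.time(), "error": None}
        try:
            yield record
        except Exception as e:
            record["error"] = repr(e)   # keep the failure in the trace
            raise                       # but do not swallow it
        finally:
            record["end"] = time.time()
            self.spans.append(record)
```

Wrapping each retrieval, model call, and tool call in its own span is what later makes "when and how did this fail" answerable from the trace alone.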
Prompt Assay · AI Primitives Workbench
@langfuse One thing I'd add to any loop like this: rubric drift. Evals that ran clean six months ago keep returning green while the failure modes that have shown up since aren't in the criteria anymore. Versioning the rubric as carefully as the prompt is the unsexy half.
1 reply · 0 reposts · 0 likes · 60 views
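Rubric drift can be guarded against mechanically: pin every eval result to a content hash of the rubric it was scored under, so green runs from different rubric versions are never compared directly. A minimal sketch (the function and field names are illustrative):

```python
import hashlib
import json

def rubric_version(rubric):
    """Content-hash a rubric so results are pinned to the exact
    criteria they were scored against."""
    blob = json.dumps(rubric, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def record_result(store, example_id, score, rubric):
    """Store a score together with the rubric version that produced it."""
    store.append({"example": example_id, "score": score,
                  "rubric_version": rubric_version(rubric)})
```

Any trend line over `score` should then filter on `rubric_version` first; a clean run under last quarter's criteria says nothing about failure modes added since.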
langfuse.com @langfuse
Building high-quality AI systems is hard. At Langfuse we see the best AI teams converging on a process to get complex AI systems to production. We call it the AI Engineering Loop. Check out the first piece of our series and find out more in our academy.
Annabell Schaefer @annabellschfr

x.com/i/article/2054…

1 reply · 4 reposts · 22 likes · 56.3K views
Annabell Schaefer reposted
langfuse.com @langfuse
Introducing Langfuse Academy. Our open take on the AI engineering lifecycle: tracing, monitoring, datasets, experiments, evaluation, and how the pieces connect. Link in comments.
langfuse.com tweet media
2 replies · 11 reposts · 38 likes · 352.3K views
Annabell Schaefer reposted
langfuse.com @langfuse
new langfuse.com, new brand. same mission: open source LLM engineering platform. s/o to @altalogy for the great work.
7 replies · 10 reposts · 80 likes · 22K views
Sam Altman @sama
people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)
876 replies · 289 reposts · 7.1K likes · 482.2K views
Viv @Vtrivedy10
Open Models Make Agentic Batch Processing Economically Viable

A lot of the world's work looks like "Do X for EVERY Y":
- read every trace
- respond to every email
- deep dive into every document
- enrich every lead

This is the domain of Agentic Batch Computing. Much of the world's work does not need peak frontier intelligence; it needs carefully shaped intelligence pointed at specific tasks. The holy grail here is having a tailored agent run on every single one of these data points and tasks.

The world is producing more data than ever before. To understand and process it at scale, we're going to have to point Intelligence and Compute at this. Open Models are a fantastic tool here:
- they're often an order of magnitude cheaper
- they can be finetuned (SFT & RL) to fit your exact task distribution and outperform frontier models for your task

As companies contend with rising AI costs, Open Models and specialized models become incredibly important for making sure there's good ROI on AI spend. Here to help as you let the token machine rip without busting the bank (and with better results) 🫡
Viv tweet media
10 replies · 15 reposts · 93 likes · 7.3K views
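The "Do X for EVERY Y" framing is just a loop with a cost tally, and the order-of-magnitude claim is a per-token price ratio. A sketch with made-up placeholder prices, not real model pricing:

```python
def batch_run(items, agent, price_per_1k_tokens):
    """Run an agent over every item ("Do X for EVERY Y") and tally cost.
    `agent` is any callable returning (result, tokens_used);
    `price_per_1k_tokens` is a placeholder, not a real price."""
    results, total_tokens = [], 0
    for item in items:
        result, tokens = agent(item)
        results.append(result)
        total_tokens += tokens
    cost = total_tokens / 1000 * price_per_1k_tokens
    return results, cost
```

With identical token counts, a model at a tenth of the per-token price makes the whole batch a tenth of the cost, which is the entire economic argument before finetuning even enters the picture.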
Viv @Vtrivedy10
Strong Opinions, Loosely Held on Agent + Harness Engineering:

1. You can outperform any default harness+model (including codex & claude code) on pretty much any Task by engineering the harness around it. Using the exact same model, curate prompts, tools, skills, and hooks for that Task. This harness-optimization process is becoming much more agent-driven, with humans reviewing and curating evals/rewards to hill-climb on. "Just say what you want."
2. A "general purpose" agent/harness doesn't really exist; it's a tradeoff between time spent customizing the agent and performance (cost, latency, accuracy) on a Task. I don't exactly follow what "general purpose" means, tbh. Who decides what's general and what's not?
3. But if the "general purpose" agent/harness existed, it would look like a good coding agent.
4. Building a Task-specific harness will most likely converge to good prompt & tool design (probably packaged up as a Skill) as models become smarter and better at in-context learning.
5. Evals are a moat, and thus data to produce evals is a moat. Especially true for vertical agent companies. This is because agents can fit to most eval sets today. If evals measurably encode all the good behavior your agent needs to do, then this signal can be hill-climbed to improve your agent.
6. Frontier closed models are far too expensive for the large majority of tasks the world needs to do. As teams start mapping costs to ROI, Open Model Harness Engineering will take off even more. It is almost always worth the investment to at least try for a potential 20x+ cost reduction.
7. A large chunk of design decisions around Task decomposition and context engineering exist solely because our usable context window is 50-100k. Agents that become excellent at breaking down tasks, applying compaction appropriately, and orchestrating subagents as sub-task workers will be the most delightful products for doing real work.
8. We're entering an Age of Unbundled (& Rebundled) Agents, where Subagents exposed as Tools do a ton of domain-specific work on behalf of an orchestrator agent. The Harness becomes a box that gets populated with the exact set of tools, skills, and subagents needed to solve that task or sub-task. Examples include WarpGrep (search), Chroma Context-1 (search), Nemotron 3 Omni (small multimodal), etc. Bespoke agents that rock at narrow tasks, orchestrated as tools. This also applies to software used by agents as tools via Skills, like Remotion or Blender. Different harnesses bundle together the tooling needed to complete that narrow task.

End of opinions; these may change by the time this tweet goes out, or I may double down and expand on these in an article.
51 replies · 70 reposts · 787 likes · 66.2K views
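The "harness as a box populated with the exact set of tools, skills, and subagents" idea can be made concrete as a small container assembled per task. All names, fields, and the example tool here are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """A harness as a box of task-specific packaging around one model:
    curated prompt, tools, and skills; the model itself is unchanged."""
    system_prompt: str
    tools: dict = field(default_factory=dict)   # name -> callable
    skills: list = field(default_factory=list)  # reusable instruction docs

    def with_tool(self, name, fn):
        self.tools[name] = fn
        return self

def build_harness(task):
    # Same base model everywhere; only the packaging changes per task.
    h = Harness(system_prompt=f"You solve exactly one task: {task}.")
    if task == "triage-email":
        # Hypothetical task-specific tool, purely for illustration.
        h.with_tool("label", lambda msg: "urgent" if "ASAP" in msg else "normal")
    return h
```

The tradeoff in point 2 lives in `build_harness`: every branch added there is time spent customizing, bought back as cost, latency, or accuracy on that one task.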
Annabell Schaefer @annabellschfr
@Vtrivedy10 @hwchase17 Yes, that's what I mean! Not trying to architect the whole thing at once, but adding layers/specific tools as needs arise
1 reply · 0 reposts · 1 like · 86 views
Viv @Vtrivedy10
Yes! Skill learning by using traces + evals, or good old-fashioned dogfooding, are great ways to tune a harness/agent on a task.

When you say "starting broad", did you mean starting narrow, as in a small set of harness primitives? I think that probably reflects our journey building deepagents: started with a very simple base primitive (create_agent ReAct loop) and layered on optional tooling to get capabilities the base harness lacked, like fs-ops, compaction, context offloading, memory, etc.
1 reply · 0 reposts · 0 likes · 588 views
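The journey described here (a simple base primitive plus optional layered tooling) can be sketched as capability wrappers around a single step function. Everything below is an illustrative sketch in that spirit, not the actual deepagents API:

```python
def create_agent(model_step, capabilities=()):
    """Minimal base loop: `model_step(state) -> state` runs one reasoning
    step; each capability is optional and wraps the step function."""
    step = model_step
    for wrap in capabilities:
        step = wrap(step)

    def run(state, max_steps=5):
        for _ in range(max_steps):
            state = step(state)
            if state.get("done"):
                break
        return state
    return run

def with_compaction(step, limit=3):
    """Capability layer: keep only the last `limit` messages before
    each step, a stand-in for real context compaction."""
    def wrapped(state):
        state["messages"] = state["messages"][-limit:]
        return step(state)
    return wrapped
```

Memory, fs-ops, or context offloading would slot in the same way: as wrappers over the unchanged base loop, added only when the task needs them.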
Annabell Schaefer @annabellschfr
@ashwingop Great write-up! Seeing the advanced setup, I assume you are also using skills in some form at your company. How are you handling content duplication between skills and the company brain?
0 replies · 0 reposts · 1 like · 282 views
Annabell Schaefer @annabellschfr
Traced some Langfuse agents in Tokyo during the @langfuse/@ClickHouseDB offsite. Clear issue: they were running in a loop. Burned some tokens, decent latency
Annabell Schaefer tweet media
1 reply · 0 reposts · 10 likes · 273 views
Annabell Schaefer reposted
Clemens Rawert @rawert
🇯🇵 We're launching the @langfuse Cloud Japan data region today. Happy to be committing further to this market. It's been rewarding to spend time here and partner with our amazing Japanese community (@langfusejp) and customers. Very excited to keep pushing. What else can we do for you? Still in Tokyo until Friday, DMs open!
Clemens Rawert tweet media
1 reply · 13 reposts · 25 likes · 4.9K views