kwindla

6.3K posts

@kwindla

Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai

San Francisco, CA · Joined September 2008
3.9K Following · 14.5K Followers
kwindla@kwindla·
@chelcietay @trydaily Always so much fun and so interesting talking to you! Thanks for inviting me to do it “on the record.”
Chelcie Taylor@chelcietay·
I do most of my work by talking now. So I was excited to sit down to have a conversation with @kwindla. Kwin built @trydaily real-time infrastructure for a decade. Voice teams quietly solved the hardest agent problems before anyone else had to. Healthcare is now one of the fastest AI adopters. And he asserts that every person on earth will talk to voice agents multiple times a day.
kwindla@kwindla·
> Every person on earth will talk to voice agents multiple times a day.

In 2023, when we first started showing people the voice agents we were building using the (then) new GPT-4, we almost always got the same reaction. People would say, in quick succession:

1. Wow
2. That can't be real
3. I don't want to talk to "an AI"

We got this reaction from engineers in San Francisco! We got this reaction from venture investors in San Francisco!

And my reaction to that reaction was: I'm 100% sure that every person on earth will talk to LLMs multiple times a day, every day.

I've told that story before to @chelcietay. Chelcie is a venture investor who definitely understood what was happening with voice AI very early. We sat down to record a conversation about what we've both seen over the past three years, as voice AI has become mainstream.
Notable Capital@notablecap

Every person on earth will talk to voice agents multiple times a day. @kwindla, creator of @Pipecat_ai & CEO of @trydaily, isn't hedging on that. @chelcietay talks with Kwin on how a decade of real-time infrastructure became the backbone of the voice AI moment, why healthcare is now one of the fastest AI adopters & what it's like to build a company alongside your spouse for ten years.

kwindla@kwindla·
This is a great summary of where we are in the transition to writing code almost entirely using natural language: it’s still programming; maybe the most important mental shift is to spend a *lot* of effort on constraints/tests/guardrails/cross checks; we are so, so early and the new tools are so primitive (but amazingly powerful).
Christina Cacioppo@christinacaci

I’ve been building with Claude Cowork lately, and I figured I’d share my thoughts so far. I chose Cowork over Claude Code to see what it’s like to write code but never read it. Some early thoughts:

1. I’m using natural language, instead of code, but it’s still software development. I don’t need to worry about syntax, but the concepts underlying “traditional” software still matter: observability, client-server splits, pipelines, databases, logging, etc. Non-developers still need to think like engineers to be successful vibe coders – the syntax is gone, but the judgment isn’t (yet?)
2. The primary difference between vibe coding and “traditional” software is non-determinism. Traditional software teaches deterministic thinking – you can run the same function with the same inputs as many times as you want, and the output will always be the same. (When I was learning to code, I found this one of the stranger things; most other experiences in the world aren’t so deterministic!) Agent-driven software is not deterministic – ask an LLM the same question several times, and you will get several answers. The models’ laziness and tendency to cut corners – especially when inference is scarce – adds another twist. Understanding where to demand determinism from your software agents requires judgment.
3. I borrowed the concept of zero trust from security – the principle that an actor or system is only trusted after verification and never by default – to figure out when to demand determinism. I presumed model outputs unreliable until verified with human-in-the-loop checkpoints, retry logic with comparison (LLM as judge), or (deterministic) verification scripts. I realized observability, logging, and verification have to be first-class features of a vibe-coded system if it is to be reliable. Memory doesn’t cut it.
4. I did a lot of debugging through the models’ reasoning traces, which wasn't ideal. I imagine these interactions will improve a lot. If I were building an IDE for agentic coding, I’d start here.
5. What counts as a database? Cowork uses the local file system as the database, which is lazy and convenient and makes for fast prototyping, and comes with tradeoffs: the file system doesn’t enforce data schemas, and laziness means corners get cut. The input and output checks I built at each pipeline stage were repetitive but functional. If I kept pushing, I’d end up rebuilding versions of data integrity, normalization, indexing, query optimization, etc. over the local file system.
6. The tight coupling of the local machine as the client, server, database, etc. works when developing for yourself. To share work with others, you need a better client <> server split. When I asked Cowork to port my system to the cloud, it immediately suggested moving everything to Google Drive – swapping my local file system for a cloud file system. I had to coax it toward the architecture of a simple web app.
7. Cowork doesn’t use or expect version control, which makes tracking changes and multi-agent work near impossible. Developing without version control makes clear why we invented it. Version control is alive and well in the age of software agents.

It feels like there’s a new iteration of systems design to be uncovered when building with agents. It’s not wholly new, and most of the primitives and principles of the past are still useful, but they need to be reassembled when some steps are deterministic and others aren’t. I’m excited to see what we uncover.
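
One possible reading of point 3, as a minimal sketch rather than anything taken from the post: treat every model output as unverified, run a deterministic check on it, and retry a bounded number of times. The names `verified_call` and `verify_record`, the JSON shape, and the canned outputs standing in for a model call are all hypothetical.

```python
import json


def verified_call(generate, verify, max_attempts=3):
    """Call generate() until verify(output) passes, up to max_attempts times.

    Returns (output, attempts_used). Raises ValueError if nothing verifies.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        output = generate()
        ok, reason = verify(output)
        if ok:
            return output, attempt
        failures.append(reason)  # keep reasons so failures stay observable
    raise ValueError(f"no verified output after {max_attempts} attempts: {failures}")


def verify_record(raw):
    """Deterministic check: raw must be a JSON object with an integer 'count'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("count"), int):
        return False, "missing integer 'count'"
    return True, ""


# Stand-in for a non-deterministic model call (hypothetical canned outputs).
canned = iter(['{"count": "three"}', "not json", '{"count": 3}'])
output, attempts = verified_call(lambda: next(canned), verify_record)
print(output, attempts)
```

The verifier is ordinary deterministic code, so it can be tested on its own; only the `generate` callable is non-deterministic.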

kwindla@kwindla·
@giffmana @demishassabis @antigravity @GeminiApp Same for my task agent benchmarks. 3.5 Flash more expensive than all other models, now. Crazy change from previous Gemini pricing strategy. TPS is great, though.
Demis Hassabis@demishassabis·
Gemini 3.5 Flash is amazing! - Performs better than 3.1 Pro on coding & agentic tasks - 4x faster than other frontier models - 12x faster in @antigravity - 800 tokens/sec! - Often at less than half the cost And Pro to come… Try it in @antigravity, @GeminiApp & more - enjoy!
kwindla@kwindla·
@finstratege Interesting. I didn't love the results when I tested 3.6 sparse, but if it's working well for you I should spend more time with it. Can you talk about the use cases where that model is doing well, for you?
Martin Gale@finstratege·
@kwindla for my cases I find it hard to justify maintaining anything else..
kwindla@kwindla·
Gemini 3.5 Flash is out today. Here are numbers from my main voice and task agent benchmarks. Some notes:

All the Gemini 3 models so far are too slow to work well for voice agents.

Gemini 2.5 Flash was a *great* model for voice agents, when it was SOTA. It was fast and good at instruction following. Its big weakness was tool calling. It was quite difficult to prompt Gemini 2.5 Flash to perform tool calling reliably in long context, multi-turn use cases.

With Gemini 3, Google improved the tool calling issues a lot. But time to first token is ~1s. We really need TTFT down below 700ms. Google isn't alone in this. All the SOTA models released this year have been reasoning models that aren't optimized for low latency. Claude Haiku 4.5 (released last October) remains the best-performing model with a TTFT under 700ms.

Gemini 3.5 Flash is the first Flash model in the 3 family to be released as "generally available." It's quite different from gemini-3-flash-preview, which was released last December. That model actually scored a bit better on my voice agent benchmark.

This new model is the new overall top scorer on my task agent benchmark. This benchmark tests a multi-turn task, requiring that models achieve a P50 turn execution time faster than four seconds. Gemini 3.5 Flash with a "high" thinking budget scores significantly better than any other model I've tested. So even though the TTFT isn't what we'd like to see from this model, the overall generation speed makes up for it, and allows us to use the "high" thinking budget and still achieve a per-turn P50 under two seconds. Very impressive.

This performance costs money, though. I had become accustomed to thinking of Gemini models as aggressively priced. But Gemini 3.5 Flash is actually more expensive than GPT-5.4 and Claude Sonnet 4.6 on this benchmark. Also note that lower reasoning settings don't always save money. Gemini 3.5 Flash "minimal" costs more, on this benchmark, than "high," because it makes more mistakes, so it uses more tokens to complete the task.

Please note that performance of this model on your benchmarks might be very different. My voice agent and task agent results are often wildly out of line with the reported results on standard benchmarks in the model cards and release notes. The voice agent benchmark is 30 turns, and heavily tests tool calling in a long-context scenario. The task agent benchmark injects large streams of structured data events into the context, all tool calls are asynchronous, and the test task takes at least 32 turns to complete. (My motto for evals is "30 turns or it didn't happen.")

Make your own benchmarks! (And post the source code and the results for different models, if you can.)
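
A rough sketch of two checks described in the post above: gating on a P50 turn time, and comparing the cost of a whole run rather than the price per token. This is not the benchmark's code. The function names, the prices, and every sample number are invented for illustration and are not measurements of any model.

```python
from statistics import median


def p50(samples):
    """Median (P50) of per-turn execution times, in seconds."""
    return median(samples)


def passes_turn_budget(turn_seconds, budget_s=4.0):
    """True if the P50 turn time is under the budget (four seconds in the post)."""
    return p50(turn_seconds) < budget_s


def run_cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one full run, given token totals and prices per million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Made-up turn times: one slow outlier does not move the median much.
turns = [1.2, 1.8, 1.5, 6.0, 1.7]
print(p50(turns), passes_turn_budget(turns))

# Made-up token totals at placeholder prices ($1/M input, $4/M output).
# A run that makes more mistakes takes more turns and re-reads more context,
# so it can cost more overall even though it "thinks" less per turn.
cost_minimal = run_cost_usd(900_000, 40_000, 1.0, 4.0)
cost_high = run_cost_usd(500_000, 60_000, 1.0, 4.0)
print(cost_minimal, cost_high)
```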
kwindla@kwindla·
@LazyCoda Most of the production voice agents in enterprise use cases are still on GPT 4.1!
Liza@LazyCoda·
@kwindla Thanks for sharing - saves me diving in. What do you recommend today for agents? Using the qwen35-397b-a17b at the moment on 11labs - it’s not bad: TTFB p50 452ms, p95 710ms. What do you think is best available today? Realtime models like OpenAI?
kwindla@kwindla·
@finstratege That’s pretty good. 27B dense or 35B sparse? The 27B version seems to me like it performs a lot better. But it’s more expensive to serve at scale.
Martin Gale@finstratege·
@kwindla Qwen3.5 is so fast on 11labs.. reported sub 300ms..
Scott (Human)@Dorizzdt·
Holy shit does Gemini eat tokens … for me it’s pretty much same same as Claude / GPT for tasks. It just burns tokens for 2-3x the cost though. .. Anyone else getting this or is it a skill issue on my part
kwindla@kwindla·
@nilesh__hirani GPT-4.1, Haiku 4.5. Or for simple workflows you can prompt very cleanly, self-hosted Nemotron, Qwen, or Gemma 4.
Nilesh Hirani@nilesh__hirani·
@kwindla Which LLM do you recommend for latency-sensitive workloads currently? Achieving sub-800ms total latency is hard if the LLM takes ~700ms.
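
The arithmetic behind that question, as a small sketch. The helper name `remaining_budget_ms` and the 450 ms figure are hypothetical; only the 800 ms target and the ~700 ms LLM figure come from the question.

```python
def remaining_budget_ms(total_budget_ms, stage_ms):
    """Budget left after subtracting each stage's latency (all values in ms)."""
    return total_budget_ms - sum(stage_ms.values())


# From the question: 800 ms total target, ~700 ms spent waiting on the LLM.
print(remaining_budget_ms(800, {"llm_ttft": 700}))  # 100

# Hypothetical faster model: more room for the other stages.
print(remaining_budget_ms(800, {"llm_ttft": 450}))  # 350
```

Whatever is left has to cover every other stage of the voice pipeline (for example transcription, speech synthesis, and network transit), which is why a ~700 ms model leaves almost no headroom.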
kwindla@kwindla·
@priyankinfinnov Me, too! I think open weights models will have to fill the "not quite SOTA, lower priced" gap, maybe.
Priyank@priyankinfinnov·
@kwindla The trend of pricing going higher and old models being removed very frequently worries me
kwindla@kwindla·
@KavakErkam Only for the Grok speech-to-speech model. I haven't tried the ones that aren't speech-to-speech. See the speech-to-speech table here for the Grok model numbers: github.com/kwindla/aiewf-…
erkamkavak@KavakErkam·
@kwindla I don't have benchmark code to share, but Grok models seem pretty good at TTFT and TPS (they also support non-reasoning). Do you have any benchmark runs for them?
kwindla@kwindla·
@letsbuildmore Here's a docs page about metrics: docs.pipecat.ai/pipecat/fundam… Let me know if there are still gaps. I'm not sure about logfire, but if it records everything in the MetricsFrame structures, you should have the TTFT somewhere in the logs, I think.
letsbuildmore@letsbuildmore·
@kwindla I use Pipecat but didn't know how to enable TTFT numbers. I log things via Logfire, so if you tell me where it's recorded, I can get it logged and check.
kwindla@kwindla·
@vibecoder_dc These are two different benchmarks. Voice agents need very fast TTFT. Task agents need a combination of TTFT and throughput (depending on the specifics of the task).
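
A back-of-the-envelope way to see the distinction, assuming the common approximation that response time is TTFT plus output tokens divided by tokens per second. Model A uses the ~1s TTFT and 800 tokens/sec figures mentioned elsewhere in this thread; model B and the token counts are hypothetical.

```python
def time_to_complete_s(ttft_s, output_tokens, tokens_per_second):
    """Approximate response time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_second


MODEL_A = {"ttft_s": 1.0, "tokens_per_second": 800}  # slow start, high throughput
MODEL_B = {"ttft_s": 0.5, "tokens_per_second": 100}  # hypothetical: fast start, low throughput

# Short spoken reply (hypothetical 20 tokens): TTFT dominates, so B responds first.
short_a = time_to_complete_s(output_tokens=20, **MODEL_A)
short_b = time_to_complete_s(output_tokens=20, **MODEL_B)

# Long task-agent turn (hypothetical 1000 tokens): throughput dominates, so A finishes first.
long_a = time_to_complete_s(output_tokens=1000, **MODEL_A)
long_b = time_to_complete_s(output_tokens=1000, **MODEL_B)

print(short_a, short_b, long_a, long_b)
```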
DC@vibecoder_dc·
@kwindla TTFT < 700ms obsession is like complaining about elevator speed in a building where the stairs are the main attraction. If turn P50 < 2s, latency shifts from 'is it broken?' to 'is it thinking?'. That's a feature, not a bug.
kwindla@kwindla·
@letsbuildmore Do you have TTFT numbers? I'm curious what you're seeing for your use case. (If you're using Pipecat and have metrics enabled, the TTFT numbers are reported in the MetricsFrames.)
letsbuildmore@letsbuildmore·
@kwindla I have been using gemini-3.1-flash preview for all voice AI stuff. It's been pretty good, and the response is quite fast too.
kwindla@kwindla·
The overall message was "Build stuff. And post about it." To get value from this, as part of go-to-market, my "rules" are:
- Sample code (almost always)
- It’s about the ecosystem
- You just have to put stuff out there
- Be zen about it
- But treat Twitter / LinkedIn as a job you do every day
I say "rules," but I mean just things that I've learned work for me.
Joe Heitzeberg@jheitzeb·
@kwindla @AITinkerers It was great to see you again. Thanks for presenting your amazing work. You shared four principles. One of them was Zen. Can you share those here?
kwindla@kwindla·
Thread about the @AITinkerers meetup last night in SF around the theme of building "go to market" tools with AI.

The AI Tinkerers events are so, so good, because they really are 100% "show us what you've been building."

I said at the beginning of my talk that the AI Tinkerers events remind me powerfully of being a self-taught programmer, arriving at graduate school surrounded by *very good* engineers, and realizing that the next few years would basically be an apprenticeship to learn the craft of full stack software development. (Really, really full stack: from designing hardware all the way up through new user interfaces that enabled applications that hadn't previously been possible.)

I mean, I did research and wrote papers, too. But mostly I sat next to great programmers and learned how they thought about what they were doing, what tools they used, and how they worked step-by-step on very ambitious projects.

Everything is new, right now! We're all figuring this stuff out together. Watching someone live demo, in Claude Code, a workflow they've figured out is amazing.
netto.eth (in SF, then NYC)@alextnetto

Last night I went to the GTM Engineers event at @AItinkerers in SF. 5 founders showing how they are doing it. 5 very different approaches. Here are the insights:

kwindla@kwindla·
Voice agents hackathon at @ycombinator in SF on May 30th. Prizes include a guaranteed YC interview, and special awards from sponsors @cekuraAi, @NVIDIAAI, @AWS, and @twilio. Learn to build agents that work at scale, in production. Use tooling from Cekura to simulate and auto-improve your agents. Handle accents, noisy environments, interruptions, and customers who don't follow the expected script! Build with NVIDIA Nemotron open source models, running on AWS infrastructure. Integrate with Twilio's telephony platform. Leverage the Pipecat developer ecosystem. Join us for fun, learning, conversations with the engineers building all of the above tools, food, and prizes.