Svein Y. Willassen

552 posts

Svein Y. Willassen banner
Svein Y. Willassen

Svein Y. Willassen

@sventy

CEO, Sonett AS. Previously CEO and co-founder @ Confrere, previously CEO and co-founder @ https://t.co/32y3iVd9nv

Oslo, Norway Katılım Nisan 2008
344 Takip Edilen497 Takipçiler
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Voice agents hackathon at @ycombinator in SF on May 30th. Prizes include a guaranteed YC interview, and special awards from sponsors @cekuraAi, @NVIDIAAI, @AWS, and @twilio. Learn to build agents that work at scale, in production. Use tooling from Cekura to simulate and auto-improve your agents. Handle accents, noisy environments, interruptions, and customers who don't follow the expected script! Build with NVIDIA Nemotron open source models, running on AWS infrastructure. Integrate with Twilio's telephony platform. Leverage the Pipecat developer ecosystem. Join us for fun, learning, conversations with the engineers building all of the above tools, food, and prizes.
kwindla tweet media
English
13
31
223
24.5K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
I care a lot about 1) latency, and 2) good benchmarks. The voice models from @GradiumAI occupy in the top spot on the @covaldev voice AI benchmark leaderboard, which tests both latency and accuracy. We've made the Gradium models the default in our open source, massively multi-player, LLM game. You play the game by talking to your "ship AI." Of course, humans aren't the only talkers in the universe, these days. So @chadbailey59 has been building OpenClaw bots that play the game, too! Here's a video of an OpenClaw agent playing the game by talking to the agent's ship AI. Chad created specific personalities for both the OpenClaw bot and the ship AI, here. Gradium's voice models are very emotive, and they support voice cloning and customization. I love how morose and pessimistic the ship AI is, and how excitable and cheerful the OpenClaw bot is. This is not scripted at all. These are two voice AI agents just doing their thing.
English
9
10
70
6.3K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Last week at the SF Voice AI Meetup, I moderated a panel about multi-modal model training, with Jagadeesh Balam who works on speech models at @NVIDIAAI, Fabian Seipel of @ai_coustics, and @code_brian from Tavus. I always really enjoy the opportunity to hear from people working on models (small, large, text, audio, pixels, transformer-based, diffusion, etc)! Some notes: - Brian said "latency is solved," if you're thinking about latency as a mechanical problem. Humans take ~700ms to think about things before they respond in conversation. Current STT->LLM->TTS pipelines can beat that. What's missing is the higher-level architecture for "thinking": queuing what to talk about next, deciding what to say first and how, tracking emotional tone, etc. - Jagadesh said that as we do more and more interesting things with the models, the bar for performance goes up. Transcription was "solved" for non-realtime use cases, but now voice agents need fast and accurate transcription of very tricky strings like email addresses and mixed alphanumeric account numbers. And for speech-to-speech models, we have to clear the bar of performing well in long, multi-turn conversations. Part of the challenge here is generating very good training data. "Data simulation for training is unsolved. If it were solved, all our model roadmaps would be done by now!" I appreciate this viewpoint, because I don't think we talk enough about the challenge of having large amounts of *exactly* the right training data. - Fabian talked about how ai|coustics generates data for training very fast, very specialized audio models that improve the performance of voice agents. His team includes people who spend a lot of their time simulating room geometries, mic frequency responses, WebRTC processing artifacts, and many other things. He calls them "professional audio destroyers."
English
5
12
73
4K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
✨ Voice AI, open models, and next-generation evals hackathon at @ycombinator in SF on May 30th. ✨ We're co-hosting with @cekuraAi , and we've pulled in our friends at @NVIDIAAIDev, @AWS, and @twilio for expertise and mentoring. We'll help you build state of the art voice agents using: - NVIDIA Nemotron models - AWS SageMaker and Bedrock inference - Twilio telephony - Cekura evaluation tooling - Pipecat orchestration and Pipecat Cloud agent hosting Up for grabs: - A guaranteed YC interview - Special judges' prizes from NVIDIA, AWS, and Twilio for the most impactful and technically impressive projects Join us to learn from engineers who built all the tools you're using, compare notes with other voice AI developers, and show off your ideas! Space is limited. Apply below.
kwindla tweet media
English
21
37
253
52.7K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.) @mark_backman made a @pipecat_ai PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing: 1. Voice activity detection, with a short (200ms) trigger. 2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed. 3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context. None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection. Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response. - ✓ means the agent should respond normally (immediately) - ○ is a "short incomplete" - the agent should wait 5 seconds - ◐ is a "long incomplete" - the agent should wait 10 seconds The wait times, and the details of the prompt, are configurable, of course. Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful. The LLM in the video is GTP-4.1. We've tested the prompt and single-token adherance with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
English
39
29
369
26.5K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
If you're getting started with voice agents and Android, the Pipecat Android demo client has all the core components a client-side voice AI app needs: voice input and output, device control, and network transport. Marcus just updated the code, which now supports two WebRTC transports. The Pipecat SmallWebRTCTransport for zero-dependency, peer-to-peer connections. And the Daily WebRTC transport for large-scale production use. The demo bot also sends a video stream, which the app renders. You can actually use this code to connect to any voice AI service that implements the RTVI standard, too, not just Pipecat. The Pipecat client-side SDKs (Javascript, React, React Native, Swift, Kotlin, and C++) are part of the Pipecat ecosystem but don't depend on any server-side Pipecat components and are completely open source.
English
2
1
9
876
Svein Y. Willassen retweetledi
NVIDIA AI
NVIDIA AI@NVIDIAAI·
“When AI is open, it proliferates everywhere.” Jensen Huang explains why open models are fueling the AI revolution, activating innovation across industries, startups, researchers, students, and countries worldwide. Learn more about our open models → nvda.ws/49pHhXj
English
84
156
708
52K
Svein Y. Willassen retweetledi
NVIDIA AI Developer
NVIDIA AI Developer@NVIDIAAIDev·
Just launched #CES2026, the new open-source NVIDIA Nemotron Speech ASR model is here to solve latency drift and redundant compute. Its cache-aware streaming architecture eliminates the need for buffered inference, giving you stable, sub-100ms latency (24ms median T-T-F) and up to 3x more throughput on your GPU. 🤗 Read the technical blog with real-world results from @trydaily and @modal on @HuggingFace: nvda.ws/3Lt8m3Q
English
20
142
791
138.3K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with @modal and @pipecat_ai cloud, or run locally on an @nvidia DGX Spark or RTX 5090.
English
84
453
3.6K
274K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
New Gemini Live (speech-to-speech) model release today. Using the Google AI Studio API, the model name is: gemini-2.5-flash-native-audio-preview-12-2025 The model is also GA (general availability, so not considered a beta/preview release) on Google Cloud Vertex under this model name: gemini-live-2.5-flash-native-audio Try it out on the @pipecat_ai landing page.
kwindla tweet media
English
16
41
341
24.8K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
The team at @langchain built voice AI support into their agent debugging and monitoring tool, LangSmith. LangSmith is built around the concept of "tracing." If you've used OpenTelemetery for application logging, you're already familiar with tracing. If you haven't, think about it like this: a trace is a record of an operation that an application performs. Here's a very nice video from @_tanushreeeee that walks you through building and debugging a voice agent with full conversation tracing. Using the LangSmith interface you can find a specific agent session, then dig into what happened during each turn of the conversation. What did the user say and how was that processed by each model you're using in your voice agent? What was the latency for each inference operation? What audio and text was actually sent back to the user? Today's production voice agents are complex, multi-model, multi-modal, multi-turn systems! Tracing gives you leverage to understand what your agents are doing. This saves time during development. And it's critical in production. Tanushree shows using a local (on-device) model for transcription, then switching to using the OpenAI speech-to-text model running in the cloud. You can see the difference in accuracy. (Using Pipecat, switching between different models is a single-line code change.) Also, the video is fun! It's a French tutor. Which is a voice agent I definitely need.
kwindla tweet mediakwindla tweet mediakwindla tweet media
English
3
15
42
2.4K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
.@davitb , CEO of @krispHQ, publishes a must-read weekly Voice AI Newsletter and hosts a regular podcast. I joined Davit and @klemensimonic, co-founder and CEO of @soniox_ai, to talk about the current state of real-time AI transcription. It's relatively easy to build a voice agent proof of concept, today. But we often see product teams get stuck on the path from POC to production. Many voice agent products *are* scaling rapidly. I think of the POC-to-production challenges primarily as "best practices" problems. Which models work best for real-world voice agents? How do you evaluate agent performance? How do you deal with noisy environments? What kind of context management do you need to build on top of your basic transcription->LLM->voice loop to maximize success rates? How do you integrate with existing systems (customer databases, support knowledge bases, telephony stacks)? What does production infrastructure look like? We touched on all of these topics in the Davit's podcast, plus latency, accuracy, and moving from "transcription" to "speech understanding."
kwindla tweet media
English
3
4
25
1.5K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Pipecat Thanksgiving day release. 🦃 Some highlights: Deepgram AWS SageMaker realtime speech-to-text support, improved text aggregation, simplified and more powerful error handling, new MiniMax Speech 2.6 HD and Turbo models. SageMaker is AWS's AI platform for deploying and using machine learning models at scale. AWS has brand new support for streaming data in and out of models hosted on SageMaker, which is great for voice AI use cases. This Pipecat release includes a generic base class for SageMaker "bidirectional streaming," plus a new `DeepgramSageMakerSTTService` class. Text aggregation and error handling are important fundamental jobs that a realtime agent framework needs to do well for the widest possible range of models, APIs, and use cases. Different APIs chunk streaming text differently. For different use cases, you might want different aggregation strategies. (For example, feed one sentence of LLM output at a time to your voice generation service.) And managing multi-turn context as accurately as possible requires different strategies depending on what the APIs you are using can do. (For example, whether your TTS model can give you word-level timestamps or not). Good error handling requires both managing the very different approaches to error handling that different services have, and giving developers good application-level ways to catch, respond to, and log errors. The more services Pipecat supports, and the more different kinds of things people use Pipecat for, the more work these abstraction layers need to do! This Pipecat release includes several new text aggregation and error handling frame types and methods. The goal of these improvements is to make common use cases work better with less application-level code required, while also making it easier to build robust error handling for complex applications. Finally, the MiniMax Speech models are getting great reviews. Thank you to the MiniMax team for the implementation!
kwindla tweet media
English
2
1
21
2.2K
Svein Y. Willassen retweetledi
Aleix Conchillo Flaqué
Aleix Conchillo Flaqué@aconchillo·
Smart Turn is @trydaily's open source AI model to detect when a user is *really done* talking. Today, we are announcing Smart Turn v3. 8MB model, 12ms CPU inference and 23 languages! This is just huge! 🔥🚀 daily.co/blog/announcin… Soon to be available in @pipecat_ai !
English
3
7
14
1K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Blog post with more details about the new v3 version of Smart Turn: daily.co/blog/announcin… Training and inference code on GitHub: github.com/pipecat-ai/sma… Model weights and all data sets are on @huggingface: huggingface.co/pipecat-ai The Krisp Turn-Taking model, integrated into their suite of voice AI models: krisp.ai/blog/turn-taki… The Ultravox context-aware endpointing (turn detection) model: ultravox.ai/blog/ultravad-…
English
1
1
3
625
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
Voice-only programming with the new OpenAI Realtime API ... I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing. This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context dependent way, with tool calling, is a *hard* problem for today's models. Natural language dictation requires a very high degree of contextual intelligence, instruction following accuracy, and tool calling reliability. Today's new gpt-realtime model is quite good at this hard problem. The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real world tasks that were past the edge of the "jagged frontier" before. Here's a video showing a couple of fun (and tricky) modes of voice input.
English
16
28
293
50.8K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
GPT-5 is out in the world! Here's a single-file voice agent powered by GPT-5. All you need is an OpenAI API key and Python. ``` export OPENAI_API_KEY=sk_proj-... uv run gpt-5-voice-agent .py ``` The first time you run this, it will take about 30 seconds to install all the dependencies, accept connections, and begin processing audio and video. For voice AI use cases, you probably want these parameter settings for GPT-5. service_tier: priority reasoning_effort: minimal verbosity: low Note that using the "priority" service tier doubles the cost per token. Having this option is great for latency sensitive, conversational voice applications.
English
17
18
152
9.1K
Svein Y. Willassen retweetledi
kwindla
kwindla@kwindla·
A voice agent powered by gpt-oss. Running locally on my macBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.
English
52
119
1.3K
202.1K