Aleix Conchillo Flaqué

1.7K posts


@aconchillo

a tiny schemer. engineering @trydaily and @pipecat_ai core maintainer.

Greater Los Angeles Area · Joined November 2009
66 Following · 489 Followers
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
NVIDIA Nemotron 3 Super launches today! We've been building voice agents with Super's pre-release checkpoints and running all our various tests and benchmarks.

Nemotron 3 Super matches both GPT-5.4 and GPT-4.1 in tool calling and instruction following performance on our realtime conversation, long context, real-world benchmarks. GPT-4.1 is the most widely used LLM today for production voice agents. So an open model that performs as well as GPT-4.1 on hard, voice-specific benchmarks is a big deal.

(Side note: we don't think a benchmark "tells the story" about a model's voice agent performance unless it tests model correctness across at least 20 human/agent conversation turns.)

The Nemotron models are *fully* open: weights, data sets, training code, inference code. Nemotron 3 Super is 120B params, with a hybrid Mamba-Transformer MoE architecture for efficient inference. You can run it on NVIDIA data center hardware or on a DGX Spark mini-desktop machine. 1M token context.

Blog post with full benchmarks, thinking budget notes, inference setup on @Modal, and where we think this goes next. 👇
13 replies · 34 reposts · 230 likes · 19.1K views
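For context, wiring an open model like this into a Pipecat voice agent usually means serving it behind an OpenAI-compatible endpoint. A minimal sketch, assuming a vLLM-style server (on Modal, a DGX Spark, etc.); the endpoint URL and model id are placeholders, and the import path varies across Pipecat versions:

```python
# Sketch: pointing a Pipecat agent at a self-hosted, OpenAI-compatible
# Nemotron endpoint. URL and model id below are assumptions, not official.
import os

from pipecat.services.openai import OpenAILLMService  # path may vary by Pipecat version

llm = OpenAILLMService(
    api_key=os.getenv("LLM_API_KEY", "not-needed-for-local"),
    base_url="http://localhost:8000/v1",  # your vLLM / inference server
    model="nvidia/nemotron-3-super",      # hypothetical model id
)
```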
Aleix Conchillo Flaqué @aconchillo
Hi @Microsoft! Can you help me recover my son's account? He can't sign in to Minecraft anymore, which (as you can imagine) is a big deal. Still waiting to hear back from account.live.com/acsr. I'd really appreciate any help. Happy to continue via DM.
10 replies · 0 reposts · 1 like · 93 views
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Voice-controlled UI. This is an agent design pattern I'm calling EPIC, "explicit prompting for implicit coordination." Feel free to suggest a better name. :-)

In the video, I'm navigating around a map, conversationally, pulling in information dynamically from tool calls and realtime streamed events.

There are two separate agents (inference loops) here: a voice agent and a UI control agent. They know about each other (at the prompt level) but they work independently.
19 replies · 29 reposts · 407 likes · 13.8K views
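An illustrative sketch of the EPIC pattern, not code from the demo: two independent inference loops, each with a system prompt that names the other agent, coordinating only through the shared conversation stream. All names here are invented for illustration:

```python
# Two agents, coordinated only at the prompt level: every user turn is
# broadcast to both loops, and each runs its own inference independently.
import asyncio

VOICE_PROMPT = """You are a voice assistant for a map app. A separate UI-control
agent sees the same conversation and moves the map. Don't narrate UI actions;
just talk to the user."""

UI_PROMPT = """You control a map UI. A separate voice agent handles conversation.
For each turn, emit tool calls (pan, zoom, add_marker) when the conversation
implies a map change. Never produce speech."""

async def fan_out(source: asyncio.Queue, sinks: list[asyncio.Queue]) -> None:
    # Broadcast each user turn so both agents see the full conversation.
    while True:
        turn = await source.get()
        for sink in sinks:
            sink.put_nowait(turn)

async def agent_loop(system_prompt: str, turns: asyncio.Queue) -> None:
    while True:
        user_text = await turns.get()
        # Run this agent's own inference with system_prompt here:
        # the voice agent speaks; the UI agent executes tool calls.
        ...
```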
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Benchmarking LLMs for voice agent use cases. New open source repo, along with a deep dive into how we think about measuring LLM performance.

The headline results:

- The newest SOTA models are all *really* good, but too slow for production voice agents. GPT-4.1 and Gemini 2.5 Flash are still the most widely used models in production. The benchmark shows why.
- Ultravox 0.7 shows that it's possible to close the "intelligence gap" between speech-to-speech models and text-mode LLMs. This is a big deal!
- Open weights models are climbing up the capability curve. Nemotron 3 Nano is almost as capable as GPT-4o. (And achieves this with only 30B parameters.) GPT-4o was the most widely used model for voice agents until quite recently, so a small open weights model scoring this well is a strong indication that production use of open weights models will grow this year.

Voice agents are a moderately "out of distribution" use case for all of our SOTA LLMs today. Literally, in the sense that there's not enough long, multi-turn conversation data in the training sets. Everyone who builds voice agents knows this intuitively, from doing lots of manual testing. (Vibes-based evals!)

This benchmark scores LLMs quantitatively on instruction following, tool calling, and knowledge retrieval in long-context, multi-turn conversations.
8 replies · 19 reposts · 98 likes · 7.1K views
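The shape of such a benchmark is easy to sketch. This is not the repo's code, just a minimal illustration with the OpenAI SDK: replay a long scripted conversation turn by turn and grade each assistant turn, with the `check` callables standing in for the instruction-following/tool-call graders:

```python
# Minimal multi-turn eval loop: accumulate the full conversation context so
# later turns test long-context behavior, and score each assistant reply.
from openai import OpenAI

client = OpenAI()

def score_conversation(model: str, turns: list[dict], checks: list) -> float:
    messages = [{"role": "system", "content": "You are a voice agent. Keep replies short."}]
    passed = 0
    for user_turn, check in zip(turns, checks):
        messages.append({"role": "user", "content": user_turn["text"]})
        resp = client.chat.completions.create(model=model, messages=messages)
        reply = resp.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": reply})
        passed += check(reply)  # e.g. did it follow the instruction / call the right tool?
    return passed / len(checks)
```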
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Voice-only programming with Claude Code ... Today I've been playing with @aconchillo's MCP server that lets you talk to Claude Code from anywhere. I always have multiple Claudes running, and I often want to check in on them when I'm not in front of a computer.

Here's a video of Claude doing some front-end web testing, hitting an issue and getting input from me, and then reporting that the test passed. In the video the Pipecat bot is using Deepgram for transcription and Cartesia for the voice. (Note: I sped up the web testing clickety-click sections of the video.)

The code for the MCP server and the Claude skill are in the repo, and Aleix wrote a really good README.md. You can use any of Pipecat's network transports: generally WebRTC, but you could set this up so you can call Claude on the phone if you wanted to. There's screen capture support, too, so you can view the Claude Code window remotely. That's still a little experimental.

Because this is an MCP server, it's not specific to Claude Code. Try it in other environments! It should work in Clawdbot, Codex, etc ...
7 replies · 5 reposts · 35 likes · 2K views
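For a sense of how small an MCP surface like this can be, here is a minimal sketch using the official `mcp` Python SDK. This is not Aleix's actual server; the `ask_user` tool and the `input()` stand-in are illustrative only:

```python
# Minimal MCP server shape: one tool the coding agent can call to reach a
# human. A real implementation would route this through a Pipecat voice
# pipeline (STT/TTS over WebRTC or phone) instead of stdin.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("voice-bridge")

@mcp.tool()
def ask_user(question: str) -> str:
    """Relay a question to the user over the voice channel; return the reply."""
    return input(f"{question} > ")  # stand-in for the voice round-trip

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```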
kwindla @kwindla
Pipecat 0.0.99 is a pretty big release! 25 items in the "Added" section, including vision (image input) support for OpenAI Realtime, word-level timestamps in AzureTTSService, the @krisp_ai VIVA turn detection model, and Grok Realtime voice-to-voice.

There's also a fundamental new abstraction in this release: turn and interruption "strategies."

We started working on Pipecat in 2023. (!) In those early days, we had just a few STT, TTS, and LLM models we could use for voice agents. The only turn detection option was Silero VAD. We were building fairly simple pipelines and targeting fairly simple use cases.

All of that has changed. There are more than 90 services now in Pipecat core. Speech-to-text (transcription/ASR) models increasingly do much more than transcription, including turn detection, and with widely differing configuration options and API events. You can build Pipecat pipelines with speech-to-speech models, with STT->LLM->TTS cascades, or even using both in the same agent.

0.0.99 introduces a new way to configure and develop the "user turn start," "user turn stop," and "user mute" code in your pipelines. As always in Pipecat, the goals are: to make things work consistently no matter what services you're using in your pipeline, to provide standard components that do things most people want to do, and to make it easy to extend these standard components to do things that are unique to your application.

Try out 0.0.99's turn strategies and let us know what you think of these new building blocks.
4 replies · 4 reposts · 33 likes · 1.7K views
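The tweet doesn't show the new API, so the following is only a hypothetical sketch of what a pluggable turn strategy could look like. The class and method names are invented for illustration and are not Pipecat's actual 0.0.99 interface:

```python
# Hypothetical sketch: the point is the shape of the abstraction -- a
# pluggable policy for deciding "user turn stop" instead of hard-coded
# VAD thresholds scattered through the pipeline.
from dataclasses import dataclass

@dataclass
class TurnDecision:
    end_of_turn: bool
    confidence: float

class PauseLengthTurnStrategy:  # illustrative name, not a Pipecat class
    def __init__(self, min_silence_ms: int = 700):
        self.min_silence_ms = min_silence_ms

    def on_user_silence(self, silence_ms: int) -> TurnDecision:
        # A smarter strategy could consult an audio turn-detection model
        # (e.g. Smart Turn) here instead of a fixed silence threshold.
        done = silence_ms >= self.min_silence_ms
        return TurnDecision(end_of_turn=done, confidence=0.9 if done else 0.3)
```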
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Pipecat Cloud is @trydaily's enterprise hosting platform for open source voice agents. Today, after a 9-month beta period, we're promoting Pipecat Cloud to General Availability!

With Pipecat Cloud, you build your voice agent on @pipecat_ai's open source, vendor neutral core, add your custom code and agent logic, and then "docker push" to Pipecat Cloud. As with everything we do, Pipecat Cloud is engineered to give you flexibility, to not lock you into any service, including Pipecat Cloud itself. Any code that you can host on Pipecat Cloud you can self-host with no changes at all.

We've focused on delivering:

- fast agent start times (P99 <1s)
- multi-region hosting
- optimized global network transport
- direct connectivity to Twilio, Telnyx, Plivo, Exotel and other telephony providers
- built-in @krispHQ VIVA models for noise reduction and turn detection
- integrations with all the AI services, observability tools, and everything else supported by Pipecat

You can sign up and "pipecat cloud deploy" immediately. We also have enterprise support contracts and can work with you to deploy a single-tenant, enterprise version of Pipecat Cloud in your VPC. Feel free to contact us if you have questions.
4 replies · 6 reposts · 29 likes · 1.3K views
Aleix Conchillo Flaqué retweeted
Daily @trydaily
Pipecat Cloud is now generally available. Pipecat Cloud is a managed, vendor-neutral platform for deploying and scaling open source voice agents, with ultra-low latency, multi-region support, and enterprise-grade realtime infrastructure. Thank you to the more than 1,000 teams that built and scaled with Pipecat Cloud during the platform beta.
1 reply · 3 reposts · 9 likes · 305 views
Aleix Conchillo Flaqué retweeted
Daily @trydaily
🎉 We are proud to support @nvidia's new Nemotron models, announced today at CES 2026. We've been building high-performance voice agents with the new NVIDIA Nemotron Speech ASR model and integrating this model into Pipecat.

Nemotron Speech ASR is completely open (weights, training data, inference tools), designed from the ground up for low-latency use cases like voice agents, and scores very well on our benchmarks. It also runs cost-effectively at large scale.

Congratulations to the NVIDIA team on their open model breakthroughs, and stay tuned for news all week from CES.

Learn more: blogs.nvidia.com/blog/open-mode…
0 replies · 2 reposts · 7 likes · 921 views
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
This robot assistant from the NVIDIA CES keynote on Monday is going viral. @NaderLikeLadder explains all the hottest emerging AI trends in one demo: AI applications in 2026 will be multi-model, multi-modal, hybrid cloud/local, use open source models as well as proprietary models, control robots and embedded devices in the physical world, and have voice interfaces. (And the demo had a cute robot *and* a cute dog. Gold.)

The demo was built with @pipecat_ai. NVIDIA posted a really nice technical walk-through and complete code. The Reachy Mini robot from @huggingface is open source hardware. (You can order it now; I have one!) You can run the assistant locally on your own hardware, in the cloud, or both.
27 replies · 101 reposts · 503 likes · 48.6K views
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model: 24ms transcription finalization and total voice-to-voice inference time under 500ms.

This agent actually uses *three* NVIDIA open source models:

- Nemotron Speech ASR
- Nemotron 3 Nano 30B in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model

These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)

The code for this agent is open source too, of course. You can deploy it to production with @modal and @pipecat_ai cloud, or run locally on an @nvidia DGX Spark or RTX 5090.
84 replies · 456 reposts · 3.6K likes · 273.2K views
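The cascade described here is Pipecat's standard STT->LLM->TTS pipeline shape. A sketch of that shape, with service construction elided since the specific NVIDIA service classes aren't named in the post:

```python
# Standard Pipecat cascade: in the demo, stt would be Nemotron Speech ASR,
# llm the Nemotron 3 Nano quant, and tts the Magpie preview checkpoint.
from pipecat.pipeline.pipeline import Pipeline

def build_pipeline(transport, stt, llm, tts, context_aggregator):
    return Pipeline([
        transport.input(),               # audio in (WebRTC/phone)
        stt,                             # speech -> text
        context_aggregator.user(),       # append user turn to LLM context
        llm,                             # text -> text
        tts,                             # text -> speech
        transport.output(),              # audio out
        context_aggregator.assistant(),  # append assistant turn to context
    ])
```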
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
I've been playing with the new Lemon Slice realtime video avatar model that launched today. Here's a clip of a couple of avatars I created: a cartoon astronaut and a guide for the space game side project I've been hacking on.

The guide avatar supports the Lemon Slice /imagine command, which changes the video on the fly. You can see me type "/imagine a working space suit with tools and velcro patches and stuff" and see what the Lemon Slice model does with that prompt!

The idea for the astronaut character was to create something that felt like a fully realized cartoon animation. I used Nano Banana to create the character image, then used that image as the basis for the Lemon Slice avatar. I'm a big fan of models that can do cartoon and non-photorealistic avatars really well. I think there's a lot of interesting terrain to explore in this direction and would love to see talented designers create environments that emphasize imagination rather than "virtual reality."

For the second character, I fired up Claude Code in the repo for the Gradient Bang game, and asked it to create an LLM prompt for a guide for newbies:

> Create a prompt for an LLM that will guide new players in the Gradient Bang game universe. Include basics about the game, and good strategies for players who are just starting out. Include enough detail that you can answer questions about game mechanics and strategy. Make the prompt about 15 paragraphs long.

Lots more information about what the model can do, in the launch thread below ...
4 replies · 4 reposts · 39 likes · 3.5K views
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
New Gemini Live (speech-to-speech) model release today.

Using the Google AI Studio API, the model name is: gemini-2.5-flash-native-audio-preview-12-2025

The model is also GA (general availability, so not considered a beta/preview release) on Google Cloud Vertex under this model name: gemini-live-2.5-flash-native-audio

Try it out on the @pipecat_ai landing page.
16 replies · 42 reposts · 342 likes · 24.8K views
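A quick way to try the new model name directly is the `google-genai` SDK's Live API; a minimal sketch, with the audio capture/playback wiring omitted:

```python
# Connect a Live API session to the new model name from the post.
import asyncio
from google import genai
from google.genai import types

async def main():
    client = genai.Client()  # AI Studio key via GOOGLE_API_KEY
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview-12-2025",  # AI Studio name
        # On Vertex (GA), use: gemini-live-2.5-flash-native-audio
        config={"response_modalities": ["AUDIO"]},
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Say hello.")])
        )
        async for message in session.receive():
            ...  # stream returned audio chunks to your playback/transport

asyncio.run(main())
```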
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
The team at @langchain built voice AI support into their agent debugging and monitoring tool, LangSmith.

LangSmith is built around the concept of "tracing." If you've used OpenTelemetry for application logging, you're already familiar with tracing. If you haven't, think about it like this: a trace is a record of an operation that an application performs.

Here's a very nice video from @_tanushreeeee that walks you through building and debugging a voice agent with full conversation tracing. Using the LangSmith interface you can find a specific agent session, then dig into what happened during each turn of the conversation. What did the user say, and how was that processed by each model you're using in your voice agent? What was the latency for each inference operation? What audio and text was actually sent back to the user?

Today's production voice agents are complex, multi-model, multi-modal, multi-turn systems! Tracing gives you leverage to understand what your agents are doing. This saves time during development. And it's critical in production.

Tanushree shows using a local (on-device) model for transcription, then switching to the OpenAI speech-to-text model running in the cloud. You can see the difference in accuracy. (Using Pipecat, switching between different models is a single-line code change.)

Also, the video is fun! It's a French tutor. Which is a voice agent I definitely need.
3 replies · 15 reposts · 42 likes · 2.2K views
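The per-turn visibility described here comes from nesting traced spans. A tiny sketch of that primitive with LangSmith's `@traceable` decorator; the stage functions are placeholders, not the actual integration code:

```python
# Each traced function becomes a span; nested calls show up as child spans,
# so per-stage latency within a conversation turn is visible in LangSmith.
from langsmith import traceable

@traceable(name="transcribe")
def transcribe(audio_chunk: bytes) -> str:
    return "bonjour"  # stand-in for the STT call

@traceable(name="generate_reply")
def generate_reply(transcript: str) -> str:
    return f"You said: {transcript}"  # stand-in for the LLM call

@traceable(name="conversation_turn")
def handle_turn(audio_chunk: bytes) -> str:
    return generate_reply(transcribe(audio_chunk))
```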
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Pipecat 0.0.97 release. Some highlights:

- Support for @GradiumAI's new speech-to-text and text-to-speech models. Gradium is a voice-focused AI lab that spun out of the non-profit Kyutai Labs, which has been doing architecturally innovative work on neural codecs and speech-language models for the last two years.

- Continued improvements in the core text aggregator and interruption handling classes, both to fix small corner cases and to make behavior as configurable as possible. This is the kind of often-invisible work that underpins Pipecat's ability to support a wide range of models and pipeline "shapes." Models stream (or don't stream) tokens differently. Different use cases need to make different engineering trade-offs in the service of natural, low-latency interactions.

- Similarly, continued steps towards full support of reasoning models. Mostly, reasoning models haven't been used in voice AI pipelines, because we are generally prioritizing low latency. But, increasingly, we are using multiple models in parallel in voice agents. Thinking fast and slow, as it were. Using reasoning models requires updating `LLMContext` abstractions to thread thought signatures into the conversation context, and handling function call internals slightly differently.

- Access to word timestamps from the @cartesia speech-to-text model.

- The Smart Turn model service now defaults to the new v3.1 weights and uses the full current utterance rather than only the most recent fragment.
4 replies · 10 reposts · 53 likes · 4.8K views
Aleix Conchillo Flaqué retweeted
kwindla @kwindla
Smart Turn v3.1. Smart Turn is a completely open source, open data, open training code turn detection model for voice AI, trained on audio data across 23 languages.

The model operates on the input audio in a voice agent pipeline. Each time the user pauses briefly, this model runs and returns a binary decision about whether the user has finished speaking or not.

The 3.1 release has two big improvements:

1. New data sets for English and Spanish, collected and labeled by contributors Liva AI, Midcentury, and MundoAI. The majority of the training data for the Smart Turn model is synthetically generated. Using synthetic data makes it possible to scale up training for a model like this. We've done a lot of work on the synthetic data pipeline to emulate as much of the natural variability of human speech as possible. But accurately labeled human data is very valuable and has a measurable impact on model quality. The 3.1 training run incorporates three new human data sets.

2. An unquantized, GPU-oriented version of the model alongside the ONNX version intended to run on CPUs. The Smart Turn ONNX quant delivers a result in 12ms on my laptop and 70ms on a typical cloud vCPU. That's fast! Because this is an audio model, you can run it in parallel with transcription and it will generally give you a result before the transcription final chunks are available. But if you have GPUs in your fleet, you can run the model even faster. (Or, more to the point, very scalably.) Inference runs in ~2ms on an NVIDIA L40S.

Read the launch blog post if you're interested in more details. And if you're running this model yourself, see notes in the blog post about ONNX runtime optimization. The Smart Turn model is fully integrated into Pipecat, and available in Pipecat Cloud.
12 replies · 39 reposts · 300 likes · 35.4K views
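Running an ONNX turn-detection checkpoint like this is only a few lines with `onnxruntime`. A sketch with a placeholder file name and an assumed 16 kHz float32 waveform input; the real tensor names and preprocessing live in the Smart Turn repo:

```python
# Binary end-of-turn decision from an ONNX checkpoint on CPU; run this in
# parallel with transcription each time the user pauses.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "smart-turn-v3.1.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)

def user_finished_speaking(audio: np.ndarray) -> bool:
    """audio: float32 waveform of the current utterance (assumed 16 kHz)."""
    input_name = session.get_inputs()[0].name
    (probs,) = session.run(None, {input_name: audio[np.newaxis, :]})
    return float(np.squeeze(probs)) > 0.5
```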
Aleix Conchillo Flaqué retweeted
Maxim Makatchev @maxipesfix
Last week, just a day before Gemini 3 was released, @susuROBO helped run a 1-hour version of the @pipecat_ai x @GeminiApp hackathon at Osaka College of High Technology @osaka_hightech 大阪ハイテクノロジー専門学校. We used Pipecat's SmallWebRTC Prebuilt repo as a starting point (thanks @kwindla and @aconchillo!), which allowed even freshmen to finish the hour running a multimodal voice agent on their laptops.

Ironically, but not surprisingly, what they built in under an hour was in a few ways more advanced than the Alexas they got as prizes.
0 replies · 2 reposts · 5 likes · 3.2K views
Aleix Conchillo Flaqué retweeted
Sarvam for Developers @SarvamForDevs
Build production-ready real-time voice agents using @pipecat_ai and @SarvamAI. Join our hands-on workshop and learn how to ship low-latency, scalable voice experiences in minutes.

When: Monday, November 25th | 7:00 PM IST
Register here: luma.com/anmvmr9d
3 replies · 7 reposts · 30 likes · 10.4K views