Pipecat AI

223 posts


@pipecat_ai

100% open source framework for realtime voice and multimodal AI. Maintained by @trydaily engineering team with support from the Pipecat developer community.

Joined May 2024
3 Following · 4.5K Followers
Pipecat AI retweeted
Daily @trydaily
Today's @NVIDIA Nemotron 3 Super launch is an exciting development for voice AI developers. We’re proud to be a launch partner, with day-0 @pipecat_ai support. Developers now have a meaningful open stack for realtime voice, with @NVIDIAAI — Nemotron 3 Nano, Nemotron Speech ASR, Nemotron 3 Super. Open models, open training data. Review how Nemotron 3 Super matches proprietary models in our long-conversation voice agent benchmarks. Happy building, with open source!!
Pipecat AI retweeted
kwindla @kwindla
NVIDIA Nemotron 3 Super launches today! We've been building voice agents with Super's pre-release checkpoints and running all our various tests and benchmarks.

Nemotron 3 Super matches both GPT-5.4 and GPT-4.1 in tool calling and instruction following performance on our realtime-conversation, long-context, real-world benchmarks. GPT-4.1 is the most widely used LLM today for production voice agents. So an open model that performs as well as GPT-4.1 on hard, voice-specific benchmarks is a big deal. (Side note: we don't think a benchmark "tells the story" about a model's voice agent performance unless it tests model correctness across at least 20 human/agent conversation turns.)

The Nemotron models are *fully* open: weights, data sets, training code, inference code. Nemotron 3 Super is 120B params, with a hybrid Mamba-Transformer MoE architecture for efficient inference. You can run it on NVIDIA data center hardware or on a DGX Spark mini-desktop machine. 1M token context.

Blog post with full benchmarks, thinking budget notes, inference setup on @Modal, and where we think this goes next. 👇
Pipecat AI retweeted
kwindla @kwindla
One of my 2026 predictions is that we're going to see a lot of interesting new experiments with LLM-powered games. There are just so, so many possibilities. The main barrier is inference cost. But that's dropping fast.

My friends Vanessa and Sunah have been tinkering with a voice game called Crush Quest. Crush Quest has multiple characters, a bunch of really good prompting, and you can play on the web or (clone the repo and) wire up a telephone number. It's, you know, totally open source and that's radical.

As you can maybe tell from my hip use of slang, Crush Quest is set in the early 1990s. It's an homage to a classic electronic board game called Dream Phone. Check out the thread below for a link to the most perfectly 1991 TV commercial for Dream Phone. I can taste the Lucky Charms when I watch this commercial.

h/t to @chelcietay, who I had a great conversation with recently about our 2026 predictions and where social and gaming is going.
Pipecat AI retweeted
kwindla @kwindla
Brand new speech-to-speech model from @OpenAIDevs today! GPT Realtime 1.5 achieves a very nice jump in tool calling and instruction following performance on our voice agent benchmarks.

@charlierguo's demo video shows a great example of perfect performance on a hard end-to-end audio understanding and speech production task: the model captures a seven-character order number (mixed letters and numbers) and repeats it back.

The demo video made me hungry. I definitely need some Inference Chips with my OpenAI Neural Net Burger.
OpenAI Developers @OpenAIDevs

Voice workflows just got stronger with gpt-realtime-1.5 in the Realtime API. The model offers more reliable instruction following, tool calling, and multilingual accuracy. Demo with @charlierguo

Pipecat AI retweeted
Daily @trydaily
Benchmarking Claude Sonnet 4.6 for Voice AI. @kwindla mentions we'll be talking benchmarks and voice AI at our Thursday meetup - link in thread.
kwindla @kwindla

Claude Sonnet 4.6 scores 100%, with a median TTFT of 850ms, on our standard LLM Voice Agent performance benchmark. It's currently the fastest model that saturates this benchmark.

I also re-ran the numbers for the whole leaderboard, and Claude Haiku 4.5 scored 98% with a TTFT of 637ms. This puts Haiku in front of GPT 5.1 in the rankings, and a bit better in "intelligence" than GPT 4.1, but 100ms slower.

This is the first time we've had an Anthropic model that's a really good fit for most of our voice agent use cases. And now we have two! Claude models have always had great instruction following, tool calling, and conversational dynamics. But they've been slower than the other SOTA models. That's changed.

One reason to re-run a benchmark like this is that latency changes. We continuously monitor latency for all the models we regularly use. But a specific run of a long-format benchmark like this is a bit different from our standard monitoring. Another reason, though, is that models like Claude, Gemini, and the GPT family are hosted systems and they evolve. A good rule of thumb is that changes in model behavior are probably your own code rather than real changes on the provider side. But that's not always true. And this performance jump for Claude Haiku 4.5 over the past two months is dramatic.

I recently fixed some corner cases in tool call handling and improved the judging prompts in this benchmark. So I'll re-run Claude Haiku 4.5 against the benchmark code from 2 months ago, at some point, because I'd like to understand whether I previously had bugs that unfairly penalized Haiku. But either way, whether the model has gotten better or we've ironed out some issues with the benchmark, Haiku is impressive and is worth experimenting with if you are a voice AI developer.
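TTFT here is time to first token over a streaming completion. A minimal sketch of how such a measurement can be taken against any token iterator — the `time_to_first_token` helper and the fake stream are illustrative assumptions, not the actual benchmark harness:

```python
import time

def time_to_first_token(stream):
    """Measure TTFT (in seconds) for any iterator of tokens.

    Returns (ttft, tokens): ttft is the delay between starting to
    consume the stream and receiving the first token; an empty
    stream reports infinity.
    """
    start = time.monotonic()
    ttft, tokens = None, []
    for tok in stream:
        if ttft is None:
            # First token arrived: record elapsed wall-clock time.
            ttft = time.monotonic() - start
        tokens.append(tok)
    return (ttft if ttft is not None else float("inf")), tokens
```

In a real run you would wrap the provider's streaming API call and take the median over many conversations, which is how a figure like "850ms median TTFT" is produced.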

Pipecat AI retweeted
kwindla @kwindla
Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

@mark_backman made a @pipecat_ai PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing:

1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response:

- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete": the agent should wait 5 seconds
- ◐ is a "long incomplete": the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
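The dispatch side of single-token tagging is simple enough to sketch. The helper below is a hypothetical illustration of the idea, not Pipecat's actual mixin API; the tag characters and wait times are the ones described above, and the `parse_turn_tag` name and return shape are assumptions:

```python
# Sketch of single-token tag dispatch, assuming the LLM is prompted to
# emit exactly one tag character at the start of every response.
COMPLETE = "\u2713"          # ✓  turn complete: respond immediately
SHORT_INCOMPLETE = "\u25CB"  # ○  short incomplete: wait ~5 s
LONG_INCOMPLETE = "\u25D0"   # ◐  long incomplete: wait ~10 s

# Configurable, per the post.
WAIT_SECONDS = {COMPLETE: 0.0, SHORT_INCOMPLETE: 5.0, LONG_INCOMPLETE: 10.0}

def parse_turn_tag(llm_response: str) -> tuple[float, str]:
    """Strip the leading tag and return (seconds_to_wait, reply_text)."""
    if llm_response and llm_response[0] in WAIT_SECONDS:
        return WAIT_SECONDS[llm_response[0]], llm_response[1:].lstrip()
    # Untagged output: treat the turn as complete and respond right away.
    return 0.0, llm_response
```

Because the tag is a single token at a fixed position, the pipeline can branch on it as soon as the first streamed token arrives, which is what makes this a near-zero latency trigger compared to a tool call.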
Pipecat AI retweeted
kwindla @kwindla
Wake up, babe. New Pareto frontier chart just dropped.

Benchmarking STT for voice agents: we just published one of the internal benchmarks we use to measure latency and real-world performance of transcription models.

- Median, P95, and P99 "time to final transcript" numbers for hosted STT APIs.
- A standardized "Semantic Word Error Rate" metric that measures transcription accuracy in the context of a voice agent pipeline.
- We worked with all the model providers to optimize the configurations and @pipecat_ai implementations so that the benchmark is as fair and representative as we can possibly make it.

Entirely open source. You can run the benchmark yourself and reproduce the results.
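The "Semantic Word Error Rate" metric itself isn't specified in the post, but the plain word error rate such metrics typically build on is easy to sketch: word-level Levenshtein distance divided by reference length. The normalization rules below (lowercasing, keeping only alphanumerics and apostrophes) are assumptions for illustration, not the benchmark's actual rules:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens; an assumed normalization."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Single-row dynamic program; substitutions, insertions, and
    # deletions all cost 1.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution
    return d[len(hyp)] / max(len(ref), 1)
```

A *semantic* variant would go further, e.g. by not penalizing substitutions that preserve meaning ("two" vs "2"), which is presumably what makes the published metric more representative for voice agent pipelines.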
Pipecat AI retweeted
Rime @rimelabs
Today we're thrilled to announce our newest flagship TTS model, Arcana v3!

- 120ms latency
- Multilingual in 10+ languages
- Word-level timestamps
- 100+ concurrency
- Cloud + on-prem

Start building with it today!