Mark Backman

25 posts

@mark_backman

Los Angeles, CA · Joined June 2011
55 Following · 130 Followers
Gustavo Garcia@anarchyco·
@mark_backman @livekit @pipecat_ai @TenFramework Tried to fix those points, thank you for the feedback 🙇. Let me know if you want me to rewrite anything else (here or via DM). Regarding orchestration, my theory is that it can be done with a simpler HTTP service, without realtime agent frameworks. I'll try to share my thoughts soon.
Gustavo Garcia@anarchyco·
I wrote some notes about different RealTime AI Agents frameworks that I've been looking at recently (@livekit Agents, @pipecat_ai and @TenFramework): medium.com/@ggarciabernardo/realtime-ai-agents-frameworks-bb466ccb2a09 Hopefully it is interesting for somebody else. Feedback welcomed.
[tweet media]
Mark Backman retweeted
kwindla@kwindla·
Better/faster/cheaper voice AI turn detection with Gemini 2.0

The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating.

There are a number of ways to approach this. You can:
1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses.
2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics.
3. Train a specialized phrase endpointing model that operates directly on audio.
4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0.

We've benchmarked all four of these, and Gemini 2.0 currently beats the other approaches. Using Gemini is also cheaper than transcribing the audio separately with a transcription service or model.

Here's a short video showing Gemini phrase endpointing in two scenarios: first, correctly handling pauses in natural conversation; second, requesting a phone number (a common activity in a use case like customer support). You can see the "Completeness check" lines in the terminal output, printed each time Gemini processes a chunk of audio.
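Approach 1 above (pause-based endpointing on top of voice activity detection) can be sketched as a toy loop. This is an illustrative sketch, not code from the thread: a simple RMS energy threshold stands in for a real VAD model, and the frame size, threshold, and 800 ms silence window are made-up parameters.

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (a list of samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def find_end_of_turn(frames, frame_ms=20, silence_ms=800, threshold=0.01):
    """Return the index of the frame at which the user's turn is considered over:
    the first frame completing `silence_ms` of consecutive low-energy frames.
    Returns None if the user never pauses long enough."""
    needed = silence_ms // frame_ms   # consecutive quiet frames required
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) < threshold:
            quiet += 1
            if quiet >= needed:
                return i
        else:
            quiet = 0                 # speech resumed; reset the pause timer
    return None

# Toy stream: 30 loud frames (speech) followed by 50 quiet frames (silence).
speech = [[0.5, -0.5, 0.5, -0.5]] * 30
silence = [[0.0, 0.0, 0.0, 0.0]] * 50
print(find_end_of_turn(speech + silence))  # → 69: 40 quiet frames after frame 29
```

The weakness the thread points at is visible even in the toy: the loop cannot tell a mid-thought pause from a finished turn, which is exactly why semantic endpointing (approaches 2-4) helps.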
Mark Backman retweeted
Joe Heitzeberg@jheitzeb·
Correctly identifying guitar sounds with Google Gemini 2 voice-to-voice
Joe Heitzeberg@jheitzeb

Great to have this repo! thanks @kwindla! Gemini 2 Voice-to-Voice = the fastest voice interface? Pipecat means fast transport. It's clearly hearing non-verbal audio but struggles to describe it, and can't sing or laugh. Great for practical apps. This repo rocks!

Mark Backman@mark_backman·
@trydaily and @Google have partnered to help you build real-time AI agents with the new Gemini Multimodal Live API. Really proud of the team working on @pipecat_ai! Get started here: docs.pipecat.ai/guides/feature….
Daily@trydaily

Gemini 2.0 launched today. Amazing multimodal capabilities, long context windows, fast response times, built-in tools, and top-of-the-leaderboards reasoning capabilities. Plus a new API, the Multimodal Live API, for conversational AI applications like voice agents and multimodal copilots.

@Google and Daily have partnered to build Multimodal Live API support into the @pipecat_ai Open Source SDKs for Web, Android, iOS, and C++. The Pipecat SDKs come with echo cancellation and noise reduction, device management, event abstractions, React hooks, and more. They support both direct connections to the Gemini WebSocket API and WebRTC routing on Daily's global ultra-low latency network.

Build realtime voice agents with Gemini, Pipecat, and Daily. Links to docs and starter kits in the thread below (1/4)...

Mark Backman@mark_backman·
@kwindla I was just thinking about this yesterday! I’ll have to fire it up now to see how it stacks up.
kwindla@kwindla·
Thinking today about how far voice AI has come over the past year.

On Dec 5 last year we released Talk to Santa Cat - iOS and Android voice AI apps that let you (or maybe a child in your life) talk to a "cat who lives in Santa's workshop." As far as I know, Santa Cat was the first talk-to-an-AI-Santa experience.

The cat theme was a result of iterating on Santa voices and keyframe animations ... and feeling like every attempt to ship "Santa" fell into the uncanny valley. Just weird enough to be too weird. Not quite weird enough to be interesting. But an 8-bit Santa's Cat with a squeaky voice: pretty fun! And GPT-4 has always been great at performing gentle sassiness and puns, so we leaned into that with the Santa Cat prompting. (Props to @petehawkes for the design work, including all the many iterations required to get to the sweet spot.)

This was state-of-the-art voice AI at the time. And every kid who tried it loved it. But compared to voice agents today, everything in the Santa Cat app feels so much slower and clunkier.

The core components are not necessarily different from what you'd use to build a production conversational voice app today. The voice was @ElevenLabs. The LLM was @OpenAI GPT-4 (there was no 4o yet). The speech-to-text was @DeepgramAI. The orchestration layer was an early version of @pipecat_ai. However:
- No interruption handling.
- VAD was significantly worse.
- Latency was just bad enough that, after a bunch of user testing with kids, we decided the app needed audio cues to signal "bot thinks you stopped speaking" and "bot stopped speaking and expects you to talk now."

All the models have gotten so, so, so much better over the past year! Faster. More capabilities. More predictable behavior across a wider range of use cases. More choices/competition in every category.

The app is still in the app store. You can download it if you want to try a blast from the (not too distant) voice AI past ...
Mark Backman retweeted
Aleix Conchillo Flaqué@aconchillo·
We have achieved so much in @pipecat_ai! Thank you all! The community is amazing and keeps growing (1,445 on Discord)! It's probably the most complete and powerful conversational AI orchestration framework out there (and this list is even missing a few things!):
[tweet media]
Mark Backman@mark_backman·
I had a lot of fun building out this demo. Check it out!
kwindla@kwindla

Talk to (a bootleg) virtual @benthompson

[Meta-note: I recorded this video in a Waymo. So you're watching an AI experience inside an AI experience.]

We did an internal voice AI hackathon a couple of weeks ago at @trydaily. Several of us are long-time @stratechery fans; @mark_backman had the idea of creating a "talk to Ben Thompson" toy demo. This kind of project is a really nice testbed for combining RAG with voice. I'll put some notes about building voice + RAG below, but if you just want to jump to a live demo, there's a link further down in this thread.

The tech stack here breaks down into two parts: preparing and indexing the data, and running the live experience. There are lots and lots of choices right now for chunking, embedding, and storage/retrieval tooling. Mark used these:
- @spacy_io for semantic chunking
- @OpenAI text-embedding-3-small
- @pinecone to store the embeddings

The live app uses:
- @OpenAI GPT-4o mini
- function calls trigger a @langchain query
- the voice is a @cartesia clone
- @pipecat_ai does the low-latency phrase endpointing, interruption handling, context management, and orchestration
- @trydaily Daily Bots voice transport and Pipecat hosting
- the demo app is hosted on @vercel

A link to the full source code of the app (but not the copyrighted Stratechery content) is in the thread below.

Several things about building this are tricky:
- Latency really matters, and it's hard to make function calling + RAG fast. This was an experiment, not a production app, but Mark was still able to get the median total (voice-to-voice) response time down below 1.5s. In general, we aim for ~800ms for conversational voice AI response times, so this is slower than we want these experiences to be. But the median here isn't terrible. The outliers do feel too slow, though.
- RAG is complicated to get right. Mark did a lot of experimenting with chunking and embeddings.

I think this definitely clears the 80/20 bar of being an interesting demo. I'm interested in what you think if you try it! For a production app, we'd want to do significantly more work on the retrieval subsystem. The quality of the data fetch heavily influences the quality of the conversational output.

I'm convinced that talking to an LLM "personality" is going to be a very, very common thing in the near future. Sometimes we'll talk to personalities that are slices of real people's public personas. Like this one. I also think there will be hugely popular personas that are "natively AI," personalities that are not based on a specific, real person.

These new apps pose interesting, interrelated questions about copyright, user expectations and desires, and UI design. We trained this app on copyrighted Stratechery content and cloned a real person's voice. This is clearly a copyright violation, and of course we'll take the demo down if Ben Thompson objects to it being publicly available. Note that it's not possible to retrieve the copyrighted material here; it's only possible to get GPT-4o mini's "remix" of the content.

We needed a behind-a-paywall corpus of content to build an interesting RAG demo, because today's large LLMs are trained on much of the freely accessible information on the Internet. There are two things to note about that:
1. You can build a decent "clone" of a public person's personality just by creatively prompting a state-of-the-art LLM. You won't necessarily get the specific content grounding you probably want, but writing style usually shines through pretty nicely.
2. Almost all of the content used for training state-of-the-art LLMs is copyrighted, even when it's not behind a paywall. Courts haven't yet ruled on whether mixing *a lot* of copyrighted content from many sources together constitutes a legal use of copyrighted content. Perhaps this falls under the category of "fair use." Perhaps not.

I went in person to see the Eldred v. Ashcroft oral argument at the Supreme Court in 2002. The court's decision in that case upheld the Copyright Term Extension Act. That felt momentous at the time, and wrongly decided. It seems certain that there will be an even more momentous case about how copyright law applies to large model training. Perhaps our highly polarized Congress will find a way to pass new laws that extend and clarify copyright for this new era. If so, we should hope that corporate lobbyists aren't the primary authors of that law, as they were with the DMCA.

We did not use any of the Stratechery podcast content for this demo, because adding multi-modal, multi-person content was beyond the scope of a hackathon project. But it sure seems like you'd want to add all of that great audio source material to a bigger, production-quality, authorized version of an app like this. It's less obvious to me whether you would want to try to add in non-Stratechery content from Ben Thompson. (NBA commentary and analysis!) Thompson maintains a "no-tech" X account, @NoTechBen. Should the content separation that makes sense on X also port over to the new generative AI personality world?

Anyway, this is now a very long post ... so go play with the demo if you're so inclined. Link in the next tweet.
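The indexing-and-retrieval half of the stack described in the thread (semantic chunks → embeddings → vector store → query at function-call time) can be sketched without the real services. This is an illustrative toy, not the demo's code: a bag-of-words vector stands in for text-embedding-3-small, an in-memory class stands in for Pinecone, and the two example chunks are invented placeholder sentences, not Stratechery content.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyIndex:
    """In-memory stand-in for a vector store: upsert chunks, query by similarity."""
    def __init__(self):
        self.items = []  # (chunk_text, embedding) pairs

    def upsert(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def query(self, text, top_k=1):
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

index = ToyIndex()
index.upsert([
    "Aggregators own the customer relationship.",
    "The smiling curve says value accrues at the ends of the supply chain.",
])
# At function-call time, the voice agent would pass the user's question here
# and feed the returned chunks into the LLM context.
print(index.query("who owns the customer relationship?"))
```

In the real pipeline each of these stand-ins is a network call (embedding API, vector-store query), which is why the thread flags function calling + RAG as the hard part of the latency budget.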

Mark Backman@mark_backman·
RT @kwindla: Motivated by #Twilio's announcement that Twilio Video is going away, I've been spending some time digging into what the latest…
Mark Backman@mark_backman·
My team has been working on a fun holiday project: Talk to Santa Cat. Santa Cat is a kid-friendly, AI-powered virtual character designed to have Christmas-themed conversations with people of all ages, especially kids. santacat.ai
Mark Backman retweeted
Daily@trydaily·
📣 We’re excited to release color theming, our first major customization option for Daily Prebuilt, our embeddable #WebRTC call interface. Developers can now customize the colors of the video call UI, including the background, text, and icons. Read on ⏬
Mark Backman retweeted
Christian Stuff@Regaddi·
🎶 Do you wanna build a frontend? C'mon let's go & code! We'll work together remotely, and then you'll see, it's a happy place to be! We're gonna be best buddies, but now we're not, 'cause I don't know you yet. Do you wanna build a frontend? Or are you more into fullstack? Ok bye
kwindla@kwindla

@tranhelen We are hiring remote full stack, front end, devops, video, and support engineers at @trydaily. Roles are described here: daily.co/jobs (but please reach out if you're interested in what we're doing and don't see a role that's an exact fit).
