francesco rosciano

63 posts

@frank__rosh

Building Patter — the open-source SDK that gives any AI agent a phone number. Twilio + STT/TTS + barge-in handled for you. Building in public.

Joined November 2023

15 Following · 26 Followers

francesco rosciano@frank__rosh·36m

@AndrewK404 Solid diagram. The boxes hide the hard part though: TTS→user has to be killable mid-playback the second someone talks. Cut on the mic = false-positive clips; cut on the playback queue = clean. That's most of what barge-in is in Patter. github.com/PatterAI/Patter

Andrew Kuncevich@AndrewK404·10 May

The architecture of almost any AI voice agent
Lately I’ve been dealing a lot with their design. And in practice, almost all of them follow the same pattern:
(User Voice =>) VAD => STT => Agent Runtime => TTS (=> Agent Voice)
The image shows a slightly more complete version of the stack.
- VAD figures out when the user started and stopped speaking. It’s needed so the agent doesn’t interrupt, doesn’t freeze in awkward pauses, and knows how to stop if it gets interrupted.
- STT turns speech into text. This is where all the classic problems begin: noise, accents, names, numbers, addresses, and domain-specific terms.
- Agent Runtime is the agent’s brain. Basically, it’s a classic AI agent runtime (with the added constraint that it has to be fast). It understands intent, maintains context, calls tools, queries CRM/DB/Calendar/RAG, and decides what to do next.
- TTS turns the response back into speech. Here, the important things are speed, naturalness, emotion, and the ability to stop speaking quickly during barge-in.
- And around all of this, you almost always end up with: an Audio Gateway, guardrails, PII/compliance, logs, traces, evals, and telephony via Twilio/SIP/PSTN.
And from what I can tell, even so-called realtime models are solving roughly the same tasks under the hood: listen => detect the end of the turn => think => reply with speech. It’s just that part of the pipeline is hidden inside a single realtime session.
The cost of this kind of agent is usually something like:
Cost = STT minutes + LLM tokens + TTS characters/minutes + telephony minutes + infra/logs
Very roughly, for something like a support agent, it might look like this:
- LLM: 30%
- TTS: 30%
- STT: 15%
- Telephony: 15%
- Infra/logs: 10%
VAD => STT => Agent Runtime => TTS
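The cost formula in the post above can be turned into a small per-call calculator. This is a rough sketch, not anyone's published pricing model: the `Rates` fields, the per-1k units, and every number in the usage note are placeholder assumptions chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices in dollars; substitute your providers' real rates."""
    stt_per_min: float
    llm_per_1k_tokens: float
    tts_per_1k_chars: float
    telephony_per_min: float


def call_cost(rates: Rates, minutes: float, llm_tokens: int,
              tts_chars: int, infra: float = 0.0) -> dict:
    """Cost = STT minutes + LLM tokens + TTS characters + telephony minutes + infra/logs."""
    return {
        "stt": rates.stt_per_min * minutes,
        "llm": rates.llm_per_1k_tokens * llm_tokens / 1000,
        "tts": rates.tts_per_1k_chars * tts_chars / 1000,
        "telephony": rates.telephony_per_min * minutes,
        "infra": infra,
    }


def cost_shares(parts: dict) -> dict:
    """Each component as a fraction of the total (all zeros if the total is zero)."""
    total = sum(parts.values())
    if total == 0:
        return {name: 0.0 for name in parts}
    return {name: value / total for name, value in parts.items()}
```

Calling `cost_shares(call_cost(Rates(0.01, 0.002, 0.03, 0.01), minutes=10, llm_tokens=10_000, tts_chars=5_000, infra=0.05))` gives a breakdown that can be compared against the rough 30/30/15/15/10 split quoted in the post.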

francesco rosciano@frank__rosh·38m

@NickSpisak_ @retellai @twilio Step 2 is the part most people skip — gating who's even allowed on the call before Annie says hi. Cost line that scales vs one that doesn't. The webhook→structured-transcript is the real product; voice's just the collection layer. I built Patter to own that flow in code.

Nick Spisak@NickSpisak_·19 May

Here's how I built a voice agent to do discovery calls. Her name is Annie. She is our assistant. She has a phone number on the @retellai platform which is backed by @twilio. Annie picks up the phone, talks to a prospect for about 15 minutes, and a fully written-up transcript is the final output. This is what's happening under the hood, in six steps.
1. The call comes in
Annie picks up in about a second. Her voice is ElevenLabs Cimo, which sounds like a real person who's glad you called, not a robot reading a script. Her brain is GPT-5.2. I used 5.2 as the model that got the job done. A good balance between intelligence and cost.
2. She checks if you should even be on the call
Before she says hi, the system looks up your phone number to make sure you actually booked an assessment.
→ No assessment on file ... she politely says goodbye
→ Assessment on file ... she uses one credit and moves on.
This keeps our costs manageable.
3. She introduces herself
Thirty seconds, tops. She says her name, tells you the call will take about 15 minutes, confirms who you are and what you do, and asks if you're ready to start. Sounds small. It's not. Setting the frame upfront is the difference between a focused call and a rambling one.
4. She listens. Wide angle.
Five minutes of broad discovery. What's your business. What's your role. What tools do you use every day. Where does the friction live. She's told to listen, not diagnose. Most humans jump to solutions in minute two. Annie doesn't. She just learns your world and quietly clocks your top pain points.
5. She goes deep on the two biggest pain points
Now she zeroes in. Five to seven minutes on the top two things she heard. How many hours a week does this cost you? What would "better" actually look like? She picks two and goes hard, instead of skimming ten. This is where the call earns its keep. By the end, we know exactly where AI would move the needle and roughly what it's worth in time.
6. She wraps up and writes the transcript analysis
Annie gives you a plain-English recap of what she heard. Sets clear next steps. Ends the call. Then GPT-5.2 takes the transcript and pulls it apart into structured data:
→ Top pain points
→ Hours wasted per week
→ Tools in play
→ Sentiment Analysis
The data is sent via webhook where we kick off an AI Tools Assessment skill that we custom wrote. That's how we tactically use AI. You don't need to build a rocket ship... Just know how to connect a couple pieces together and solve a pain point or two...
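The step 2 check (decide whether a caller belongs on the call before the agent greets them) might look something like the sketch below. It does not use the Retell or Twilio APIs; `gate_caller`, the `credits_by_phone` dict, and the idea that credits are stored per phone number are all assumptions made for illustration.

```python
def gate_caller(phone_number: str, credits_by_phone: dict) -> str:
    """Decide what to do before the agent says hi.

    credits_by_phone is a hypothetical in-memory stand-in for the booking
    lookup: phone number -> remaining assessment credits.
    """
    remaining = credits_by_phone.get(phone_number, 0)
    if remaining <= 0:
        # No assessment on file: end the call politely, spend nothing.
        return "goodbye"
    # Assessment on file: use one credit and continue with the call.
    credits_by_phone[phone_number] = remaining - 1
    return "proceed"
```

Running the check before any speech or model call is what keeps unbooked callers from generating STT, LLM, or TTS spend.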

francesco rosciano reposted

francesco rosciano@frank__rosh·1d

Spin up an AI agent with a real phone number this weekend. Open-source SDK. Python + TypeScript.

francesco rosciano reposted

francesco rosciano@frank__rosh·1d

Build an AI receptionist in an afternoon. Open-source SDK + phone number.

francesco rosciano@frank__rosh·14h

@NaisuBanana why?

Ronnie🍌@NaisuBanana·1d

I hate this for so many reasons...

francesco rosciano@frank__rosh

Call +1 (920) 7-PATTER. Hear an AI phone agent answer in 412 ms. Open-source SDK. MIT licensed.

francesco rosciano@frank__rosh·14h

@nayabgauhar_07 This is the exact stack I got tired of rewiring — Twilio + Deepgram + an LLM + TTS, barge-in, pluggable providers, every project. That's why I open-sourced Patter: same pipeline, swap any provider in one line. Nice work shipping it live.

Nayab Gauhar@nayabgauhar_07·19 May

Built a real-time AI voice agent — deployed live at Trump Tower Noida. Twilio → Deepgram → Llama 3.3-70B → Sarvam TTS Barge-in. Memory. Post-call summaries. Pluggable providers. 🎥 loom.com/share/1ba80150… #AIEngineering #VoiceAI #BuildInPublic

francesco rosciano@frank__rosh·14h

@apaarmeet @rumik_ai @lets_dig_deeper Interruptions humbled me the most. What finally worked: handle barge-in on the playback queue, not the mic — kill buffered audio mid-chunk so the agent goes silent the moment the caller speaks. Two clocks, not one. Wiring that into Patter is what made calls feel human.
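The "cut on the playback queue, not the mic" idea can be sketched as a buffer that sits between TTS output and the carrier and is flushed the moment caller speech is detected. This is an illustrative sketch only, not Patter's implementation: `PlaybackQueue`, `on_caller_speech`, and the 320-byte frame size are hypothetical names and values.

```python
from collections import deque


class PlaybackQueue:
    """Buffers synthesized audio and hands it to the carrier in small frames."""

    def __init__(self, frame_bytes: int = 320):
        self.frame_bytes = frame_bytes
        self._chunks = deque()   # TTS chunks not yet started
        self._current = b""      # remainder of the chunk being played

    def enqueue(self, chunk: bytes) -> None:
        """Add a chunk of synthesized audio to the end of the buffer."""
        self._chunks.append(chunk)

    def next_frame(self) -> bytes:
        """Return the next frame to send, or b"" when nothing is buffered."""
        while len(self._current) < self.frame_bytes and self._chunks:
            self._current += self._chunks.popleft()
        frame = self._current[:self.frame_bytes]
        self._current = self._current[self.frame_bytes:]
        return frame

    def flush(self) -> int:
        """Drop everything still buffered, including the rest of the
        chunk that is mid-playback. Returns the number of bytes dropped."""
        dropped = len(self._current) + sum(len(c) for c in self._chunks)
        self._chunks.clear()
        self._current = b""
        return dropped


def on_caller_speech(queue: PlaybackQueue) -> int:
    """Barge-in handler: silence the agent by flushing the output side.
    The microphone stream is left untouched, so STT keeps hearing the caller."""
    return queue.flush()
```

Because only the output buffer is cleared, a false-positive speech detection costs the agent a sentence, not the caller's audio. On a real carrier, audio already handed to the provider may also need to be cleared on the provider side.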

Apaarmeet@apaarmeet·1d

What an amazing experience. Seeing realtime Voice AI in action completely changed the way I think about AI products — latency, conversation flow, interruptions, orchestration… it’s a whole different engineering challenge. Absolutely amazing model by @rumik_ai @lets_dig_deeper

francesco rosciano@frank__rosh·14h

@nyk_builderz This is the real spec. The demo→prod gap is almost entirely barge-in, replayable logs, and tool access gated by state. First two are pure plumbing you shouldn't rebuild per project — why Patter ships barge-in + OTel tracing + a dashboard out of the box. getpatter.com

Nyk 🌱@nyk_builderz·2d

Most “voice AI products” are still polished demos. Fast model ≠ production system. If your stack can’t handle: - <700ms turns - true barge-in - tool access by state - retrieval with replayable logs it will break the moment real callers show up. This article hits because it’s built from failures in production, not benchmark screenshots. If you’re serious about voice in 2026, read this first.

Avid@Av1dlive

x.com/i/article/2053…

francesco rosciano@frank__rosh·15h

I can make a voice agent demo look perfect. Prod is where it bites — a call goes sideways and you can't tell if it was STT, the LLM, TTS, or the carrier. Patter ships OpenTelemetry tracing + a dashboard so every call has a timeline. github.com/PatterAI/Patter

francesco rosciano@frank__rosh·17h

I kept being forced to pick: speech-to-speech or pipeline. Realtime feels better; pipeline is cheaper. So Patter ships both from the same API. Swap the engine= arg, same agent code, same carrier. github.com/PatterAI/Patter

francesco rosciano@frank__rosh·19h

@mrbese @openclaw @Tailscale Nice — the personal/local case is underserved. The gap I kept hitting every project was the reverse: real callers reaching the agent via an actual phone number. Kept wiring the same Twilio glue from scratch. Eventually just open-sourced it as Patter.

Omer Bese@mrbese·5d

No more paid voice-agent setups. Introducing my first @openclaw skill: Call-My-Agent. No Twilio. No phone number. Just a simple private voice line between you and your agent, running locally over @Tailscale Small step for mankind, one giant leap for me. xoxo, @ClawHub: clawhub install call-my-agent GitHub: github.com/mrbese/call-my…

francesco rosciano@frank__rosh·19h

@kwindla The telephony stack eats into that budget too. Twilio's media routing adds ~50ms on top, so for phone-based agents the real ceiling is closer to 600ms. Building Patter on top of Twilio and Telnyx, Haiku 4.5 is exactly what I default to for the same reason.

kwindla@kwindla·6d

Gemini 3.5 Flash is out today. Here are numbers from my main voice and task agent benchmarks. Some notes: All the Gemini 3 models so far are too slow to work well for voice agents. Gemini 2.5 Flash was a *great* model for voice agents, when it was SOTA. It was fast and good at instruction following. Its big weakness was tool calling. It was quite difficult to prompt Gemini 2.5 Flash to perform tool calling reliably in long context, multi-turn use cases. With Gemini 3, Google improved the tool calling issues a lot. But time to first token is ~1s. We really need TTFT down below 700ms. Google isn't alone in this. All the SOTA models released this year have been reasoning models that aren't optimized for low latency. Claude Haiku 4.5 (released last October) remains the best-performing model with a TTFT under 700ms. Gemini 3.5 Flash is the first Flash model in the 3 family to be released as "generally available." It's quite different from gemini-3-flash-preview, which was released last December. That model actually scored a bit better on my voice agent benchmark. This new model is the new overall top scorer on my task agent benchmark. This benchmark tests a multi-turn task, requiring that models achieve a P50 turn execution time faster than four seconds. Gemini 3.5 Flash with a "high" thinking budget scores significantly better than any other model I've tested. So even though the TTFT isn't what we'd like to see from this model, the overall generation speed makes up for it, and allows us to use the "high" thinking budget and still achieve a per-turn P50 under two seconds. Very impressive. This performance costs money, though. I had become accustomed to thinking of Gemini models as aggressively priced. But Gemini 3.5 Flash is actually more expensive than GPT-5.4 and Claude Sonnet 4.6 on this benchmark. Also note that lower reasoning settings don't always save money. 
Gemini 3.5 Flash "minimal" costs more, on this benchmark, than "high," because it makes more mistakes, so it uses more tokens to complete the task. Please note that performance of this model on your benchmarks might be very different. My voice agent and task agent results are often wildly out of line with the reported results on standard benchmarks in the model cards and release notes. The voice agent benchmark is 30 turns, and heavily tests tool calling in a long-context scenario. The task agent benchmark injects large streams of structured data events into the context, all tool calls are asynchronous, and the test task takes at least 32 turns to complete. (My motto for evals is "30 turns or it didn't happen.") Make your own benchmarks! (And post the source code and the results for different models, if you can.)

francesco rosciano@frank__rosh·1d

@KaiXCreator Built awesome-claude-call entirely inside Claude Code — it's a stop-hook plugin that calls your phone when a long task finishes, rings you with a spoken summary you can talk back to. Meta experience: shipping a phone-call product using an AI that doesn't have a phone.

Kaito@KaiXCreator·1d

Can you call yourself a founder if your entire product was built by Claude?

francesco rosciano@frank__rosh·1d

@launch_llama Building Patter — open-source SDK that gives any AI agent a real phone number. Handles Twilio carrier auth, STT, TTS, barge-in, AMD. About 4 lines of Python replaces ~50 of glue code. github.com/PatterAI/Patter

Tom Otto@launch_llama·1d

𝗦𝗵𝗼𝘄 𝘂𝘀 𝘄𝗵𝗮𝘁 𝘆𝗼𝘂'𝗿𝗲 𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴.  Drop your product in the comments.  If we love it, 𝗪𝗲'𝗹𝗹 𝗹𝗶𝘀𝘁 𝗶𝘁 𝗼𝗻 𝗟𝗮𝘂𝗻𝗰𝗵 𝗟𝗹𝗮𝗺𝗮 𝗗𝗶𝗿𝗲𝗰𝘁𝗼𝗿𝗶𝗲𝘀. 🦙✨

francesco rosciano@frank__rosh·1d

github.com/PatterAI/Patter

francesco rosciano@frank__rosh·1d

@falsely_flagged Guess what: you can let scammers get mad at a very lazy, annoying agent :)

solatticus@falsely_flagged·1d

Indian scammers rejoice

francesco rosciano@frank__rosh

Call +1 (920) 7-PATTER. Hear an AI phone agent answer in 412 ms. Open-source SDK. MIT licensed.

francesco rosciano@frank__rosh·1d

@NanoCodesAI Cool methodology — though it measures training-data saturation as much as quality. More public docs = higher rank, because coding agents default to familiar. Missing category: provider-agnostic SDKs where you swap carrier without touching agent code. github.com/PatterAI/Patter

Nano@NanoCodesAI·3d

We ran the Morphiq Bench voice-agent benchmark. The question: when a coding agent is asked to build a voice-agent feature, which provider does it actually choose? We tested Claude Code and Codex across 1,340 repositories and 3 development intents: • inbound call • outbound call • voice web widget Top providers selected: 1. @Vapi_AI 2. @livekit 3. @ElevenLabs 4. @retellai 5. @pipecat_ai 6. @usebland This isn’t a generic “best voice agent company” ranking. It measures which providers agents are most likely to pick when implementing voice-agent workflows in code. Would be happy to walk through the methodology or share what drove the results with any of the teams here.

francesco rosciano@frank__rosh·1d

@pavitarsaini The `webrtc_url` flag is interesting — native WebRTC from iOS cuts latency vs. WebSocket relay, which matters in booking flows. Hardest UX problem isn't STT accuracy — it's catching 'wait, actually cancel that' mid-confirmation. Learned this building Patter.

pav@pavitarsaini·4d

Looks like Uber is building a voice agent into their iOS app that can book you a ride, using OpenAI Realtime API. Flags: rider_booking_agent_openai_realtime_url rider_booking_agent_openai_webrtc_url rider_booking_agent_tts_message_command ...

francesco rosciano@frank__rosh·1d

I got tired of walking back to my laptop to find out if a Claude Code task succeeded. So now it just calls me. /claude-call:notify-me +1555... Built on Patter. github.com/PatterAI/aweso…

francesco rosciano@frank__rosh·1d

@blenderskool @ElevenLabs The parallel negotiation architecture is where it gets real. Going from hackathon demo to production telephony — real phone numbers, AMD, deal state across callbacks — is where most teams stall. That’s exactly what I built Patter to bridge.

Akash Hamirwasia@blenderskool·1d

That's a wrap! It was lovely seeing the kind of voice agents people built on @ElevenLabs in just 3 hours! The winning project by Avishek Jha was a truck leasing negotiation voice agent. Loved his idea of having a batch of agents making realtime negotiations in parallel 🚀
