Wesley

20 posts

@SumHern27200

thinking deeply, posting unseriously

San Francisco · Joined January 2024

16 Following · 23 Followers

Wesley@SumHern27200·3d

@Cartoonhari
Prompt caching: long system prompts stop being reprocessed every turn.
Model routing: Haiku for simple fields, Sonnet only when reasoning complexity justified it.
End-of-turn signaling to Deepgram: reduce STT/inference overlap.

Kk@Cartoonhari·4d

@SumHern27200 What optimizations did you make to Claude to achieve 78% latency drop?

Wesley@SumHern27200·4d

I rebuilt a healthcare AI voice agent from GPT-4o → Claude.
Latency dropped 78%. Cost dropped 85%.
Here's what actually drove those numbers:

Wesley@SumHern27200·3d

@voiceflipme This is the exact wall we hit next. Static knowledge works until a patient asks about a specific clinic's hours or which insurance it accepts. What's your current approach: scheduled scraping or live tool calls during the conversation? Curious how you're handling the latency impact.

VoiceFlip.me@voiceflipme·4d

@SumHern27200 The GPT-4o to Claude jump is real - we got the same latency/cost win. The next wall wasn't speed, it was knowledge. 6 months on VoiceFlip: what moved bookings was the agent reading the live site - today's prices and hours - so it sells, not just answers fast.

Wesley@SumHern27200·3d

@beknabdik Fair! The orchestration did the structural lifting. But the model swap wasn't neutral. Tool use consistency under conversational variance improved meaningfully, especially on address disambiguation. The compression win matters less if structured data extraction is dropping calls.

Bek@beknabdik·4d

@SumHern27200 headline credits claude but your own thread answers it. the win was the orchestration work, not the model. compression alone cut the 40-turn cost. the swap mostly forced the rebuild that did the real lifting.

Wesley@SumHern27200·4d

Code is open source: github.com/sumtzehern/hea…
Full writeup on LinkedIn.
Building voice agents for enterprise? The architecture decisions that matter aren't in most tutorials. Happy to talk.

Wesley@SumHern27200·4d

What I'd do differently:
Build the eval harness first.
I built a p50/p99 benchmarking suite measuring inference latency, token throughput, and per-call cost, but I built it after the product. It should have been the first thing.
You can't optimize what you can't measure.
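
The thread does not show the benchmarking suite itself, so the following is only a minimal sketch of the p50/p99 idea. The helper names `time_call_ms` and `p50_p99` are hypothetical, and it uses Python's standard `statistics.quantiles` instead of whatever the actual harness relies on.

```python
import statistics
import time
from typing import Callable


def time_call_ms(fn: Callable[[], object]) -> float:
    """Run fn once and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def p50_p99(samples_ms: list[float]) -> tuple[float, float]:
    """Return the median (p50) and 99th percentile (p99) of latency samples."""
    # n=100 yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]
```

Collecting one `time_call_ms` sample per turn and reporting both percentiles keeps slow tail turns visible, which an average would hide.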

Wesley@SumHern27200·4d

What surprised me most:
Tool use consistency under conversational variance matters more than raw speed.
In healthcare, missing a function call isn't a UX issue. It's a data integrity issue.
Claude's structured data collection held up better than expected under real conditions.

Wesley@SumHern27200·4d

The cost math:
Without compression: $0.78 per 40-turn call
With compression: $0.12 per 40-turn call
85% cost reduction.
At enterprise call volumes, this is the difference between viable and not.
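
As a quick check of the arithmetic, here is a small sketch that recomputes the headline percentages from the figures quoted in this thread ($0.78 → $0.12 per call, and the 3.6s → 0.8s per-turn latency from the routing post). The function name `percent_reduction` is just an illustrative helper.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100.0


cost_cut = percent_reduction(0.78, 0.12)    # roughly 84.6, reported as 85%
latency_cut = percent_reduction(3.6, 0.8)   # roughly 77.8, reported as 78%
```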

Wesley@SumHern27200·4d

Optimization 3: Context compression
A full intake = 30-40 turns. Without compression, context balloons and the cost of each turn scales linearly with conversation length.
Solution: structured turn summaries. After each intake section, raw turns collapse into a structured summary.
Context stays under 10K tokens per call.
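
A minimal sketch of the "collapse raw turns into a structured summary" idea, under stated assumptions: the class `IntakeContext`, its method names, and the summary shape are all hypothetical, and the actual repo may structure this differently (for example, by having the model write the summary).

```python
from dataclasses import dataclass, field


@dataclass
class IntakeContext:
    # One structured summary per finished intake section.
    summaries: list[dict] = field(default_factory=list)
    # Raw turns for the section currently in progress.
    raw_turns: list[dict] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        self.raw_turns.append({"role": role, "content": text})

    def close_section(self, section: str, fields: dict) -> None:
        """Replace the section's raw turns with a compact structured summary."""
        self.summaries.append({"section": section, "fields": fields})
        self.raw_turns.clear()

    def summary_text(self) -> str:
        """Render finished sections as a short block to prepend to the next prompt."""
        return "\n".join(f"{s['section']}: {s['fields']}" for s in self.summaries)
```

Only `summary_text()` plus the live `raw_turns` would be sent on the next turn, so context size tracks the number of finished sections instead of the number of turns.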

Wesley@SumHern27200·4d

Optimization 2: Prompt caching
Healthcare intake system prompts are long: conversation flow, field definitions, validation rules, fallback behaviors.
With prompt caching, that context is cached after turn 1. Every subsequent turn is dramatically faster.
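
The thread does not show the request code, so this is a sketch only. It assumes the Anthropic Messages API, where a system text block can carry a `cache_control` marker of type `ephemeral`. The function `build_request`, the `max_tokens` value, and the model id passed in are placeholders, and no network call is made here.

```python
def build_request(system_prompt: str, history: list[dict], model: str) -> dict:
    """Build keyword arguments for a Messages API call with a cacheable system prompt."""
    return {
        "model": model,          # placeholder; pass whichever model id you use
        "max_tokens": 256,       # arbitrary value for illustration
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prompt prefix up to this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history,
    }
```

The resulting dict could be passed as `client.messages.create(**request)`. The long, unchanging instructions sit in the cached block, while the per-turn `messages` stay outside it.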

Wesley@SumHern27200·4d

Optimization 1: Model routing
Not every turn needs the same reasoning depth.
→ Claude Haiku: simple field collection (name, DOB, phone)
→ Claude Sonnet: complex reasoning (insurance disambiguation, address correction)
Result: per-turn latency 3.6s → 0.8s
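
A rough sketch of what per-field routing could look like. The field names follow the examples in the post, but the routing table, the tier labels `"haiku"` and `"sonnet"`, and the function `pick_model` are assumptions for illustration, not the repo's actual logic.

```python
# Hypothetical set of fields treated as simple slot filling.
SIMPLE_FIELDS = {"name", "dob", "phone"}


def pick_model(field_name: str) -> str:
    """Return a model tier label for the intake field being collected."""
    if field_name.lower() in SIMPLE_FIELDS:
        return "haiku"
    # Anything not known to be simple (insurance, address, ...) falls
    # through to the stronger model, trading latency for reliability.
    return "sonnet"
```

Defaulting unknown fields to the stronger tier is a deliberate choice in this sketch: a misrouted complex turn costs more than a slightly slower simple one.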

Wesley@SumHern27200·4d

The core problem with voice AI at enterprise scale:
Perceived latency above ~3 seconds = patients think the call dropped.
Every millisecond in the pipeline matters. STT + inference + TTS all compound.

Wesley@SumHern27200·4d

The agent: Alexis
Real-time healthcare appointment scheduling over phone calls.
10-step patient intake. Address validation. Confirmation emails.
Stack: Pipecat + Claude + Deepgram Nova-3 + ElevenLabs + Twilio

Wesley@SumHern27200·13 Mar

@TencentGame

QME

Wesley@SumHern27200·6 May

still refreshing my inbox like it’s a situationship 😂

Wesley@SumHern27200

failing a final round interview is worse than a breakup: 🥹 starts with “we really like your background” 👻 ends with ghosting so hard even your dad wouldn’t recognize you 10 toxic exes? light work. 1 recruiter on “we’ll get back to you”? absolutely soul-crushing 💔

Wesley@SumHern27200·6 May

Wesley@SumHern27200·1 May

especially in CS, the best way to learn is by building
if i could do it again, i’d still get the degree, but i’d spend even more time exploring things outside of class
there’s so much you can learn online for free. don’t waste your freedom.

Wesley@SumHern27200

as i wrap up my CS degree, here’s my biggest takeaway:
the degree teaches problem solving, sure, but what it really gives you is 4 years to learn, build, and try things without heavy responsibility
that freedom is the real value

Wesley@SumHern27200·1 May

Wesley@SumHern27200·1 May

CS students
not touching grass
not attending classes
not sleeping
so what even is CS anymore? just LeetCode, burnout, and resume-based ego boosts?

Wesley@SumHern27200·30 Apr

If you're a student or new grad starting to doubt yourself in this job market, try this:
Upload your resume to Google NotebookLM and create a podcast with the prompt "Tell me about this candidate and what they're capable of."
the way it hypes you up?? instant ego boost lol
