Wesley

20 posts

Wesley

Wesley

@SumHern27200

thinking deeply, posting unseriously

San Francisco Katılım Ocak 2024
16 Takip Edilen23 Takipçiler
Wesley
Wesley@SumHern27200·
@Cartoonhari Prompt caching: long system prompts, stop reprocessing every turn Model routing: Haiku for simple fields, Sonnet only when reasoning complexity justified it End-of-turn signaling to Deepgram: reduce STT/inference overlap
English
0
0
0
11
Kk
Kk@Cartoonhari·
@SumHern27200 What optimizations did you make to Claude to achieve 78% latency drop?
English
1
0
0
42
Wesley
Wesley@SumHern27200·
I rebuilt a healthcare AI voice agent from GPT-4o → Claude. The latency dropped 78%. Cost dropped 85%. Here's what actually drove those numbers
English
13
0
1
139
Wesley
Wesley@SumHern27200·
@voiceflipme This is the exact wall we hit next. Static knowledge works until a patient asks about a specific clinic's hours or insurance accepted. What's your current approach scheduled scraping or live tool calls during the conversation? Curious how you're handling latency impact.
English
1
0
0
4
VoiceFlip.me
VoiceFlip.me@voiceflipme·
@SumHern27200 The GPT-4o to Claude jump is real - we got the same latency/cost win. The next wall wasn't speed, it was knowledge. 6 months on VoiceFlip: what moved bookings was the agent reading the live site - today's prices and hours - so it sells, not just answers fast.
English
1
0
0
14
Wesley
Wesley@SumHern27200·
@beknabdik "Fair! the orchestration did the structural lifting. But the model swap wasn't neutral. Tool use consistency under conversational variance improved meaningfully, especially on address disambiguation. The compression win matters less if structured data extraction is dropping calls
English
0
0
0
2
Bek
Bek@beknabdik·
@SumHern27200 headline credits claude but your own thread answers it. the win was the orchestration work, not the model. compression alone cut the 40-turn cost. the swap mostly forced the rebuild that did the real lifting.
English
1
0
1
108
Wesley
Wesley@SumHern27200·
Code is open source: github.com/sumtzehern/hea… Full writeup on LinkedIn. Building voice agents for enterprise? The architecture decisions that matter aren't in most tutorials. Happy to talk.
English
0
0
0
21
Wesley
Wesley@SumHern27200·
What I'd do differently: Build the eval harness first. I built a p50/p99 benchmarking suite measuring inference latency, token throughput, and per-call cost, but I built it after the product. Should have been the first thing. You can't optimize what you can't measure.
English
0
0
0
16
Wesley
Wesley@SumHern27200·
What surprised me most: Tool use consistency under conversational variance matters more than raw speed. In healthcare, missing a function call isn't a UX issue. It's a data integrity issue. Claude's structured data collection held up better than expected under real conditions.
English
0
0
0
20
Wesley
Wesley@SumHern27200·
The cost math: Without compression: $0.78 per 40-turn call With compression: $0.12 per 40-turn call 85% cost reduction. At enterprise call volumes, this is the difference between viable and not.
English
0
0
0
11
Wesley
Wesley@SumHern27200·
Optimization 3: Context compression A full intake = 30-40 turns. Without compression, context balloons and costs scale linearly. Solution: structured turn summaries. After each intake section, raw turns collapse into a structured summary. Context stays under 10K tokens per call.
English
0
0
0
14
Wesley
Wesley@SumHern27200·
Optimization 2: Prompt caching Healthcare intake system prompts are long. Conversation flow, field definitions, validation rules, fallback behaviors. With prompt caching, that context is cached after turn 1. Every subsequent turn is dramatically faster.
English
0
0
0
15
Wesley
Wesley@SumHern27200·
Optimization 1: Model routing Not every turn needs the same reasoning depth. → Claude Haiku: simple field collection (name, DOB, phone) → Claude Sonnet: complex reasoning (insurance disambiguation, address correction) Result: per-turn latency 3.6s → 0.8s
English
0
0
0
34
Wesley
Wesley@SumHern27200·
The core problem with voice AI at enterprise scale: Perceived latency above ~3 seconds = patients think the call dropped. Every millisecond in the pipeline matters. STT + inference + TTS all compound.
English
0
0
0
23
Wesley
Wesley@SumHern27200·
The agent: Alexis Real-time healthcare appointment scheduling over phone calls. 10-step patient intake. Address validation. Confirmation emails. Stack: Pipecat + Claude + Deepgram Nova-3 + ElevenLabs + Twilio
English
0
0
0
40
Wesley
Wesley@SumHern27200·
failing a final round interview is worse than a breakup: 🥹 starts with “we really like your background” 👻 ends with ghosting so hard even your dad wouldn’t recognize you 10 toxic exes? light work. 1 recruiter on “we’ll get back to you”? absolutely soul-crushing 💔
English
0
0
0
95
Wesley
Wesley@SumHern27200·
especially in CS, the best way to learn is by building if i could do it again, i’d still get the degree, but i’d spend even more time exploring things outside of class there’s so much you can learn online for free. don’t waste your freedom.
Wesley@SumHern27200

as i wrap up my CS degree, here’s my biggest takeaway: the degree teaches problem solving, sure, but what it really gives you is 4 years to learn, build, and try things without heavy responsibility that freedom is the real value

English
0
0
0
57
Wesley
Wesley@SumHern27200·
as i wrap up my CS degree, here’s my biggest takeaway: the degree teaches problem solving, sure, but what it really gives you is 4 years to learn, build, and try things without heavy responsibility that freedom is the real value
English
0
0
0
85
Wesley
Wesley@SumHern27200·
CS students not touching grass not attending classes not sleeping so what even is CS anymore? just LeetCode, burnout, and resume-based ego boosts?
English
0
0
2
216
Wesley
Wesley@SumHern27200·
If you're a student or new grad starting to doubt yourself in this job market, try this: Upload your resume to Google NotebookLM, create a podcast "Tell me about this candidate and what they're capable of." the way it hypes you up?? instant ego boost lol
English
0
0
0
55