
Michael Louis @MichaelLouis_za:
Most voice AI teams delay this decision far longer than they should, and eventually hit the same wall: the LLM ends up being the single biggest driver of both latency and cost.
Michael Louis @MichaelLouis_za:
The solutions teams reach for are usually workarounds:
• Racing multiple LLM endpoints and taking the fastest response (works, but expensive; sketched below)
• Cron jobs that monitor response times and hot-swap providers
• Hosted OSS via Fireworks or Baseten (useful as a bridge, rarely the destination)
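
For illustration, here is a minimal sketch of the racing pattern: fire the same request at two OpenAI-compatible endpoints with asyncio and httpx, keep whichever answers first, and cancel the rest. The endpoint URLs and payload shape are placeholders, not anyone's real providers.

```python
import asyncio
import httpx

# Hypothetical OpenAI-compatible chat endpoints; substitute your real providers.
ENDPOINTS = [
    "https://provider-a.example.com/v1/chat/completions",
    "https://provider-b.example.com/v1/chat/completions",
]

async def call_endpoint(client: httpx.AsyncClient, url: str, payload: dict) -> dict:
    resp = await client.post(url, json=payload, timeout=10.0)
    resp.raise_for_status()
    return resp.json()

async def race(payload: dict) -> dict:
    async with httpx.AsyncClient() as client:
        tasks = [asyncio.create_task(call_endpoint(client, url, payload))
                 for url in ENDPOINTS]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # abandon the slower providers (you still pay for them)
        # Naive: if the fastest request errored, this raises; production code
        # would fall back to the remaining tasks instead.
        return done.pop().result()
```

The "expensive" caveat is visible right in the code: every request is paid for N times, and the cancelled calls are pure overhead.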
Michael Louis @MichaelLouis_za:
However, fine-tuning and self-hosting are no longer hard for modern engineering teams. With frameworks like @UnslothAI and Verl, teams can fine-tune strong models for a few hundred dollars (a minimal sketch follows) and regain control over:
• Latency
• Cost
• Data residency / regionality
• Model outputs (most importantly)
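
As a rough sketch of how little code the fine-tuning side needs, here is an illustrative QLoRA run with Unsloth and TRL. The base model name, dataset file, and every hyperparameter are assumptions to adapt, not a prescribed recipe.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model; the model name here is an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA keeps this runnable on a single GPU
)

# Attach small LoRA adapters instead of training all of the weights.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# Hypothetical dataset: one formatted call transcript per row in a "text" field.
dataset = load_dataset("json", data_files="call_transcripts.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```

The adapter-only setup is what keeps the cost in the "few hundred dollars" range: only a small fraction of the parameters are trained, so a single rented GPU is usually enough.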
Michael Louis @MichaelLouis_za:
The data prep step is also easier than most teams expect. You can reach baseline performance for common call workflows without a perfect dataset, then iterate from there. Evals are non-negotiable. When teams make this shift, we consistently see:
• ~330ms P99 time-to-first-token (measurable with the sketch below)
• 50%+ cost reductions vs managed LLM APIs
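
A quick way to check the TTFT number against your own deployment is a streaming benchmark like the one below. The base_url and model name are hypothetical; point the client at whatever serves your fine-tuned model.

```python
import time
from openai import OpenAI

# Hypothetical self-hosted, OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ttft_ms(prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="my-finetuned-model",  # hypothetical model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for _ in stream:  # the first streamed event marks the first token
        break
    return (time.perf_counter() - start) * 1000

samples = sorted(ttft_ms("Hi, I'd like to reschedule my appointment.")
                 for _ in range(200))
p99 = samples[int(0.99 * len(samples)) - 1]
print(f"P99 TTFT: {p99:.0f} ms")
```

Run it from the same region your calls originate in; network distance to the server dominates TTFT at these magnitudes.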
Michael Louis @MichaelLouis_za:
This typically only starts to make sense at ~15k+ calls per day (rough break-even math below). But that’s also when customers start noticing. Lower latency shows up immediately in:
• Conversation flow
• Turn-taking
• Perceived intelligence
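
The threshold falls out of simple break-even arithmetic between a per-call API cost and a fixed self-hosting cost. The unit costs below are entirely hypothetical, chosen only to show the shape of the calculation.

```python
# Entirely hypothetical unit costs; plug in your own.
api_cost_per_call = 0.01   # managed LLM API, assumed $ per call
gpu_cost_per_day = 150.0   # self-hosted GPU capacity, assumed $ per day

break_even = gpu_cost_per_day / api_cost_per_call
print(f"Self-hosting breaks even above ~{break_even:,.0f} calls/day")
# -> 15,000 with these assumed numbers
```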
Michael Louis @MichaelLouis_za:
If voice AI is core to your product, treating the LLM as a permanent black box is leaving UX and margin on the table. Happy to chat if you’re thinking about making this shift.