abhijit
You're right, @altryne. The Realtime API keeps its own internal copy of the entire conversation history and re-sends all of the previous input and output tokens every turn. You can see this in the usage metrics. That's why the Realtime API is so expensive: audio uses a lot of tokens, and a multi-turn conversation re-sends all the tokens from every previous turn.

Cost numbers without caching:

- a typical one-minute conversation costs about $0.10
- a five-minute conversation costs about $1.80
- a 10-minute conversation costs about $6.00

Here's a calculator: docs.google.com/spreadsheets/d…

Context caching should result in a huge price drop, which is really exciting!

Relatedly, it's difficult to manage conversation context with the Realtime API right now, because you have to keep your own client-side copy of the context, you can only selectively delete conversation items, and you can't insert conversation items. I'm sure OpenAI is also working on adding more context management capabilities to the API.
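To see why re-sending the full history makes cost grow much faster than conversation length, here's a rough back-of-the-envelope model. All of the numbers in it (tokens per minute of audio, per-token prices, turn length) are illustrative assumptions, not OpenAI's actual rates; the spreadsheet linked above has the real figures.

```python
# Sketch of why multi-turn realtime audio conversations get expensive:
# every turn re-sends the entire prior history as input, so total input
# tokens grow quadratically with the number of turns.
# All constants below are illustrative assumptions, not actual pricing.

AUDIO_TOKENS_PER_MIN = 600     # assumed audio tokens per minute of speech
INPUT_PRICE_PER_1M = 100.0     # assumed dollars per 1M audio input tokens
OUTPUT_PRICE_PER_1M = 200.0    # assumed dollars per 1M audio output tokens


def conversation_cost(num_turns: int, turn_minutes: float = 0.5) -> float:
    """Estimate cost when each turn re-sends the entire prior history."""
    turn_tokens = int(AUDIO_TOKENS_PER_MIN * turn_minutes)
    history = 0        # tokens accumulated in the conversation so far
    input_tokens = 0   # total billed input tokens across all turns
    output_tokens = 0  # total billed output tokens across all turns
    for _ in range(num_turns):
        # input this turn = the full history plus the new user audio
        input_tokens += history + turn_tokens
        # the model replies with roughly one turn's worth of audio
        output_tokens += turn_tokens
        # both the user turn and the reply join the history
        history += 2 * turn_tokens
    return (input_tokens / 1e6 * INPUT_PRICE_PER_1M
            + output_tokens / 1e6 * OUTPUT_PRICE_PER_1M)
```

The key takeaway: doubling the number of turns far more than doubles the input-token bill, because the history term dominates. Context caching attacks exactly that term, since the re-sent prefix is identical turn to turn.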

Realtime API gets 5 new voices (more expressive) and price drops with caching


