
StepFun’s new StepAudio 2.5 TTS ranks #3 on the Artificial Analysis Speech Arena Leaderboard, behind only Inworld’s Realtime TTS 1.5 Max and Google’s Gemini 3.1 Flash TTS.
StepAudio 2.5 TTS is a significant step forward for StepFun over its previous TTS models, with notably more natural-sounding speech samples. The model now edges out Eleven v3 on our current prompt set with an Elo score of 1,187.
Key takeaways:
➤ Quality: StepAudio 2.5 TTS has an Elo of 1,187 based on 834 arena appearances, placing it 28 points behind the leading model (Inworld TTS 1.5 Max at 1,215) and 8 points ahead of Eleven v3 at 1,179
➤ Pricing: The model is priced at $85/1M characters, a premium over the two leading models: Inworld TTS 1.5 Max at $35/1M and Gemini 3.1 Flash TTS at $36.6/1M
➤ Speed: The model generates 37.6 characters per second, compared to 220.5 chars/s for Inworld TTS 1.5 Max and 30.1 chars/s for Gemini 3.1 Flash TTS
➤ Prompting: StepAudio 2.5 TTS offers two ways to control speech delivery: (1) a global context prompt for overall style, and (2) inline contextual tags for more granular emotion and prosody
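
To put the quality, pricing, and speed figures above in concrete terms, here is a rough Python sketch. The numbers are the ones quoted in this post. The helper names are made up for illustration, and the win-rate function assumes the standard logistic Elo formula with a 400-point scale, which may not match how the leaderboard actually computes its ratings.

```python
# Figures quoted in the post. Only models with a stated value are listed.
ELO = {
    "Inworld TTS 1.5 Max": 1215,
    "StepAudio 2.5 TTS": 1187,
    "Eleven v3": 1179,
}
PRICE_USD_PER_1M_CHARS = {
    "StepAudio 2.5 TTS": 85.0,
    "Inworld TTS 1.5 Max": 35.0,
    "Gemini 3.1 Flash TTS": 36.6,
}
SPEED_CHARS_PER_S = {
    "StepAudio 2.5 TTS": 37.6,
    "Inworld TTS 1.5 Max": 220.5,
    "Gemini 3.1 Flash TTS": 30.1,
}


def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Chance that A is preferred over B in one head-to-head vote.

    Assumption: standard logistic Elo with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def cost_usd(model: str, n_chars: int) -> float:
    """Cost of synthesizing n_chars characters at the quoted per-1M price."""
    return PRICE_USD_PER_1M_CHARS[model] * n_chars / 1_000_000


def generation_seconds(model: str, n_chars: int) -> float:
    """Time to generate n_chars characters at the quoted throughput."""
    return n_chars / SPEED_CHARS_PER_S[model]


if __name__ == "__main__":
    step = "StepAudio 2.5 TTS"
    for rival in ("Inworld TTS 1.5 Max", "Eleven v3"):
        p = expected_win_rate(ELO[step], ELO[rival])
        print(f"{step} vs {rival}: expected win rate {p:.1%}")

    n_chars = 10_000  # hypothetical script length
    for model in PRICE_USD_PER_1M_CHARS:
        print(
            f"{model}: ${cost_usd(model, n_chars):.3f}, "
            f"{generation_seconds(model, n_chars):.0f} s for {n_chars:,} chars"
        )
```

Under that Elo assumption, a 28-point gap works out to roughly a 46% expected win rate against the leader, so the top of the leaderboard is closer than the rank ordering alone suggests.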
See more details and listen to samples below ⬇️
