
Gradium
@GradiumAI
The voice layer for modern apps and agents. Real-time, scalable voice APIs: TTS, STT, turn-taking & voice cloning. Devs: build → https://t.co/r5CdNClhI5

This post is for everyone who lives with Charcot's disease (ALS)! We’ve just released an online demo of Invincible Voice, a project created in collaboration with Olivier Goy. It knows you, listens, provides suggestions, and replies out loud in your own voice.


InteractionLabs and @GradiumAI are partnering to redefine human-robot interaction! Voice will become the primary interface between humans and machines. To get there, it needs to stop sounding like a machine. That's where Gradium comes in. They build foundational audio language models that make speech natural, expressive, and fast. At InteractionLabs, we take a fundamentally different approach to robotics. Where the industry obsesses over capabilities, we believe that before robots can be useful in the home, they must first be welcome in the home. This is just step one 🚀 More to announce soon.

What if your voice AI could interrupt you the moment it figured out your question - sometimes even before you finished asking it?

Last week, I sat down with Neil, CEO of Gradium and co-founder of Kyutai, to talk about the future of speech-to-speech models and why he believes today's cascaded voice systems will soon look "archaic and brittle."

Some highlights from our conversation:

🎯 How Kyutai built Moshi, a full-duplex conversational AI with "negative latency", in 6 months with just 4-6 people (while big tech teams had 10-20x the resources)
🧠 Why speech-to-speech models lose intelligence compared to their text counterparts (and what's being done about it)
📱 Pocket TTS: the first voice-cloning model that runs on your phone's CPU - not GPU, CPU
🤖 Why robotics and spatial audio represent the next frontier (hint: current voice systems completely break in these environments)
👶 The efficiency gap: babies learn to speak fluently from <5,000 hours of audio. Current models train on millions of hours. We're doing something wrong.

My favorite vision from Neil? The first truly contrarian AI that interrupts you mid-sentence to tell you why you're wrong. Not just more natural conversation, but actually useful for testing ideas and playing devil's advocate.

Full episode and detailed blog post linked in the comments 👇

What's your take - will speech-to-speech replace cascaded systems, or will modularity keep cascaded architectures dominant even as naturalness improves?
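For anyone who hasn't worked with voice stacks, here is a toy Python sketch of the latency argument Neil makes above. The stage timings and frame size are made-up placeholders, not measurements of any real system, and the functions are illustrative only (not Gradium or Kyutai code):

```python
# Toy latency sketch: why a cascaded stack waits for each stage to finish,
# while a full-duplex speech-to-speech model can start answering before the
# user finishes. All timings are invented placeholders.
import time

def cascaded_first_audio() -> float:
    """STT -> LLM -> TTS run strictly in sequence; latency adds up."""
    start = time.time()
    time.sleep(0.30)  # STT: detect end of utterance, then transcribe
    time.sleep(0.50)  # LLM: generate the text reply
    time.sleep(0.20)  # TTS: synthesize the first audio frame
    return time.time() - start

def full_duplex_first_audio(frames_until_confident: int = 3) -> float:
    """One model listens and speaks on every frame, so it can begin replying
    as soon as the question is predictable (the 'negative latency' idea)."""
    start = time.time()
    for _ in range(frames_until_confident):
        time.sleep(0.08)  # one step of joint listen/speak inference (~80 ms frame)
    return time.time() - start

if __name__ == "__main__":
    print(f"cascaded:    {cascaded_first_audio():.2f}s to first audio")
    print(f"full-duplex: {full_duplex_first_audio():.2f}s to first audio")
```

The point isn't the specific numbers: the cascaded path can only start speaking after all three stages complete, while a full-duplex model emits audio on the same steps in which it listens.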

Last weekend my team built a video generation platform for writers to turn their blogs into stunning visuals, in their own voice. It's now live on @Lovable.

Tech Stack:
- @Lovable for building the app.
- @fal for image and video generation.
- @GradiumAI for text-to-speech & custom voices.

Used Elena Verna's blog, image, and voice for this demo. If you're interested, click the link below.
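A rough, entirely hypothetical sketch of how such a blog-to-video pipeline hangs together; none of these helpers are real @Lovable, @fal, or @GradiumAI API calls, they only mark where each service would plug in:

```python
# Hypothetical blog-to-video pipeline; every function below is a placeholder,
# not an actual Lovable / fal / Gradium API.

def split_into_scenes(blog_text: str) -> list[str]:
    """One scene per paragraph; a real app might use an LLM to script scenes."""
    return [p.strip() for p in blog_text.split("\n\n") if p.strip()]

def narrate(scene: str, voice_id: str) -> bytes:
    """Placeholder for a TTS call with a cloned voice (the Gradium step)."""
    return f"[audio:{voice_id}] {scene}".encode()

def render_clip(scene: str) -> bytes:
    """Placeholder for image/video generation (the fal step)."""
    return f"[clip] {scene}".encode()

def build_video(blog_text: str, voice_id: str) -> list[tuple[bytes, bytes]]:
    """Pair each scene's visual with its narration; a real app would mux them."""
    return [(render_clip(s), narrate(s, voice_id)) for s in split_into_scenes(blog_text)]

if __name__ == "__main__":
    post = "Retention is a product problem.\n\nGrowth only compounds what retains."
    print(len(build_video(post, voice_id="author-voice")), "scenes assembled")
```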


We're releasing Hibiki-Zero, a new real-time and multilingual speech translation model that can translate 🇫🇷French, 🇪🇸Spanish, 🇵🇹Portuguese and 🇩🇪German to English: accurate, low-latency, high audio quality, with voice transfer. And best of all: open-source.

In AI, audio is the one area where smaller labs continue to win, beating big labs with 10x+ the budget and team. @GradiumAI + @kyutai_labs built the first realtime voice model a year before OpenAI with a tiny team (and it's open). What are the dynamics behind this? A few ideas:

1. Audio has a status problem
If you've spoken to Siri lately you know how terrible the typical audio AI experience is compared to text. Why are we only now starting to see meaningful advances in audio AI, while text has been rapidly improving every single year since 2020? This problem is actually foundational. For years audio has occupied the bottom tier of AI/ML's informal coolness hierarchy. For several reasons audio just wasn't sexy. There are also practical reasons: training data for audio is genuinely scarce compared to text. You can scrape trillions of tokens from Wikipedia, Stack Overflow, books, and papers. High-quality conversational audio is harder to come by, and much of it isn't particularly informative.

2. Audio requires incredibly specific domain expertise
Audio is a completely different beast than text. It is not just about scaling compute and data. There are a million little edges to creating elite audio models, from correct turn-taking to backchanneling and managing latency, that require deep domain expertise. Great audio models are trained by great audio researchers, and throwing money at the problem will only get you mediocrity. Your bitter lesson has no power here!! Kyutai's Moshi model has 7B parameters and was trained on 2.1T tokens. Llama 3.1 has 405B parameters trained on 15T tokens - that's orders of magnitude of difference in training cost (a rough estimate after this post). This is actually the story of Kyutai. A small group of researchers in Paris like @neilzegh and @honualx were some of the only people at Google Brain / Meta working on voice, cooking in relative obscurity in their underfunded audio divisions. They and a few others started Kyutai in 2023, the first and only open audio lab, named for the Japanese word for "sphere" (which we love).

3. SOTA audio models are built directly on research ideas, not just scale
The Gradium team has published years of inventive research that powers their leading models today. There are a lot of foundational ideas here, like SoundStream, a neural codec that can compress speech, music, and general audio at bitrates normally targeted by speech-only codecs. Then there's the full-duplex architecture that Kyutai pioneered, which finally enabled truly real-time audio. The idea is to model both streams - the user's and the model's - simultaneously, which one-shotted the turn-taking problem that had stumped researchers for years.

Point is, throwing compute at the problem doesn't work. Underlying SOTA audio models are novel research ideas that form the entire basis for the models, not fringe additions like in text. IMO Gradium is one of the most interesting companies building models today. You can read more about them in the lengthy (sorry) post linked in the reply.
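To put a number on that cost gap, here is a back-of-envelope estimate using the common C ≈ 6·N·D training-compute approximation (N parameters, D training tokens). The parameter and token counts are the ones quoted above; the 6·N·D rule is itself only a rough heuristic, and real costs depend on hardware and training recipe:

```python
# Back-of-envelope training-compute comparison using the rough C ~= 6 * N * D
# rule (N = parameters, D = training tokens). Figures are the ones quoted in
# the post above, not measured costs.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

moshi = train_flops(7e9, 2.1e12)    # Kyutai Moshi: 7B params, 2.1T tokens
llama = train_flops(405e9, 15e12)   # Llama 3.1: 405B params, 15T tokens

print(f"Moshi  ~{moshi:.1e} FLOPs")      # ~8.8e22
print(f"Llama  ~{llama:.1e} FLOPs")      # ~3.6e25
print(f"ratio  ~{llama / moshi:.0f}x")   # ~400x, i.e. well over two orders of magnitude
```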




Introducing the Bridgerclone app💃💕🫰 An app to create Victorian Era Style videos with a voice cloning feature that transforms your simple messages into a grand declaration of love. Powered by @GradiumAI. Live now 👇

Every Valentine's, I build something. This year, I built a Victorian Era Valentine video generator, so that just like the characters from Bridgerton, I can declare my love with passion. Built using @GradiumAI, @veedstudio Fabric and @fal. Launching tomorrow!

