Gradium

52 posts


@GradiumAI

The voice layer for modern apps and agents. Real-time, scalable voice APIs: TTS, STT, turn-taking & voice cloning. Devs: build → https://t.co/r5CdNClhI5

Joined September 2025
1 Following · 2.8K Followers
Pinned Tweet
Gradium@GradiumAI·
Gradium is out of stealth to solve voice. We raised $70M and after only 3 months we’re releasing our transcription and synthesis products to power the next generation of voice AI.
77 · 159 · 1.1K · 423K
Gradium retweeted
Pratim🥑@BhosalePratim·
I have joined @GradiumAI as their Lead Developer Advocate. Gradium builds voice AI models and infrastructure to support all voice applications. I'll be working on taking the developer experience and the developer community to the next level. Let's get to work now. PS: We're hiring!
39 · 6 · 146 · 13.6K
Gradium retweeted
Alexandre Défossez@honualx·
We are looking for talent able to help us shape @GradiumAI's data acquisition, annotation, and QA processes, with immediate impact on our product line and future research. Help us bring speech interaction to the next level. 👉 gradium.homerun.co
3 · 3 · 28 · 2.9K
Gradium retweeted
Alexandre Défossez@honualx·
If you are doing a PhD on generative AI, model alignment, or speech, and want to get to the forefront of speech modeling research, we are opening PhD-level internships at @GradiumAI. Come and join the team behind Mimi, Moshi, Hibiki, and PocketTTS. 👉 gradium.homerun.co
0 · 6 · 32 · 3.2K
Gradium@GradiumAI·
Technology is at its best when it solves the hardest human challenges. We’re proud to stand with @kyutai_labs in this initiative. To ensure Invincible Voice reaches everyone who needs it, Gradium is providing free API access to all patients and developers in this space.
kyutai@kyutai_labs

This post is for everyone who lives with Charcot's disease (ALS)! We’ve just released an online demo of Invincible Voice, a project created in collaboration with Olivier Goy. It knows you, listens, provides suggestions, and replies out loud in your own voice.

0 · 3 · 7 · 978
Gradium retweeted
Matt Turck@mattturck·
Voice used to be AI’s forgotten modality - now it's having its big moment: rapid innovation, big funding rounds, major agentic applications.

My conversation with @neilzegh, a top AI researcher in the field (@GoogleDeepMind, @Meta, @kyutai_labs) and now CEO of @GradiumAI. This is a reference episode on all things voice AI 🔥

00:00 Intro
01:21 Voice AI’s big moment, and why we’re still early
03:34 Why voice lagged behind text/image/video
06:06 The convergence era: transformers for every modality
07:40 Beyond Her: always-on assistants, wake words, voice-first devices
11:01 Voice vs text: where voice fits (even for coding)
12:56 Neil’s origin story: from finance to machine learning, with help from @ylecun and @soumithchintala
18:35 Neural codecs (SoundStream): compression as the unlock
22:30 Kyutai: open research, small elite teams, moving fast
31:32 Why big labs haven’t “won” voice AI
34:01 On-device voice: where it works, why compact models matter
46:37 The last mile: real-world robustness, pronunciation, uptime
41:35 Benchmarking voice: why metrics fail, how they actually test
47:03 Cascades vs speech-to-speech: trade-offs + what’s next
54:05 Hardest frontier: noisy rooms, factories, multi-speaker chaos
1:00:50 New languages + dialects: what transfers, what doesn’t
1:02:54 Hardware & compute: why voice isn’t a 10,000-GPU game
1:07:27 What data do you need to train voice models
1:09:02 Deepfakes + privacy: why watermarking isn’t a solution
1:12:30 Voice + vision: multimodality, screen awareness, video+audio
1:14:43 Voice cloning vs voice design: where the market goes
1:16:32 Paris/Europe AI: talent density, underdog energy, what’s next
12 · 24 · 123 · 22.4K
Gradium@GradiumAI·
Robots need more than mobility. They need connection. Voice is where that starts. We’re partnering with @labsinteraction to power Ongo with Gradium’s real-time audio models, delivering natural, low-latency voice that feels less like a machine and more like a companion. Full announcement in comments.
InteractionLabs@labsinteraction

InteractionLabs and @GradiumAI are partnering to redefine human-robot interaction! Voice will become the primary interface between humans and machines. To get there, it needs to stop sounding like a machine. That's where Gradium comes in. They build foundational audio language models that make speech natural, expressive, and fast. At InteractionLabs, we take a fundamentally different approach to robotics. Where the industry obsesses over capabilities, we believe that before robots can be useful in the home, they must first be welcome in the home. This is just step one 🚀 More to announce soon.

1 · 6 · 28 · 3.6K
Gradium retweeted
Neil Zeghidour@neilzegh·
If you want to learn why speech-to-speech models are dumber than their textual counterparts, or how inefficient speech models are compared with a 4-year-old baby 👶, check out this conversation with @bnicholehopkins
Brooke Hopkins@bnicholehopkins

What if your voice AI could interrupt you the moment it figured out your question - sometimes even before you finished asking it?

Last week, I sat down with Neil, CEO of Gradium and co-founder of Kyutai, to talk about the future of speech-to-speech models and why he believes today's cascaded voice systems will soon look "archaic and brittle." Some highlights from our conversation:

🎯 How Kyutai built Moshi—a full duplex conversational AI with "negative latency"—in 6 months with just 4-6 people (while big tech teams had 10-20x the resources)
🧠 Why speech-to-speech models lose intelligence compared to their text counterparts (and what's being done about it)
📱 Pocket TTS: the first voice cloning model that runs on your phone's CPU—not GPU, CPU
🤖 Why robotics and spatial audio represent the next frontier (hint: current voice systems completely break in these environments)
👶 The efficiency gap: babies learn to speak fluently from <5,000 hours of audio. Current models train on millions of hours. We're doing something wrong.

My favorite vision from Neil? The first truly contrarian AI that interrupts you mid-sentence to tell you why you're wrong. Not just more natural conversation—but actually useful for testing ideas and playing devil's advocate.

Full episode and detailed blog post linked in the comments 👇

What's your take - will speech-to-speech replace cascaded systems, or will modularity keep cascaded architectures dominant even as naturalness improves?

2 · 6 · 45 · 9.7K
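The efficiency gap in the quoted post is easy to put in rough numbers. A minimal sketch, assuming an illustrative 3 million hours for the unspecified "millions of hours" of training audio:

```python
# Rough data-efficiency gap from the quoted post.
# The post only says "millions of hours"; 3 million is an illustrative assumption.
human_hours = 5_000          # upper bound on audio a child hears before speaking fluently
model_hours = 3_000_000      # assumed training audio for current speech models

print(f"Models use roughly {model_hours / human_hours:,.0f}x more audio than a child")  # ~600x
```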
Gradium retweeted
Neil Zeghidour@neilzegh·
Very proud of the work of my PhD student Tom (whom I now advise from @GradiumAI). The main limitation of Hibiki was the need for aligned synthetic data, which required language-specific heuristics. Painful and brittle; now we don't need it anymore, we just RL the crap out of latency and BLEU score from scratch. I think this table of results vs Meta's Seamless speaks for itself.
Neil Zeghidour tweet media
kyutai@kyutai_labs

We're releasing Hibiki-Zero, a new real-time and multilingual speech translation model that can translate 🇫🇷French, 🇪🇸Spanish, 🇵🇹Portuguese and 🇩🇪German to English: accurate, low-latency, high audio quality, with voice transfer. And best of all: open-source.

0 · 3 · 57 · 5.1K
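The "RL the crap out of latency and BLEU" line suggests a reward that trades translation quality against lag. The tweet gives no details, so the sketch below is purely illustrative; the reward shape and weighting are assumptions, not Hibiki-Zero's actual objective:

```python
# Purely illustrative reward for latency-aware speech translation RL.
# The weighting and the latency term are assumptions made for this sketch.

def translation_reward(bleu: float, latency_s: float,
                       latency_weight: float = 0.1) -> float:
    """Trade off translation quality (BLEU, 0-100) against average lag in seconds."""
    return bleu - latency_weight * 100.0 * latency_s

# Example: a slightly less accurate but much faster translation can score higher.
print(translation_reward(bleu=32.0, latency_s=2.0))   # 12.0
print(translation_reward(bleu=30.5, latency_s=0.8))   # 22.5
```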
Gradium retweeted
Rohan Virani@rohan_virani·
I first met @neilzegh and the @kyutai_labs team in Paris in 2024. Their enthusiasm was epidemically infectious and unsurprisingly led to Moshi, the world's first real-time full duplex voice model, and the amazing work they're doing at @GradiumAI today!
sisyphus bar and grill@itunpredictable

In AI, audio is the one area where smaller labs continue to win, beating big labs with 10x+ the budget and team. @GradiumAI + @kyutai_labs built the first realtime voice model a year before OpenAI with a tiny team (and it’s open). What are the dynamics behind this? A few ideas:

1. Audio has a status problem
If you’ve spoken to Siri lately you know how terrible the typical audio AI experience is compared to text. Why are we only now starting to see meaningful advances in audio AI, while text has been rapidly improving every single year since 2020? This problem is actually foundational. For years audio has occupied the bottom tier of AI/ML’s informal coolness hierarchy. For several reasons audio just wasn’t sexy. There are also practical reasons: training data for audio is genuinely scarce compared to text. You can scrape trillions of tokens from Wikipedia, Stack Overflow, books, and papers. High-quality conversational audio is harder to come by, and much of it isn’t particularly informative.

2. Audio requires incredibly specific domain expertise
Audio is a completely different beast than text. It is not just about scaling compute and data. There are a million little edges to creating elite audio models, from correct turn-taking to backchanneling and managing latency, that require deep domain expertise. Great audio models are trained by great audio researchers, and throwing money at the problem will only get you mediocrity. Your bitter lesson has no power here!! Kyutai’s Moshi model has 7B parameters and was trained on 2.1T tokens. Llama 3.1 has 405B parameters trained on 15T tokens—that’s orders of magnitude of difference in cost. This is actually the story of Kyutai. A small group of researchers in Paris like @neilzegh and @honualx were some of the only people at Google Brain / Meta working on voice, cooking in relative obscurity in their underfunded audio divisions. They and a few others started Kyutai in 2023, the first and only open audio lab, named for the Japanese word for “sphere” (which we love).

3. SOTA audio models are built directly on research ideas, not just scale
The Gradium team has published years of inventive research that powers their leading models today. There are a lot of foundational ideas here, like SoundStream, a neural codec that can compress speech, music, and general audio at bitrates normally targeted by speech-only codecs. Then there’s the full duplex architecture that Kyutai pioneered, which finally enabled actually real-time audio. The idea is to model both streams simultaneously, which one-shotted the turn-taking problem that had stumped researchers for years.

Point is, throwing compute at the problem doesn’t work. Underlying SOTA audio models are novel research ideas that form the entire basis for the models, not fringe additions like in text. IMO Gradium is one of the most interesting companies building models today. You can read more about them in the lengthy (sorry) post linked in the reply.

0 · 3 · 5 · 1.8K
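The cost gap in the quoted post can be made concrete with the usual 6·N·D approximation for dense-transformer training FLOPs (an assumption here, not a published budget), using the parameter and token counts the post cites; the 3 kbps codec figure is likewise an assumed SoundStream-style operating point:

```python
# Back-of-the-envelope numbers for the claims in the quoted post.
# 6*N*D is the standard dense-transformer training-FLOPs approximation (assumed);
# the 3 kbps bitrate is an assumed SoundStream-style operating point.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

moshi = train_flops(7e9, 2.1e12)       # ~8.8e22 FLOPs
llama31 = train_flops(405e9, 15e12)    # ~3.6e25 FLOPs
print(f"Llama 3.1 / Moshi training compute: ~{llama31 / moshi:.0f}x")  # ~413x

# Neural-codec compression: 24 kHz, 16-bit mono PCM vs. a 3 kbps codec.
pcm_kbps = 24_000 * 16 / 1_000         # 384 kbps
codec_kbps = 3
print(f"Compression ratio: ~{pcm_kbps / codec_kbps:.0f}x")  # ~128x
```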
Gradium@GradiumAI·
Read this blog post about the story of @kyutai_labs and Gradium to understand why voice is the modality where being small, focused, and fast more than compensates for limited compute and headcount. Small teams can outcompete giants. If this excites you, join us.
sisyphus bar and grill@itunpredictable

[Same post as quoted in full above.]

1 · 1 · 10 · 1.4K
Gradium retweeted
Constance Grisoni@ConstanceGriso·
Proud day for Team @GradiumAI yesterday. Gradium was invited to the anniversary dinner of the AI summit. Thank you to President @EmmanuelMacron for inviting team Gradium and me. At Gradium, we are building the best voice AI models. Time to scale and continue our acceleration.
Constance Grisoni tweet media
0 · 2 · 24 · 1.4K
Gradium@GradiumAI·
Building real-time voice AI? This post explains how Gradium gets the most out of NVIDIA GPUs, with concrete techniques for balancing quality vs. latency, hitting sub-300ms TTFA (time to first audio), and avoiding audio skips in production. Learn what actually matters at inference time. Read it and try it yourself at gradium.ai/blog/optimizin…
Gradium tweet media
7 · 7 · 55 · 5.6K
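TTFA is simple to measure for any streaming client. A minimal sketch of checking a 300 ms TTFA budget; the `synthesize_stream` stub below simulates a streaming TTS call and is not Gradium's actual SDK:

```python
import time
from typing import Callable, Iterator

def synthesize_stream(text: str) -> Iterator[bytes]:
    """Stand-in for a streaming TTS call; simulates work, then yields PCM chunks."""
    time.sleep(0.12)                       # pretend model + network latency
    for _ in range(10):
        yield b"\x00" * 960                # 20 ms of 24 kHz 16-bit mono silence
        time.sleep(0.02)

def measure_ttfa(synth: Callable[[str], Iterator[bytes]], text: str,
                 budget_ms: float = 300.0) -> float:
    """Time from issuing the request to receiving the first audio chunk, in ms."""
    start = time.perf_counter()
    next(synth(text))                      # block until the first chunk arrives
    ttfa_ms = (time.perf_counter() - start) * 1_000
    if ttfa_ms > budget_ms:
        print(f"warning: TTFA {ttfa_ms:.0f} ms exceeds the {budget_ms:.0f} ms budget")
    return ttfa_ms

print(f"TTFA: {measure_ttfa(synthesize_stream, 'Hello from a voice agent'):.0f} ms")
```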
Gradium@GradiumAI·
If you still haven't thought of a Valentine's Day gift for your partner, check out this fun app built by @BhosalePratim: it records your message, clones your voice, and creates a Victorian-era Valentine's message for your loved one.
Pratim🥑@BhosalePratim

Every Valentine's, I build something. This year, I built a Victorian-era Valentine video generator, so that just like the characters from Bridgerton, I can declare my love with passion. Built using @GradiumAI, @veedstudio Fabric and @fal. Launching tomorrow!

3 · 3 · 11 · 2.1K
Gradium@GradiumAI·
Introducing the Gradium Startup Program: 6 months free on our M Plan, with 9M monthly credits for STT and TTS, 10 concurrent calls, commercial rights, pro voice cloning, direct support from our engineering team, and even early access to our latest research previews (already including a world premiere 👀).
Gradium tweet media
2 · 4 · 20 · 7.1K