Desh Raj

2.1K posts

Desh Raj

@rdesh26

Speech + LLMs @nvidia | Previously: @Meta MSL, @jhuclsp, @IITGuwahati

New York, NY · Joined September 2009

1.8K Following · 4K Followers

Pinned Tweet
Desh Raj
Desh Raj@rdesh26·
I’m happy to share that I’m starting a new position as Senior Research Scientist at @nvidia! Looking forward to open science for speech full-duplex models :)
Desh Raj@rdesh26

After 2 wonderful years, I left Meta this week. During this time, I worked on several projects related to speech and LLMs:
- Built the first multi-channel audio foundation model with M-BEST-RQ (arxiv.org/abs/2409.11494)
- Made ASR with SpeechLLMs faster (arxiv.org/abs/2409.08148) and more accurate (ieeexplore.ieee.org/document/10890…)
- Shipped the first production-ready full-duplex voice assistant (about.fb.com/news/2025/04/i…)
- Improved Moshi's reasoning capability with chain-of-thought (arxiv.org/abs/2510.07497)

I am grateful to my managers for having my back on critical projects, and fortunate to have collaborated with several brilliant researchers and engineers during this time. As to what's next, I am still in NYC and continuing to do speech research. More on that later!

English
52
11
523
29.4K
Desh Raj
Desh Raj@rdesh26·
Nice post! Here are some things you can change to improve the model (in increasing order of complexity):
1) Remove the RVQ. You can simply project the continuous embeddings to the vocab size.
2) Use LibriSpeech (16kHz) instead of LJSpeech (22.05kHz) --> shorter sequences and more speaker diversity.
3) Use BPE tokens instead of characters. A small vocab of 1k should be fine.
4) Use STFT or log-Mel features instead of raw waveforms. Makes it much easier for the model to learn with limited data.
5) Your current model is encoder-only with CTC. This assumes conditional independence of outputs (which is not great for sequence tasks). You can either add shallow fusion with a simple external LM, or change it to a Whisper-style encoder-decoder model.
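Suggestion 5 (shallow fusion with an external LM) can be sketched in a few lines. Everything below is illustrative: the hypotheses and log-probabilities are made up, and the interpolation weight is a hypothetical choice, not a recommended value.

```python
# Toy shallow fusion: rescore candidate transcripts by combining
# acoustic (CTC) log-probs with an external language model's log-probs.
# All scores below are invented illustrative numbers, not real model outputs.

def shallow_fusion_score(acoustic_logp, lm_logp, lm_weight=0.3):
    """Combined score: log P_ctc(y|x) + lambda * log P_lm(y)."""
    return acoustic_logp + lm_weight * lm_logp

candidates = {
    # hypothesis: (CTC log-prob, external LM log-prob)
    "i scream":  (-2.1, -8.0),   # acoustically likely, odd language
    "ice cream": (-2.3, -3.0),   # slightly worse acoustics, more fluent
}

best = max(candidates, key=lambda y: shallow_fusion_score(*candidates[y]))
print(best)  # the LM tips the balance toward the fluent hypothesis
```

The point of the weight is exactly the conditional-independence fix described above: the LM adds the cross-token dependencies that CTC alone cannot model.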
English
1
0
13
299
Mayank Pratap Singh
Mayank Pratap Singh@Mayank_022·
I coded a Speech-to-Text model from scratch. Here is the blog: blogs.mayankpratapsingh.in/chapters/speec…

No APIs. No pre-trained models. Just PyTorch, an A100 GPU, and hours of debugging.

This started months ago. I wanted to understand how machines hear. Not surface-level understanding. I wanted to build the whole thing myself. So I built it piece by piece: autoencoders, VAEs, VQ-VAEs, Residual Vector Quantization, and CTC loss. Each one took days to get right.

Trained for 3 hours on 13,100 audio clips. Got complete garbage. Changed the tokenizer from BPE to character-level. Rechecked everything. Asked @neural_avb, who has built STT models before. His answer: these models are tricky to train and need days of compute, not hours. Cut the dataset to 200 clips. After 2 hours, actual words appeared. Overfitted? Absolutely. But watching noise turn into recognizable English was satisfying.

The blog covers my process:
- Audio fundamentals and waveform representation
- Why attention breaks on raw audio
- Convolutional downsampling
- Transformer encoder with positional encoding
- Vector Quantization, straight-through estimator, and RVQ
- CTC loss and greedy decoding
- Full training loop with VQ loss warmup
- What went wrong and what finally worked

Resources:
- Blog: blogs.mayankpratapsingh.in/chapters/speec…
- Code: github.com/Mayankpratapsi…

More resources:
- CTC loss: distill.pub/2017/ctc/
- @neural_avb videos: youtube.com/@avb_fj
- SoundStream paper: arxiv.org/abs/2107.03312
- LJ Speech dataset: keithito.com/LJ-Speech-Data…
- wav2vec paper: arxiv.org/abs/2006.11477
- RVQ blog: drscotthawley.github.io/blog/posts/202…

Next up: I've already trained two TTS architectures from scratch. Video post about those coming soon. But first, I'm dropping a visual breakdown of Vision Transformers, covering how they work and how to fine-tune them.

Follow me @Mayank_022 if you're into audio deep learning. Repost so others can find this.
English
22
36
458
19.5K
Desh Raj reposted
Adele Bloch
Adele Bloch@adele_bloch·
how to life maxx more:
> get off your phone
> say yes to spontaneous plans even when you're tired - some of the best nights are unplanned
> talk to strangers - at coffee shops, events, literally anywhere. serendipity maxx
> make a bucket list and work your way through said bucket list!!
> stop opting for boring hangs. switch things up with your friends. try something new!!
> start a random hobby just for fun - pottery, dance, improv, cooking. not everything needs to "be productive" ok??
> be 5% more silly in your life. dance in your room, sing badly in the car, crack a bad joke. it's not that serious. grow the silly muscle
> surround yourself with people who make you feel lighter - your time and energy is precious
> don't forget the basics: move your body, get sunlight, take your vitamins, eat well, sleep
> your time to live life is happening NOW so stop saving it for later!!
we forgot that life is supposed to be FUN ya'll lets go PLAY!!!
Freyy@Freyy_is

when i say go outside i mean pottery classes, open mic nights, pilates, long walks, game nights, art exhibits, botanical gardens, book signings, sound baths, massages, gun ranges, outdoor cafes, jazz lounges. that kind of outside.

English
12
326
3.3K
137.9K
Desh Raj reposted
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
NVIDIA has released Nemotron 3 VoiceChat! A ~12B parameter Speech to Speech model that leads our open weights Conversational Dynamics vs. Speech Reasoning pareto frontier.

Understanding Speech to Speech model performance is multidimensional - two key and distinct dimensions are raw intelligence and conversational dynamics: how well a model handles the natural rhythms of human conversation, such as turn-taking and interruptions. Amongst full duplex open weights models, NVIDIA's new Nemotron 3 VoiceChat, V1, leads in balancing these dimensions, setting itself apart from other models on the Conversational Dynamics vs. Speech Reasoning pareto frontier.

Key benchmarking results:
➤ Conversational Dynamics (Full Duplex Bench): Nemotron 3 VoiceChat (V1) scores 77.8%, second among open weights speech to speech models behind NVIDIA's own PersonaPlex (91.0%) and ahead of FLM-Audio (62.0%), Moshi (61.0%) and Freeze-Omni (58.7%)
➤ Speech Reasoning (Big Bench Audio): Nemotron 3 VoiceChat (V1) scores 29.2%, second among open weights speech to speech models behind Freeze-Omni (33.9%) and well ahead of PersonaPlex (12.6%), FLM-Audio (5.3%) and Moshi (1.7%)
➤ Pareto leader: While Freeze-Omni leads on speech reasoning and PersonaPlex leads on conversational dynamics, Nemotron 3 VoiceChat (V1) is the only open weights model that performs amongst the top 3 on both - making it the clear leader on the pareto frontier between these two critical dimensions
➤ Larger than other open weights models but still relatively small compared to LLMs: Nemotron 3 VoiceChat (V1) has 12B parameters, making it one of the larger open weights speech to speech models, while NVIDIA's PersonaPlex is ~7B. Still, it remains relatively small compared to leading LLMs
➤ Context vs. proprietary models: While this release materially advances open weights performance, open weights speech to speech models still significantly underperform leading proprietary offerings. For comparison, proprietary models on our Big Bench Audio benchmark score substantially higher - Step-Audio R1.1 at 96%, Grok Voice Agent at 92%, Gemini 2.5 Flash (Thinking) at 92%, and Nova 2.0 Sonic at 87%. The gap between open weights and proprietary remains large in this modality.

As the capability and adoption of Speech to Speech models increases, we expect to expand our set of benchmarks to include elements such as tool-calling and multi-turn instruction following. See more details below ⬇️
English
13
41
321
47.6K
Desh Raj reposted
Bryan Catanzaro
Bryan Catanzaro@ctnzr·
Announcing NVIDIA Nemotron 3 Super!
💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights
Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…
And yes, Ultra is coming!
English
62
205
1.2K
200.9K
Desh Raj
Desh Raj@rdesh26·
@Uber did you know that all of your drivers in Jaipur/Udaipur ask for 2x the displayed cost and ride off-app?
English
0
0
1
413
Desh Raj
Desh Raj@rdesh26·
@JulianSlzr Yeah, the interchangeable use of "models" v/s "systems" is quite annoying to me TBH. x.com/rdesh26/status…
Desh Raj@rdesh26

Voice AI has grown a lot recently, and definitions of models/systems have become somewhat vague. Let's put down some basics.

1. AI "models" are not AI "systems". Models are the core units that build up a system. For text-only systems, the two are trivially equivalent (discounting the BPE tokenizer/detokenizer), but not for voice. For voice AI systems, examples of models may be ASR, TTS, LLM, SpeechLLM, OmniLLM, etc.

2. A model is the smallest replaceable unit within a system. For example, an STT model (user speech in / agent text out) often contains a speech encoder + an LLM, but neither of these components can be replaced without having to train the model again.

3. A speech-to-speech "system" (often called a voice agent) may take many forms and comprise many components, but it is always based on two requirements: (A) response generation --> what/how to respond; (B) duplex control --> when to talk. Traditionally, (A) has been handled through an ASR/LLM/TTS cascade. Most of the current S2S modeling research aims to replace this pipeline with fewer models (either STT+TTS or S2S). Most systems still rely on external VADs and WebRTC for (B), with the famous exception of "full-duplex" models like Moshi.

4a. A SpeechLLM is a model that takes text+speech input, but only generates text output. It is also called a "speech understanding" model.

4b. An OmniLLM is a SpeechLLM that also generates speech (either codecs or continuous latents). It is also called a "speech generation" model (not to be confused with a TTS).

5. A speech-to-speech system is considered "realtime" if it satisfies 3 conditions: low latency (< 1s), streaming audio in/out, and barge-in/interruption handling. It can also be called a full-duplex system (not to be confused with a full-duplex "model").
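Point 3's traditional cascade can be sketched as three swappable components plus a VAD for duplex control. Every function below is a hypothetical stand-in with canned output, meant only to show the data flow, not any real model or API.

```python
# Minimal cascade voice agent: ASR -> LLM -> TTS for response generation,
# external VAD for duplex control. All components are dummy stand-ins.

def asr(audio: bytes) -> str:           # speech in, text out
    return "what's the weather"

def llm(text: str) -> str:              # text in, text out
    return f"Reply to: {text}"

def tts(text: str) -> bytes:            # text in, speech out
    return text.encode()

def vad_end_of_turn(audio: bytes) -> bool:   # duplex control: when to talk
    return audio.endswith(b"<silence>")      # toy end-of-turn heuristic

def voice_agent_turn(audio: bytes):
    if not vad_end_of_turn(audio):      # user still speaking: stay quiet
        return None
    return tts(llm(asr(audio)))         # respond once the turn ends

print(voice_agent_turn(b"...<silence>"))
```

Because each stage has a plain text/audio interface, any one of them is a "smallest replaceable unit" in the sense of point 2, which is exactly what distinguishes a cascade from an end-to-end S2S model.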

English
0
0
1
164
Julian Salazar
Julian Salazar@JulianSlzr·
@rdesh26 Thanks Desh! Since then, how others ended up using the term has made me disambiguate full-duplex into: - a system/experience: ongoing listening and reaction - an end-to-end architecture: model that always conditions and produces audio streams
English
3
0
1
166
Desh Raj
Desh Raj@rdesh26·
This ~1.5-year-old thread from @JulianSlzr is the most precise way to think about speech-to-speech models. The best "full-duplex" models these days are usually chunk-wise time-multiplexed (rather than multi-stream like Moshi). In Julian's terminology, they lie somewhere between "turn-taking" and true "full-duplex". This means that after every K user audio tokens, the model gets a chance to generate some response tokens. The choice of K then becomes a critical design decision: a smaller K improves responsiveness (e.g. the model can respond to interruptions faster), but increases the number of prefills needed during a session. Here is an example of such a system, MiniCPM-o-4.5:
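The chunk-wise time-multiplexed layout described above can be sketched as a simple interleaving: after every K user tokens comes a fixed slot of R agent tokens. Token contents here are placeholders; real systems differ in how the agent slots are sized and filled.

```python
# Sketch of chunk-wise time-multiplexing: [K user] [R agent] [K user] ...
# Tokens are tagged tuples so the interleaving pattern is easy to inspect.

def interleave(user_tokens, K=4, R=2):
    """Build the multiplexed sequence of (stream, token) pairs."""
    seq = []
    for i in range(0, len(user_tokens), K):
        seq.extend(("U", t) for t in user_tokens[i:i + K])          # user chunk
        seq.extend(("A", f"resp{i // K}.{j}") for j in range(R))    # agent slot
    return seq

seq = interleave(list(range(8)), K=4, R=2)
# A smaller K inserts agent slots more often (faster reaction to barge-in),
# at the cost of more prefill boundaries per session.
print(seq)
```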
Julian Salazar@JulianSlzr

Note: I only speculate from announcements/demos. Views mine, not GDM's. My definitions: - e2e = a model operating directly on audio (tokens) - turn-based = audio sequences in, audio sequences out - full-duplex = audio outputs are always conditioned on latest inputs [2/n]

English
1
0
57
5.1K
Desh Raj
Desh Raj@rdesh26·
A great write-up about the token-and-duration transducer, and why it makes NVIDIA's ASR models faster than anyone else!
Speechmatics@Speechmatics

The @HuggingFace Open ASR Leaderboard RTFx column is dominated by one model family. 😯 The mechanism is a modified forward-backward algorithm. 👇
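A hedged sketch of why token-and-duration transducer (TDT) decoding is fast: each greedy step emits a token plus a duration, and the decoder jumps ahead by that many frames instead of advancing one frame at a time. The "joint network" below is a made-up lookup table, not NVIDIA's implementation.

```python
# Toy greedy TDT decode: predict (token, duration) per step, skip `duration`
# encoder frames. Fewer decoder steps than a frame-by-frame transducer.

BLANK = "_"

# Hypothetical joint-network outputs: frame index -> (token, duration).
toy_joint = {0: ("h", 2), 2: ("i", 3), 5: (BLANK, 3)}

def tdt_greedy_decode(num_frames, joint=toy_joint):
    tokens, t, steps = [], 0, 0
    while t < num_frames:
        token, dur = joint.get(t, (BLANK, 1))
        if token != BLANK:
            tokens.append(token)
        t += max(dur, 1)    # jump ahead by the predicted duration
        steps += 1
    return "".join(tokens), steps

text, steps = tdt_greedy_decode(8)
print(text, steps)  # 3 decoder steps cover 8 frames
```

With durations mostly > 1, the number of decoder calls scales with the number of emitted tokens rather than the number of frames, which is the RTFx advantage the post refers to.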

English
1
0
18
1.6K
Neil Zeghidour
Neil Zeghidour@neilzegh·
Welcome Pratim! A new Gradiumite/Gradical/Grisotope (not sure what to call ourselves yet) joins the team.
Pratim🥑@BhosalePratim

I have joined @GradiumAI as their Lead Developer Advocate. Gradium builds voice AI models and infrastructure to support all voice applications. I'll be working on taking the developer experience and the developer community to the next level. Let's get to work now. PS: We're hiring!

English
2
0
11
1.5K
Desh Raj
Desh Raj@rdesh26·
Last week: Anthropic bad, see their "Chinese distillation" post
This week: Anthropic good, see their DoW stance
Me (continuing to use Claude Code): oh yeah for sure
English
2
0
10
862
Desh Raj
Desh Raj@rdesh26·
@awnihannun Congrats on MLX, and excited to see what you do next!
English
0
0
1
287
Awni Hannun
Awni Hannun@awnihannun·
Today is my last day at Apple. Building MLX with our amazing team and community has been an absolute pleasure. It's still early days for AI on Apple silicon. Apple makes the best consumer hardware on the planet. There's so much potential for it to be the leading platform for AI. And I'm confident MLX will continue to have a big role in that. To the future: MLX remains in the exceptionally capable hands of our team including @angeloskath, @zcbenz, @DiganiJagrit, @NasFilippova, @trebolloc (and others not on X). Follow them or @shshnkp for future updates.
English
260
94
2.2K
396.1K
Desh Raj
Desh Raj@rdesh26·
@girlknowstech Unfortunately my X anniversary was way before it became X.
English
0
0
1
90
Desh Raj
Desh Raj@rdesh26·
Not my dad sending me this article from a Hindi newspaper about Ruoming Pang leaving his "200 million Meta package" to join OpenAI 👀
English
1
0
7
532
Desh Raj reposted
Dyah Adila 🦄
Dyah Adila 🦄@dyahadila_·
🦄 I recently wrapped up my PhD job search (industry research scientist role) and wrote up what I learned: interview types, prep materials, and how a spreadsheet saved my sanity. Hope it helps if you're going through it too!🔗dyahadila.github.io/blog/2026/indu…
English
13
39
536
41.2K
Desh Raj reposted
Alexandre Défossez
Alexandre Défossez@honualx·
If you are doing a PhD on generative AI, model alignment, or speech, and want to get at the forefront of speech modeling research, we are opening PhD level internship at @GradiumAI. Come and join the team behind Mimi, Moshi, Hibiki, and PocketTTS. 👉 gradium.homerun.co
English
0
6
32
3.3K
Desh Raj
Desh Raj@rdesh26·
@NJTRANSIT app says train 7833 at 10:12 AM is cancelled, but it says "on time" at the station.
English
0
0
0
512
Desh Raj
Desh Raj@rdesh26·
All of my feed today is about Taalas, and for good reason. 16k tokens per second is absolutely mindblowing 🤯
English
0
0
7
768