Desh Raj

2.2K posts


@rdesh26

Speech + LLMs @nvidia | Previously: @Meta MSL, @jhuclsp, @IITGuwahati

New York, NY · Joined September 2009
1.9K Following · 4.3K Followers
Pinned Tweet
Desh Raj @rdesh26
I’m happy to share that I’m starting a new position as Senior Research Scientist at @nvidia! Looking forward to open science for full-duplex speech models :)
Desh Raj @rdesh26

After 2 wonderful years, I left Meta this week. During this time, I worked on several projects related to speech and LLMs:
- Built the first multi-channel audio foundation model with M-BEST-RQ (arxiv.org/abs/2409.11494)
- Made ASR with SpeechLLMs faster (arxiv.org/abs/2409.08148) and more accurate (ieeexplore.ieee.org/document/10890…)
- Shipped the first production-ready full-duplex voice assistant (about.fb.com/news/2025/04/i…)
- Improved Moshi’s reasoning capability with chain-of-thought (arxiv.org/abs/2510.07497)
I am grateful to my managers for having my back on critical projects, and fortunate to have collaborated with several brilliant researchers and engineers during this time.
As to what's next, I am still in NYC and continuing to do speech research. More on that later!

Desh Raj reposted
Packy McCormick @packyM
It is 70 and sunny in New York City, we’re heading to a kite festival, and I haven’t heard the words “agent” or “token” once all morning. Greatest city in the world.
Deedy @deedydas

The vibes in SF feel pretty frenetic right now. The divide in outcomes is the worst I've ever seen. Over the last 5yrs, a group of ~10k people - employees at Anthropic, OpenAI, xAI, Nvidia, Meta TBD, founders - have hit retirement wealth of well above $20M (back of the envelope AI estimation). Everyone outside that group feels like they can work their well-paying (but <$500k) job for their whole life and never get there. Worse yet, layoffs are in full swing. Many software engineers feel like their life's skill is no longer useful. The day to day role of most jobs has changed overnight with AI. As a result,
1. The corporate ladder looks like the wrong building to climb. Everyone's trying to align with a new set of career "paths": should I be a founder? Is it too late to join Anthropic / OpenAI? should I get into AI? what company stock will 10x next? People are demanding higher salaries and switching jobs more and more.
2. There’s a deep malaise about work (and its future). Why even work at all for “peanuts”? Will my job even exist in a few years? Many feel helpless. You hear the “permanent underclass” conversation a lot, esp from young people. It's hard to focus on doing good work when you think "man, if I joined Anthropic 2yrs ago, I could retire"
3. The mid to late middle managers feel paralyzed. Many have families and don't feel like they have the energy or network to just "start a company". They don't particularly have any AI skills. They see the writing on the wall: middle management is being hollowed out in many companies.
4. The rich aren’t particularly happy either. No one is shedding tears for them (and rightfully so). But those who have "made it" experience a profound lack of purpose too. Some have gone from <$150k to >$50M in a few years with no ramp. It flips your life plans upside down. For some, comparison is the thief of joy. For some, they escape to NYC to "live life". For others still, they start companies "just cuz", often to win status points. They never imagined that by age 30, they'd be set. I once asked a post-economic founder friend why they didn't just sell the co and they said "and do what? right now, everyone wants to talk to me. if i sell, I will only have money."
I understand that many reading this scoff at the champagne problems of the valley. Society is warped in this tech bubble. What is often well-off anywhere else in the world is bang average here. Unlike many other places, tenure, intelligence and hard work can be loosely correlated with outcomes in the Bay. Living through a societally transformative gold rush in that environment can be paralyzing. "Am I in the right place? Should I move? Is there time still left? Am I gonna make it?" It psychologically torments many who have moved here in search of "success". Ironically, a frequent side effect of this torment is to spin up the very products making everyone rich in hopes that you too can vibecode your path to economic enlightenment.

Desh Raj reposted
Bryan Catanzaro @ctnzr
We've gone even farther: Nemotron 3 Super is 120B and pretrained on 25T tokens in NVFP4. Nemotron 3 Ultra is ~500B and also pretrained in NVFP4. Accelerated computing means we rethink every aspect of the AI stack looking for new opportunities to improve efficiency.
How To AI @HowToAI_

NVIDIA has done the impossible and nobody's talking about it. They trained a 12 BILLION parameter LLM in 4-bit precision on 10 trillion tokens.
For years, the AI industry has been stuck. If you wanted to train a world-class AI, you had to use 16-bit or 8-bit precision. Going lower to 4-bit was a death sentence for the model. It would become unstable, "hallucinate" its own math, and eventually collapse.
But NVIDIA proved that "impossible" was just a math problem. They used a new format called NVFP4. Instead of a standard, rigid structure, NVFP4 uses "micro-scaling." It groups numbers into tiny blocks and applies individual scaling factors to each one. It’s like giving the AI a pair of high-definition glasses for its own data, allowing it to see fine details even with 75% less memory.
The result is a total paradigm shift:
- 2× to 3× faster arithmetic performance.
- 50% reduction in memory usage.
- Near-zero loss in intelligence.
The researchers compared the 4-bit model against a massive 8-bit baseline. The curves are identical. On MMLU, GSM8K, and coding benchmarks, the "tiny" 4-bit version performed within 0.1% of the more expensive model.
This is an economic earthquake. Training a frontier model used to require tens of thousands of GPUs and months of time. NVIDIA just showed we can get the same results with half the hardware and a fraction of the electricity.
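To make the "micro-scaling" idea above concrete, here is a rough numpy sketch of block-wise 4-bit quantization with per-block scale factors. The block size, the E2M1 value grid, and the scale handling are illustrative assumptions, not NVIDIA's actual NVFP4 recipe.

```python
import numpy as np

# Values representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(x, block_size=16):
    """Simulate micro-scaled 4-bit quantization: each small block of values
    gets its own scale factor, then every element snaps to the FP4 grid."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Per-block scale maps the block's largest magnitude onto the grid max (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0

    scaled = blocks / scales
    # Snap each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quant = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize (multiply the scales back in) to inspect the rounding error.
    return (quant * scales).reshape(-1)[: len(x)]

weights = np.random.randn(64).astype(np.float32)
recon = quantize_nvfp4_like(weights)
print("max abs error:", np.abs(weights - recon).max())
```

The per-block scales are what let such a coarse 4-bit grid track both tiny and large values inside the same tensor.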

Desh Raj @rdesh26
It seems LLM folks have finally discovered multi-stream models. Let me introduce you to array-geometry-invariant multi-channel multi-speaker ASR.
Desh Raj @rdesh26
Here is the conversation in a gist: gist.github.com/desh2608/5d601… TL;DR: I gave the Thinky blog link and asked it to draw the illustrations in a similar design. For the first animation, I iterated on the design a few times (asking it to render with Python code instead of HTML, etc.). Once the design was satisfactory, the rest of the animations went faster since I only had to explain what I wanted to show. I didn't check the code much, but it looks like it only used packages from the Python standard library. Perhaps using something like manim would be prettier.
Xutai Ma @xutai_ma
Great post. Totally agree with the ending: it’s really the full-duplex experience that matters over the full-duplex architecture, and speech researchers often focus too much on the latter. On a totally different topic, how did you prompt Codex to get the animations? Any tools used?
Vipul Gupta @vipul_1011
First time in NYC: Walked into a coffee shop and all 5 people around me are talking about AI
Desh Raj @rdesh26
Hi Ruixiang, great work on the dMel paper and nice to see training-free representations being used at large scale! Two questions that I was hoping to get your thoughts on: (1) I understand that discretizing the Mel filterbanks makes it easier to do speech generation (cross-entropy + top-K sampling). But why use dMel rather than just the continuous Mel for input representations? (2) Mel is biased towards human perception of speech. For general audio input, maybe it's better to use STFT?
Desh Raj @rdesh26
I guess I coined the term "interaction model"
Desh Raj @rdesh26

@krandiash Great analogy! I have a similar mental model of how things will turn out: a small "interaction" model which is voice-native and can operate in real-time, and a larger "problem-solving" model which can do reasoning etc. and operate asynchronously.

Desh Raj @rdesh26
I think streaming video input is a big plus, but for somewhat different reasons. IMO it may be better to use video to communicate environmental context to the model (e.g., showing a half-baked pie in the oven to ask how much longer I should let it stay) rather than to aid in device-directedness, which can be inferred with high accuracy from the voice input + semantics.
Sathvik Udupa @SathvikUdupa
@rdesh26 Conferencing setups, as well as directed vs. third-party speech (arxiv.org/pdf/2601.05564). Video provides a lot more information for diarization, gaze, and localisation, so I wonder how far we can go with only streaming speech
Desh Raj @rdesh26
7/ Regarding the frontend/backend design (what they call the "interaction" and "background" models):
(i) How do you teach the frontend model when to defer to the backend? LLMs famously have problems knowing what they don't know. For a certain class of queries like date/time, news/stock/weather, etc., this is easily determined and can be trained in SFT. But it becomes quite fuzzy as you get into questions like "what was the favorite movie of the 38th US president". Always deferring to the backend increases time to task completion (not desirable), but not deferring enough runs the risk of fumbling the response.
(ii) How do you seamlessly integrate the backend context into the agent audio stream? From an inference perspective, this is just an extra prefill step, but post-training this capability requires very careful construction of token sequences, since the prefill can happen at any point during the frontend model's response (a toy sketch of this flow follows below).
Desh Raj @rdesh26

x.com/i/article/2054…
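A toy control-flow sketch of point (ii) above: the backend's answer arrives asynchronously and gets folded into the frontend model's cache as an extra prefill, after which decoding continues from the same state. All class names, token strings, and the arrival step here are hypothetical; this only illustrates the inference-time view, not any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FrontendModel:
    # Stands in for a real KV cache; here it is just a list of token strings.
    kv_cache: List[str] = field(default_factory=list)

    def prefill(self, tokens: List[str]) -> None:
        # Extend the cache with context tokens without emitting anything.
        self.kv_cache.extend(tokens)

    def decode_one(self, step: int) -> str:
        # Dummy next-token step; a real model would attend over kv_cache here.
        return f"tok{step}"

def backend_result_ready(step: int) -> Optional[List[str]]:
    # Pretend the slower "background" model returns its answer at step 3.
    return ["<ctx>", "weather:", "sunny", "</ctx>"] if step == 3 else None

model = FrontendModel()
model.prefill(["<user>", "what's", "the", "weather", "</user>"])

response = []
for step in range(8):
    ctx = backend_result_ready(step)
    if ctx is not None:
        # The asynchronous backend answer lands mid-response: fold it into the
        # cache as an extra prefill, then keep decoding from the same state.
        model.prefill(ctx)
    tok = model.decode_one(step)
    model.kv_cache.append(tok)
    response.append(tok)

print(" ".join(response))
```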

Desh Raj @rdesh26
@adelmoumen_ That paper has been on my reading list for a while now. Thanks for sharing, will bump it up and get around to it this week!
Adel @adelmoumen_
@rdesh26 But here I am speculating a lot. I think arxiv.org/pdf/2504.07951… is quite nice as they show that early fusion is more compute-efficient than late fusion. Appendix B.7 shows that using a pre-trained vision encoder/LLM gives a boost, but that the gap can be closed with sufficient training.
Desh Raj @rdesh26
@SathvikUdupa Do you mean a conferencing style setup with multiple users and 1 agent?
Sathvik Udupa @SathvikUdupa
@rdesh26 Nice writeup! Yeah, I see a 1-2 second latency in some demos such as the first one; I imagine the timing can be calibrated with prompting. Simultaneous speech generation with multi-/background-speaker capabilities is quite interesting. Do you think it can be done without video?
Desh Raj @rdesh26
Great question! They don't talk much about their post-training data/recipe (for obvious reasons), but in my experience, it takes both supervised fine-tuning and RL.
- SFT data is usually a mix of synthetic conversations (to teach instruction following) and real dialog (to retain naturalness). The goal in this stage is to build a model that can respond in a variety of ways.
- The RL stage is great for tuning the behavior on things like interruption handling or prompt adherence.
Notably, the turn-taking behavior should not be "forced" (i.e., 180ms response latency is not always the right thing to do), but should be inferred from context. So it's hard to teach it through human conversations alone. Also, real human speech contains a lot of disfluencies that we don't want the model to learn.
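As a minimal illustration of the SFT mixture point, here is a hypothetical sketch of weighted sampling over the two data sources mentioned above; the source names and weights are made up and not taken from any actual recipe.

```python
import random

# Hypothetical SFT data mixture: synthetic conversations for instruction
# following, real dialog for naturalness. Weights are illustrative only.
SFT_MIX = {
    "synthetic_conversations": 0.6,
    "real_dialog": 0.4,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next SFT example according to the mixture."""
    r = rng.random()
    cumulative = 0.0
    for source, weight in SFT_MIX.items():
        cumulative += weight
        if r < cumulative:
            return source
    return source  # guard against floating-point edge cases

rng = random.Random(0)
counts = {name: 0 for name in SFT_MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the mixture weights
```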
Hardik Chauhan @backpropagater
@rdesh26 Great blog :) The interactivity evals feel very forward-looking. Do you think imitation from human conversations is enough for interruption/overlap/repair, or do we need objectives that reward conversational timing directly?
Desh Raj @rdesh26
Yeah there's growing interest in encoder-less setups, partially because it natively unifies understanding and generation (cf. the Tuna-2 paper: arxiv.org/abs/2604.24763). The question is not whether it can be done (of course it can, the decoder is much larger than the encoder), but rather why waste precious decoder training compute on learning trivial mappings from input space to high-dim semantic representations. But maybe this is less of a concern now if audio is a first-class citizen for foundation model training.
Adel @adelmoumen_
@rdesh26 My two cents on the use of dMel: 1. arxiv.org/pdf/2504.07951… shows that for image+text, you don't need late fusion (i.e., a big encoder); you could just do "early fusion": fuse everything at the input layer. So for speech, it would mean there's no need for a big speech encoder, just fuse as early as possible.
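A minimal numpy sketch of what "fuse everything at the input layer" could look like for speech: no dedicated speech encoder, just a linear projection of (patched) mel frames into the decoder's embedding space, concatenated with the text token embeddings. The dimensions, mel frontend, and patching factor are assumptions for illustration, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
n_mels = 80
patch = 4  # stack a few mel frames per "audio token" to reduce sequence length

# Pretend inputs: 2 seconds of mel features (~100 frames/s) and 5 text tokens.
mel = rng.standard_normal((200, n_mels)).astype(np.float32)
text_ids = np.array([7, 42, 3, 99, 12])

# Learnable-in-spirit parameters, random here.
W_audio = rng.standard_normal((n_mels * patch, d_model)).astype(np.float32) * 0.02
text_embed = rng.standard_normal((1000, d_model)).astype(np.float32) * 0.02

# "Early fusion": patchify mel, apply one linear projection, then concatenate
# with the text embeddings so the decoder sees a single mixed sequence.
audio_tokens = mel[: len(mel) // patch * patch].reshape(-1, n_mels * patch) @ W_audio
text_tokens = text_embed[text_ids]
decoder_input = np.concatenate([audio_tokens, text_tokens], axis=0)

print(decoder_input.shape)  # (50 + 5, 512): ready for a standard decoder-only LM
```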
Desh Raj @rdesh26
Suddenly seeing a lot of "voice AI" posts from OpenAI and affiliates. Perhaps a voice mode upgrade is coming soon?
Julia Turc @juliarturc
The enshittification of AI has begun. This girl is just trying to read two measly papers.
> NotebookLM: has been timing out over the past two days, doesn't respond at all
> Claude app: Sorry can't upload files over 30mb
> ChatGPT app: Technically works but a vomit of emojis and sentimentalities even on technical papers.
Grateful for no goblins yet.