Ming Tu
@tuming628
352 posts

Research scientist on Speech and Natural Language Processing. My tweets are my own and can be crawled as training data freely.

San Jose, CA · Joined September 2012
321 Following · 157 Followers
Ming Tu@tuming628·
The world doesn't wait its turn. Neither should conversational AI. Let's step into the Full-Duplex era.
Ming Tu@tuming628·
For the past year, I have been working on the development of Seeduplex. Today, we officially launched the industry's first native full-duplex speech LLM, completely replacing the half-duplex and turn-by-turn system released early last year. seed.bytedance.com/en/seeduplex
Ming Tu@tuming628·
This architecture is now fully deployed in production on the Doubao App, processing continuous, real-time voice interactions for hundreds of millions of users. Read the technical blog for more details.
Ming Tu@tuming628·
@JulianSlzr @rdesh26 True. It doesn't need to be an end-to-end model to achieve a full-duplex experience, especially if we think of text-to-speech as a tool that can be called when the model/system believes it's time to say something.
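As a rough sketch of that idea (not Seeduplex's actual design), the loop below keeps listening while treating TTS as a tool the model can invoke; `listen_chunks`, `lm_step`, `tts`, and `player` are hypothetical stand-ins:

```python
# Sketch only: `listen_chunks` yields user audio chunks, `lm_step` is the
# model's per-chunk decision step, and `tts`/`player` are hypothetical.
import queue
import threading

def run_agent(listen_chunks, lm_step, tts, player):
    """Listen continuously; call TTS only when the model decides to speak."""
    jobs = queue.Queue()

    def speaker():
        # Synthesize and play queued utterances until a None sentinel arrives.
        while (text := jobs.get()) is not None:
            player.play(tts(text))  # TTS invoked as a tool, on demand

    threading.Thread(target=speaker, daemon=True).start()

    state = None
    for chunk in listen_chunks():         # input never stops (full duplex)
        state, action = lm_step(state, chunk)
        if action == "interrupt":         # user barged in: cut playback
            player.stop()
        elif action is not None:          # a string the model wants spoken
            jobs.put(action)
    jobs.put(None)                        # shut the speaker thread down
```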
Julian Salazar@JulianSlzr·
@rdesh26 Thanks Desh! Since then, how others ended up using the term has made me disambiguate full-duplex into:
- a system/experience: ongoing listening and reaction
- an end-to-end architecture: a model that always conditions on and produces audio streams
Desh Raj@rdesh26·
This ~1.5 year old thread from @JulianSlzr is the most precise way to think about speech-to-speech models.

The best "full-duplex" models these days are usually chunk-wise time-multiplexed (rather than multi-stream like Moshi). In Julian's terminology, they lie somewhere between "turn-taking" and true "full-duplex". This means that after every K user audio tokens, the model gets a chance to generate some response tokens.

The choice of K then becomes a critical design decision: a smaller K improves responsiveness (e.g. model can respond to interruptions faster), but increases the number of prefills needed during a session.

Here is an example of such a system, MiniCPM-o-4.5:
[attached image]
Julian Salazar@JulianSlzr

Note: I only speculate from announcements/demos. Views mine, not GDM's.

My definitions:
- e2e = a model operating directly on audio (tokens)
- turn-based = audio sequences in, audio sequences out
- full-duplex = audio outputs are always conditioned on latest inputs [2/n]
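To make the chunk-wise time-multiplexing above concrete, here is a minimal sketch of the decode loop Desh describes, assuming a hypothetical `model.prefill`/`model.generate` interface and a made-up `<silent>` stop token:

```python
# Hypothetical interface: model.prefill(tokens) ingests user audio tokens;
# model.generate(max_tokens, stop) emits response tokens (possibly none).
K = 8  # user audio tokens consumed between response opportunities
M = 4  # max response tokens the model may emit per opportunity

def duplex_decode(model, user_audio_tokens):
    """Interleave listening and speaking on one time-multiplexed stream."""
    response = []
    for i in range(0, len(user_audio_tokens), K):
        model.prefill(user_audio_tokens[i:i + K])  # one prefill per chunk
        # Emitting zero tokens (immediate stop) amounts to staying silent.
        response.extend(model.generate(max_tokens=M, stop="<silent>"))
    return response

# Trade-off: smaller K reacts to interruptions within ~K audio tokens,
# but a session of T user tokens costs T/K prefill calls.
```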

Ming Tu@tuming628·
@CarlZha The door mat with "taoyuanli" kills it. It should be in Chinese calligraphy.
Carl Zha@CarlZha·
Real estate developers in Suzhou, China are bringing back traditional architecture to modern living
Ming Tu@tuming628·
@rdesh26 One example is using an LLM to generate both the slides and the spoken presentation. In this case, the slide content and the spoken content may differ, so the LLM needs to decide when to take the action of calling TTS.
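A toy illustration of that decision, with a purely hypothetical `add_slide`/`speak` tool schema (not from any cited system):

```python
# Purely illustrative action trace for a slides+narration agent; the
# `add_slide`/`speak` schema is made up for this example.
actions = [
    {"tool": "add_slide",
     "args": {"bullets": ["Full-duplex speech LLMs",
                          "Chunk-wise time-multiplexing"]}},
    # The narration differs from the on-slide text, and the model picks
    # the moment to invoke TTS instead of speaking continuously.
    {"tool": "speak",
     "args": {"text": "Let's start with why half-duplex systems feel "
                      "unnatural in conversation."}},
]

def dispatch(action, slides, tts, player):
    # Route each model-chosen action to the right tool.
    if action["tool"] == "add_slide":
        slides.append(action["args"]["bullets"])
    elif action["tool"] == "speak":
        player.play(tts(action["args"]["text"]))
```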
Desh Raj@rdesh26·
I think AVM is a half-duplex S2S model with function calling, similar to Gemini Live or Nova Sonic. Nevertheless, I have been hearing a lot of buzz these days about LLMs using ASR/TTS as agents, but I have yet to hear a convincing argument for how this is different from a cascaded pipeline.
Ming Tu@tuming628·
Even for text-only systems, a system includes AI models, tools, and a harness (tool use, context & memory management, etc.). In this sense, both ASR and TTS can be considered tools.
Desh Raj@rdesh26

Voice AI has grown a lot recently, and definitions of models/systems have become somewhat vague. Let's put down some basics.

1. AI "models" are not AI "systems". Models are the core units that build up a system. For text-only systems, the two are trivially equivalent (discounting the BPE tokenizer/detokenizer), but not for voice. For voice AI systems, examples of models may be ASR, TTS, LLM, SpeechLLM, OmniLLM, etc.

2. A model is the smallest replaceable unit within a system. For example, an STT model (user speech in / agent text out) often contains a speech encoder + an LLM, but neither of these components can be replaced without having to train the model again.

3. A speech-to-speech "system" (often called a voice agent) may take many forms and comprise many components, but it is always based on two requirements: (A) response generation --> what/how to respond; (B) duplex control --> when to talk. Traditionally, (A) has been handled through an ASR/LLM/TTS cascade. Most of the current S2S modeling research aims to replace this pipeline with fewer models (either STT+TTS or S2S). Most systems still rely on external VADs and WebRTC for (B), with the famous exception of "full-duplex" models like Moshi.

4a. A SpeechLLM is a model that takes text+speech input, but only generates text output. It is also called a "speech understanding" model.

4b. An OmniLLM is a SpeechLLM that also generates speech (either codecs or continuous latents). It is also called a "speech generation" model (not to be confused with a TTS).

5. A speech-to-speech system is considered "realtime" if it satisfies 3 conditions: low latency (< 1s), streaming audio in/out, and barge-in/interruption handling. It can also be called a full-duplex system (not to be confused with a full-duplex "model").
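A minimal sketch of points 1-3 of this taxonomy, with every interface hypothetical: a cascaded voice agent where each model is independently replaceable and duplex control is a separate component:

```python
# Sketch of the model-vs-system distinction (points 1-3); every interface
# here is hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VoiceAgent:
    # (A) response generation: a classic ASR -> LLM -> TTS cascade
    asr: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    # (B) duplex control: decides *when* to talk (e.g. an external VAD)
    should_respond: Callable[[bytes], bool]

    def step(self, audio_in: bytes) -> Optional[bytes]:
        # Each model (ASR, LLM, TTS) is a "smallest replaceable unit":
        # any one can be swapped without retraining the others.
        if not self.should_respond(audio_in):
            return None
        return self.tts(self.llm(self.asr(audio_in)))
```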

Ming Tu@tuming628·
@rdesh26 I believe ChatGPT (the system a user actually uses through the app) is such a system. It's not just the model weights.
Desh Raj@rdesh26·
@tuming628 Can you share an example of such a system?
Ming Tu@tuming628·
@oran_ge But Doubao still has to go on the Spring Festival Gala to push DAU. Can the elite's demand for productivity gains really match, at scale, the general public's demand for using Doubao to retouch photos? DAU is not the liability; the number of free tokens is the liability.
Orange AI@oran_ge·
After Huxiu reposted this article, its readership on WeChat passed 100,000; it has broken out of its circle on a small scale. My official account gained 3,000 followers in a single day. But I had to restrict comments to accounts that have followed for seven days. In Huxiu's comment section I can see the attacks, the mockery, the confusion, the bewilderment. I completely understand, because a week ago I would probably have reacted the same way to this article. But I have already come over. A person's transformation happens in an instant. I look forward to more people coming over as well.
Orange AI@oran_ge

x.com/i/article/2020…

Ming Tu@tuming628·
@zebgou Which one generalizes better: scaling pretraining or scaling RL?
Zhibin Gou@zebgou·
If Gemini-3 proved continual pretraining scaling, DeepSeek-V3.2-Speciale proves scaling RL with large context. We spent a year pushing DeepSeek-V3 to its limits. The lesson: post-training bottlenecks are solved by refining methods and data, not just by waiting for a better base.
DeepSeek@deepseek_ai

🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents!
🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.
📄 Tech report: huggingface.co/deepseek-ai/De… 1/n

Ming Tu@tuming628·
Claiming 'scaling is over' is a placebo for organizations that simply lack the capacity to scale.
Oriol Vinyals@OriolVinyalsML

The secret behind Gemini 3? Simple: Improving pre-training & post-training 🤯

Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight!

Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 hasn't been an exception, thanks to our stellar team.

Congratulations to the whole team 💙💙💙

Ming Tu@tuming628·
ParaS2SAlign: an RL framework for aligning any spoken language model to make it "paralinguistic-aware," enabling it to better understand nuanced vocal characteristics and to generate speech responses that take those characteristics into account.
Ming Tu@tuming628·
The main contribution of this paper is twofold. ParaS2SBench: a new benchmark designed to evaluate how well spoken language models handle paralinguistic features (like emotion, tone, and prosody) in speech-to-speech interactions.