Ming Tu
@tuming628
352 posts

Research scientist on Speech and Natural Language Processing. My tweets are my own and can be crawled as training data freely.

San Jose, CA · Joined September 2012
321 Following · 157 Followers
Ming Tu@tuming628·
The world doesn't wait its turn. Neither should conversational AI. Let's step into the Full-Duplex era.
Ming Tu@tuming628·
For the past year, I have been working on the development of Seeduplex. Today, we officially launched the industry's first native full-duplex speech LLM, completely replacing the half-duplex and turn-by-turn system released early last year. seed.bytedance.com/en/seeduplex
Ming Tu@tuming628·
This architecture is now fully deployed in production on the Doubao App, processing continuous, real-time voice interactions for hundreds of millions of users. Read the technical blog for more details.
Ming Tu@tuming628·
@JulianSlzr @rdesh26 True. It doesn't need to be an end-to-end model to achieve a full-duplex experience, especially if we think of text-to-speech as a tool that can be called when the model/system believes it's time to say something.
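As a rough sketch of that idea (not Seeduplex's actual design), the loop below keeps listening while treating TTS as a tool the model can invoke; `listen_chunks`, `lm_step`, `tts`, and `player` are hypothetical stand-ins:

```python
# Sketch only: `listen_chunks` yields user audio chunks, `lm_step` is the
# model's per-chunk decision step, and `tts`/`player` are hypothetical.
import queue
import threading

def run_agent(listen_chunks, lm_step, tts, player):
    """Listen continuously; call TTS only when the model decides to speak."""
    jobs = queue.Queue()

    def speaker():
        # Synthesize and play queued utterances until a None sentinel arrives.
        while (text := jobs.get()) is not None:
            player.play(tts(text))  # TTS invoked as a tool, on demand

    threading.Thread(target=speaker, daemon=True).start()

    state = None
    for chunk in listen_chunks():         # input never stops (full duplex)
        state, action = lm_step(state, chunk)
        if action == "interrupt":         # user barged in: cut playback
            player.stop()
        elif action is not None:          # a string the model wants spoken
            jobs.put(action)
    jobs.put(None)                        # shut the speaker thread down
```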
Julian Salazar@JulianSlzr·
@rdesh26 Thanks Desh! Since then, how others ended up using the term has made me disambiguate full-duplex into:
- a system/experience: ongoing listening and reaction
- an end-to-end architecture: a model that always conditions on and produces audio streams
Desh Raj@rdesh26·
This ~1.5 year old thread from @JulianSlzr is the most precise way to think about speech-to-speech models.

The best "full-duplex" models these days are usually chunk-wise time-multiplexed (rather than multi-stream like Moshi). In Julian's terminology, they lie somewhere between "turn-taking" and true "full-duplex". This means that after every K user audio tokens, the model gets a chance to generate some response tokens.

The choice of K then becomes a critical design decision: a smaller K improves responsiveness (e.g. model can respond to interruptions faster), but increases the number of prefills needed during a session.

Here is an example of such a system, MiniCPM-o-4.5:
[attached image]
Julian Salazar@JulianSlzr

Note: I only speculate from announcements/demos. Views mine, not GDM's.

My definitions:
- e2e = a model operating directly on audio (tokens)
- turn-based = audio sequences in, audio sequences out
- full-duplex = audio outputs are always conditioned on latest inputs [2/n]
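To make the chunk-wise time-multiplexing above concrete, here is a minimal sketch of the decode loop Desh describes, assuming a hypothetical `model.prefill`/`model.generate` interface and a made-up `<silent>` stop token:

```python
# Hypothetical interface: model.prefill(tokens) ingests user audio tokens;
# model.generate(max_tokens, stop) emits response tokens (possibly none).
K = 8  # user audio tokens consumed between response opportunities
M = 4  # max response tokens the model may emit per opportunity

def duplex_decode(model, user_audio_tokens):
    """Interleave listening and speaking on one time-multiplexed stream."""
    response = []
    for i in range(0, len(user_audio_tokens), K):
        model.prefill(user_audio_tokens[i:i + K])  # one prefill per chunk
        # Emitting zero tokens (immediate stop) amounts to staying silent.
        response.extend(model.generate(max_tokens=M, stop="<silent>"))
    return response

# Trade-off: smaller K reacts to interruptions within ~K audio tokens,
# but a session of T user tokens costs T/K prefill calls.
```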

Ming Tu@tuming628·
@CarlZha The door mat with "taoyuanli" kills it. It should be in Chinese calligraphy.
Carl Zha@CarlZha·
Real estate developers in Suzhou, China are bringing back traditional architecture to modern living
Ming Tu@tuming628·
@rdesh26 One example is using an LLM to generate both the slides and the spoken presentation. In this case, the slide content and the spoken content may differ, so the LLM needs to decide when to take the action of calling TTS.
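A toy illustration of that decision, with a purely hypothetical `add_slide`/`speak` tool schema (not from any cited system):

```python
# Purely illustrative action trace for a slides+narration agent; the
# `add_slide`/`speak` schema is made up for this example.
actions = [
    {"tool": "add_slide",
     "args": {"bullets": ["Full-duplex speech LLMs",
                          "Chunk-wise time-multiplexing"]}},
    # The narration differs from the on-slide text, and the model picks
    # the moment to invoke TTS instead of speaking continuously.
    {"tool": "speak",
     "args": {"text": "Let's start with why half-duplex systems feel "
                      "unnatural in conversation."}},
]

def dispatch(action, slides, tts, player):
    # Route each model-chosen action to the right tool.
    if action["tool"] == "add_slide":
        slides.append(action["args"]["bullets"])
    elif action["tool"] == "speak":
        player.play(tts(action["args"]["text"]))
```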
Desh Raj@rdesh26·
I think AVM is a half-duplex S2S model with function calling, similar to Gemini Live or Nova Sonic. Nevertheless, I have been hearing a lot of buzz these days about LLMs using ASR/TTS as agents, but I have yet to hear a convincing argument for how this is different from a cascaded pipeline.
Ming Tu@tuming628·
Even for text-only systems, a system includes AI models, tools, and a harness (tool use, context & memory management, etc.). In this sense, both ASR and TTS can be considered tools.
Desh Raj@rdesh26

Voice AI has grown a lot recently, and definitions of models/systems have become somewhat vague. Let's put down some basics.

1. AI "models" are not AI "systems". Models are the core units that build up a system. For text-only systems, the two are trivially equivalent (discounting the BPE tokenizer/detokenizer), but not for voice. For voice AI systems, examples of models may be ASR, TTS, LLM, SpeechLLM, OmniLLM, etc.

2. A model is the smallest replaceable unit within a system. For example, an STT model (user speech in / agent text out) often contains a speech encoder + an LLM, but neither of these components can be replaced without having to train the model again.

3. A speech-to-speech "system" (often called a voice agent) may take many forms and comprise many components, but it is always based on two requirements: (A) response generation --> what/how to respond; (B) duplex control --> when to talk. Traditionally, (A) has been handled through an ASR/LLM/TTS cascade. Most of the current S2S modeling research aims to replace this pipeline with fewer models (either STT+TTS or S2S). Most systems still rely on external VADs and WebRTC for (B), with the famous exception of "full-duplex" models like Moshi.

4a. A SpeechLLM is a model that takes text+speech input, but only generates text output. It is also called a "speech understanding" model.

4b. An OmniLLM is a SpeechLLM that also generates speech (either codecs or continuous latents). It is also called a "speech generation" model (not to be confused with a TTS).

5. A speech-to-speech system is considered "realtime" if it satisfies 3 conditions: low latency (< 1s), streaming audio in/out, and barge-in/interruption handling. It can also be called a full-duplex system (not to be confused with a full-duplex "model").
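A minimal sketch of points 1-3 of this taxonomy, with every interface hypothetical: a cascaded voice agent where each model is independently replaceable and duplex control is a separate component:

```python
# Sketch of the model-vs-system distinction (points 1-3); every interface
# here is hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VoiceAgent:
    # (A) response generation: a classic ASR -> LLM -> TTS cascade
    asr: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    # (B) duplex control: decides *when* to talk (e.g. an external VAD)
    should_respond: Callable[[bytes], bool]

    def step(self, audio_in: bytes) -> Optional[bytes]:
        # Each model (ASR, LLM, TTS) is a "smallest replaceable unit":
        # any one can be swapped without retraining the others.
        if not self.should_respond(audio_in):
            return None
        return self.tts(self.llm(self.asr(audio_in)))
```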

Ming Tu@tuming628·
@rdesh26 I believe ChatGPT (the system a user actually uses through the app) is such a system. It's not just the model weights.
Desh Raj@rdesh26·
@tuming628 Can you share an example of such a system?
Ming Tu@tuming628·
@oran_ge But Doubao still has to go on the Spring Festival Gala to push DAU. Can the elite's demand for productivity gains really match, at scale, the general public's demand for using Doubao to retouch photos? DAU is not the liability; the number of free tokens is the liability.
Orange AI@oran_ge·
After Huxiu reposted this article, its readership on WeChat passed 100,000; it has broken out of its circle on a small scale. My official account gained 3,000 followers in a single day. But I had to restrict comments to accounts that have followed for seven days. In Huxiu's comment section I can see the attacks, the mockery, the confusion, the bewilderment. I completely understand, because a week ago I would probably have reacted the same way to this article. But I have already come over. A person's transformation happens in an instant. I look forward to more people coming over as well.
Orange AI@oran_ge

x.com/i/article/2020…

Ming Tu@tuming628·
@zebgou Which one generalizes better: scaling pretraining or scaling RL?
Zhibin Gou@zebgou·
If Gemini-3 proved continual pretraining scaling, DeepSeek-V3.2-Speciale proves scaling RL with large context. We spent a year pushing DeepSeek-V3 to its limits. The lesson: post-training bottlenecks are solved by refining methods and data, not just by waiting for a better base.
DeepSeek@deepseek_ai

🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents!
🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.
📄 Tech report: huggingface.co/deepseek-ai/De… 1/n

Ming Tu@tuming628·
Claiming 'scaling is over' is a placebo for organizations that simply lack the capacity to scale.
Oriol Vinyals@OriolVinyalsML

The secret behind Gemini 3? Simple: Improving pre-training & post-training 🤯

Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight!

Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 hasn't been an exception, thanks to our stellar team.

Congratulations to the whole team 💙💙💙

Ming Tu@tuming628·
ParaS2SAlign: an RL framework for aligning any spoken language model to make it "paralinguistic-aware," enabling it to better understand nuanced vocal characteristics and to generate speech responses that take those characteristics into account.
Ming Tu@tuming628·
The main contribution of this paper is twofold. ParaS2SBench: a new benchmark designed to evaluate how well spoken language models handle paralinguistic features (like emotion, tone, and prosody) in speech-to-speech interactions.