Xiaofei Wang

7 posts

@Orpheus_wang

Researcher @Microsoft

Joined February 2012

48 Following · 14 Followers
Xiaofei Wang retweeted
Cheng Han Chiang (姜成翰)
🎉 Excited to share that our paper on audio-LLM-as-a-judge has been accepted to EMNLP 2025 Findings!
🔗 arxiv.org/abs/2506.05984…
🗝️ Highlights:
🧑‍⚖️ Agreement between human and audio-LLM judges can be as high as human-human agreement
👑 Gemini-2.5-pro outperforms GPT-4o-audio as a speaking-style judge
🗣️ There's still room for improvement in style following & natural dialogue generation for SLMs
Xiaofei Wang retweeted
SLT 2024 @ieee_slt
Best Paper Awards of SLT 2024:
- Contextualized Automatic Speech Recognition with Dynamic Vocabulary
- E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Congratulations!
Xiaofei Wang retweeted
AK @_akhaliq
Microsoft presents ELaTE: Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
paper page: huggingface.co/papers/2402.07…
ELaTE is a zero-shot text-to-speech (TTS) system that can generate natural laughing speech from any speaker, based on a speaker prompt to mimic the voice characteristics, a text prompt to indicate the content of the generated speech, and an input to control the laughter expression.
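The tweet above names three conditioning inputs for ELaTE (speaker prompt, text prompt, laughter control). A minimal sketch of how such a conditioning interface might look, assuming a frame-level laughter-intensity signal; the class and field names here are purely illustrative, not the actual ELaTE API:

```python
# Illustrative sketch of ELaTE-style conditioning inputs as described in the
# tweet. All names (ElateCondition, laughter_control, etc.) are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class ElateCondition:
    speaker_prompt_wav: List[float]  # short audio clip giving voice identity
    text_prompt: str                 # content of the speech to generate
    laughter_control: List[float]    # frame-level laughter intensity in [0, 1]

    def validate(self) -> bool:
        # The laughter signal modulates expression over time, so keep it in
        # a normalized range.
        return all(0.0 <= v <= 1.0 for v in self.laughter_control)

cond = ElateCondition(
    speaker_prompt_wav=[0.0, 0.1, -0.1],
    text_prompt="That is hilarious!",
    laughter_control=[0.0, 0.5, 1.0, 0.3],
)
print(cond.validate())  # True
```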
Xiaofei Wang retweeted
AK @_akhaliq
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
paper page: huggingface.co/papers/2308.06…
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
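The abstract above says a single codec language model handles many tasks via task-dependent prompting: a task token in the prompt selects which transformation the shared model performs. A minimal sketch of that idea, assuming the prompt is a flat token sequence; the token names and function below are illustrative, not the actual SpeechX implementation:

```python
# Illustrative sketch of task-dependent prompting as described in the SpeechX
# abstract: one shared model, with a task token routing among tasks. All
# token strings and the build_prompt helper are hypothetical.

# Illustrative task tokens for the tasks the abstract lists.
TASK_TOKENS = {
    "tts": "<tts>",
    "noise_suppression": "<ns>",
    "target_speaker_extraction": "<tse>",
    "speech_removal": "<sr>",
    "speech_editing": "<se>",
}

def build_prompt(task, text_tokens, acoustic_tokens):
    """Assemble the conditioning sequence for the codec language model.

    The leading task token tells the single shared model which
    transformation to apply to the same text/audio inputs.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    return [TASK_TOKENS[task]] + list(text_tokens) + list(acoustic_tokens)

# The same inputs routed to two different tasks differ only in the task token.
tts_prompt = build_prompt("tts", ["hello", "world"], ["a1", "a2"])
ns_prompt = build_prompt("noise_suppression", ["hello", "world"], ["a1", "a2"])
print(tts_prompt)  # ['<tts>', 'hello', 'world', 'a1', 'a2']
```

This is what makes the model "unified and extensible": adding a task means adding a token (and training data), not a new model.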
Xiaofei Wang retweeted
Takuya Yoshioka @_ty274
SpeechX from our new paper is a single generative model that edits, enhances & creates speech, enabling zero-shot TTS, spoken content editing (while preserving ambience), speaker extraction & speech/noise removal.
Demo: aka.ms/speechx
Paper: arxiv.org/abs/2308.06873