Xiaofei Wang

7 posts

@Orpheus_wang

Researcher @Microsoft

Joined February 2012

48 Following · 14 Followers
Xiaofei Wang retweeted
Cheng Han Chiang (姜成翰)
🎉 Excited to share that our paper on audio-LLM-as-a-judge has been accepted to EMNLP 2025 Findings!
🔗 arxiv.org/abs/2506.05984…
🗝️ Highlights:
🧑‍⚖️ Agreement between human and audio-LLM judges can be as high as human-human agreement
👑 Gemini-2.5-pro outperforms GPT-4o-audio as a speaking-style judge
🗣️ There's still room for improvement in style following & natural dialogue generation for SLMs
Xiaofei Wang retweeted
SLT 2024 @ieee_slt
Best Paper Awards of SLT 2024:
- Contextualized Automatic Speech Recognition with Dynamic Vocabulary
- E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Congratulations!
Xiaofei Wang retweeted
AK @_akhaliq
Microsoft presents ELaTE: Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
paper page: huggingface.co/papers/2402.07…
ELaTE is a zero-shot text-to-speech (TTS) system that can generate natural laughing speech from any speaker, based on a speaker prompt to mimic the voice characteristics, a text prompt to indicate the content of the generated speech, and an input to control the laughter expression.
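The tweet above names three conditioning inputs for ELaTE (speaker prompt, text prompt, laughter control). A minimal sketch of how such a conditioning interface might look, assuming a frame-level laughter-intensity signal; the class and field names here are purely illustrative, not the actual ELaTE API:

```python
# Illustrative sketch of ELaTE-style conditioning inputs as described in the
# tweet. All names (ElateCondition, laughter_control, etc.) are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class ElateCondition:
    speaker_prompt_wav: List[float]  # short audio clip giving voice identity
    text_prompt: str                 # content of the speech to generate
    laughter_control: List[float]    # frame-level laughter intensity in [0, 1]

    def validate(self) -> bool:
        # The laughter signal modulates expression over time, so keep it in
        # a normalized range.
        return all(0.0 <= v <= 1.0 for v in self.laughter_control)

cond = ElateCondition(
    speaker_prompt_wav=[0.0, 0.1, -0.1],
    text_prompt="That is hilarious!",
    laughter_control=[0.0, 0.5, 1.0, 0.3],
)
print(cond.validate())  # True
```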
Xiaofei Wang retweeted
AK @_akhaliq
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
paper page: huggingface.co/papers/2308.06…
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
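The abstract above says a single codec language model handles many tasks via task-dependent prompting: a task token in the prompt selects which transformation the shared model performs. A minimal sketch of that idea, assuming the prompt is a flat token sequence; the token names and function below are illustrative, not the actual SpeechX implementation:

```python
# Illustrative sketch of task-dependent prompting as described in the SpeechX
# abstract: one shared model, with a task token routing among tasks. All
# token strings and the build_prompt helper are hypothetical.

# Illustrative task tokens for the tasks the abstract lists.
TASK_TOKENS = {
    "tts": "<tts>",
    "noise_suppression": "<ns>",
    "target_speaker_extraction": "<tse>",
    "speech_removal": "<sr>",
    "speech_editing": "<se>",
}

def build_prompt(task, text_tokens, acoustic_tokens):
    """Assemble the conditioning sequence for the codec language model.

    The leading task token tells the single shared model which
    transformation to apply to the same text/audio inputs.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    return [TASK_TOKENS[task]] + list(text_tokens) + list(acoustic_tokens)

# The same inputs routed to two different tasks differ only in the task token.
tts_prompt = build_prompt("tts", ["hello", "world"], ["a1", "a2"])
ns_prompt = build_prompt("noise_suppression", ["hello", "world"], ["a1", "a2"])
print(tts_prompt)  # ['<tts>', 'hello', 'world', 'a1', 'a2']
```

This is what makes the model "unified and extensible": adding a task means adding a token (and training data), not a new model.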
Xiaofei Wang retweeted
Takuya Yoshioka @_ty274
SpeechX from our new paper is a single generative model that edits, enhances & creates speech, enabling zero-shot TTS, spoken content editing (while preserving ambience), speaker extraction & speech/noise removal.
Demo: aka.ms/speechx
Paper: arxiv.org/abs/2308.06873