AlphaCephei

1.8K posts

@alphacep

Developers of Vosk Speech Toolkit

α Cep / Astrakhan, Russia · Joined October 2019
470 Following · 1K Followers
AlphaCephei@alphacep·
@CatGodSandHive Unfortunately, we haven't reached the point where we can check pronunciation easily. We would need a reliable GOP (goodness of pronunciation) implementation, and one does not exist yet.
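Editor's note: for readers unfamiliar with GOP, here is a minimal sketch of one common simplified form of the score, assuming frame-level phone log-posteriors and a forced alignment are already available from some acoustic model. The function name `gop_scores` and the input layout are hypothetical, and this is not the implementation the author has in mind.

```python
import math

def gop_scores(log_post, alignment):
    """Simplified posterior-based goodness-of-pronunciation scores.

    log_post  : list of frames, each a list of log-posteriors per phone id
                (assumed to come from an acoustic model; hypothetical layout)
    alignment : list of (phone_id, start_frame, end_frame), end exclusive
    Returns one score per aligned phone; 0.0 means the canonical phone was
    the top hypothesis on every frame, more negative means a worse match.
    """
    scores = []
    for phone_id, start, end in alignment:
        frames = log_post[start:end]
        if not frames:
            raise ValueError("empty segment in alignment")
        # Average log-posterior of the phone the speaker was supposed to say.
        canonical = sum(f[phone_id] for f in frames) / len(frames)
        # Average log-posterior of whichever phone the model preferred.
        best = sum(max(f) for f in frames) / len(frames)
        scores.append(canonical - best)
    return scores

# Toy posteriors for 3 frames and 3 phones (made-up numbers).
posteriors = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
log_post = [[math.log(p) for p in frame] for frame in posteriors]
print(gop_scores(log_post, [(0, 0, 2), (2, 2, 3)]))
```

The hard part the tweet points at is not this arithmetic but getting posteriors and alignments that are reliable enough for the score to mean anything.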
CatGod@CatGodSandHive·
@alphacep "Pronunciation issues" are a good test, because they don't just check for "smoothness." They highlight where models still stumble.
AlphaCephei@alphacep·
We systematically test modern TTS engines on a Russian dataset. Qwen feels like the most interesting one: good clarity and sound quality, reasonable intonation. It has pronunciation issues, as always; that is a common thing. VibeVoice hallucinates. Fish is reasonable but a bit plain.
[Image attached]
AlphaCephei@alphacep·
@FeitengLi Here is the updated table with 0.6B included. Also good actually, thanks for the suggestion.
[Image attached: updated table]
AlphaCephei@alphacep·
It would be nice to find English data for a similar test. Many people test CER, but few test signal quality (UTMOS or similar) and intonation.
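Editor's note: as a reminder of what the CER part of such a test measures, here is a minimal sketch. It assumes the usual setup where the TTS output is transcribed by an ASR system and the transcript is compared with the input text; the function names are illustrative and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted character out of three.
print(cer("кот", "кит"))
```

CER only says whether the words survived the round trip. Signal quality and intonation need separate measures, which is the gap the tweet describes.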
AlphaCephei@alphacep·
@rdesh26 "Comparable or greater accuracy" is simply not true. TDT is usually less accurate in noisy conditions because duration prediction fails. Only good for slow clean speech. Depth-first convolutions add to that.
Desh Raj@rdesh26·
A great write-up about the token-and-duration transducer, and why it makes NVIDIA's ASR models faster than anyone else!
Speechmatics@Speechmatics

The @HuggingFace Open ASR Leaderboard RTFx column is dominated by one model family. 😯 The mechanism is a modified forward-backward algorithm. 👇
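Editor's note: RTFx is the inverse real-time factor, that is, seconds of audio processed per second of compute, so higher is faster. A minimal sketch of the arithmetic, with made-up timings and a hypothetical function name:

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor over a test set.

    audio_seconds      : durations of the audio files, in seconds
    processing_seconds : wall-clock transcription time per file, in seconds
    """
    total_audio = sum(audio_seconds)
    total_processing = sum(processing_seconds)
    if total_processing <= 0:
        raise ValueError("total processing time must be positive")
    return total_audio / total_processing

# Illustrative numbers only: 30 s of audio transcribed in about 0.3 s.
print(rtfx([10.0, 20.0], [0.1, 0.2]))
```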

AlphaCephei@alphacep·
@realmrfakename The real secret is that you don't have to use a network of 2M params; those are not going to be reliable in noise. A lightweight ASR of 20M params can do much better.
AlphaCephei@alphacep·
@ZDi____ Hope you know about sidon as well; it is an improved version.
ZD1908@ZDi____·
How did I not know this before? The guys over at Google proved that you can have a very good speech restoration model by using features extracted from a pretrained model.
[Two images attached]
AlphaCephei reposted
William Chen@chenwanch1·
What if you had nano-banana for audio? AudioChat is a multi-modal LM that performs fine-grained understanding, generation, and editing of multi-source scenes. By diffusing continuous latents, it generates 48 kHz stereo edits with great input adherence: wanchichen.github.io/audiochat/
AlphaCephei reposted
Kaitlyn Zhou@KaitlynZhou·
Text-to-speech models can’t get your address right? Turns out you’re not the only one. 📢New preprint! State-of-the-art speech models get 44% of street names wrong — and non-English primary speakers suffer twice the error impact!
[Image attached]
AlphaCephei reposted
OpenMOSS@Open_MOSS·
🚀 The MOSS-TTS Family is here. From zero-shot cloning to real-time VoiceAgents, we have released our most powerful suite of audio models yet.
The Lineup:
MOSS-TTS Flagship: The industry's best zero-shot voice cloning. Features precise control over duration & Pinyin, capable of generating 1 hour of speech.
MOSS-TTSD-v1.0: A new standard for dialogue generation. Comprehensive optimization for conversational scenes and small languages. Best-in-class performance in all evaluations.
MOSS-VoiceGenerator: One-shot timbre generation. Create voices with a single sentence and complex instruction handling.
MOSS-TTS-Realtime: Built for the next era of VoiceAgents. Synthesis starts in just 2 characters for instant response.
MOSS-SoundEffect: Text-to-Audio sound effects to expand your creative toolkit.
🔥 Try it now: studio.mosi.cn/voice-synthesis
💻 Deploy (GitHub): github.com/OpenMOSS/MOSS-…
🔌 API Docs: studio.mosi.cn/docs/moss-tts
Welcome to our demo. The era of 'childhood' for TTS is over.
#MOSS #AI #TextToSpeech #TTS #OpenClaw #Agent #OpenMOSS #Opensource #VoiceAgent
AlphaCephei@alphacep·
@huseinzol05 The UTMOS score is low for audio with clicks or noise corruption such as metallic noise. If your speech is clean and natural-sounding but the phones sound different, your UTMOS will still be good.
husein@huseinzol05·
@alphacep I don't get what you mean by "smoothness". Can you clarify?
AlphaCephei reposted
DailyPapers@HuggingPapers·
Google just released Waxal on Hugging Face: 1,250 hours of transcribed speech in 14 African languages for ASR and 240 hours for TTS, covering over 100 million speakers. huggingface.co/datasets/googl…
AlphaCephei@alphacep·
Active marketing in voice tech is a new thing as of last year (our subreddit is full of dumb product mentions).