AlphaCephei

@alphacep

Developers of Vosk Speech Toolkit

α Cep / Astrakhan, Russia · Joined October 2019

480 Following · 1K Followers

Pinned Tweet

AlphaCephei@alphacep·16 Jun

Voting, Ensembles and bringing AI to life alphacephei.com/nsh/2022/06/14…

AlphaCephei retweeted

ModelScope@ModelScope2022·16h

MOSS-TTS v1.5 is here, an upgrade to v1.0 from @OpenMOSS. (demo👇)
🤖 modelscope.ai/models/OpenMOS…
Key improvements:
⏸️ Inline pause control: [pause 3.2s] now supported mid-sentence
🌍 31 languages, up from 20; now includes Cantonese, Hindi, Thai, Vietnamese, Tagalog, Swahili and more
🎙️ More stable voice cloning with reduced variance across repeated generations
📝 Better long-reference, short-text cloning
All v1.0 capabilities preserved: zero-shot cloning, long-form speech, Pinyin/IPA control, code-switching.
💻 github.com/OpenMOSS/MOSS-…

AlphaCephei@alphacep·1d

Some of our recent work on model training, nothing very deep but still important for users alphacephei.com/nsh/2026/05/24…

AlphaCephei retweeted

Xie Zhifei@XieZhifei14110·6d

Stop using Whisper for ASR! Open-sourcing Mega-ASR, the first full-scenario SOTA industrial-grade ASR model, built for the audio nobody else can crack: far-field, reverb, electrical hum, device noise, the real-world mess. Beats open + closed SOTA by 10–30% on real-world benchmarks. The harder the audio is for humans, the bigger the lead.

AlphaCephei retweeted

Desh Raj@rdesh26·12 May

x.com/i/article/2054…

AlphaCephei@alphacep·7 May

@Muramasa_2 It's all great only if there are 100 participants and 200 publications, not 5k/20k.

Muramasa@Muramasa_2·7 May

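I really like international conferences, since you can talk directly with researchers from many countries and ask them questions, so I hope they somehow don't become a thing of the past... (I mean it)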

Muramasa@Muramasa_2·7 May

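To keep international conferences from becoming a thing of the past, I feel the only way is for many different people to ask questions before the planted questioners do.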

AlphaCephei retweeted

steven@Tu7uruu·6 May

Big announcement for speech AI. Benchmarks get gamed. So we added a repellent. The Open ASR Leaderboard now includes private evaluation data from Appen and DataoceanAI, making speech recognition benchmarks more robust against test-set contamination and “benchmaxxing.” Better signal. Less overfitting. More real-world ASR.

AlphaCephei retweeted

MOSI@MosiAI_Official·24 Apr

Meet MOSS-Audio. A unified open-source model for real-world audio understanding, built to handle speech, emotion, speakers, sound events, music, temporal grounding, and reasoning in one system. In the reported evaluation, our 4B model outperforms many 7B–9B open models, and MOSS-Audio-8B-Thinking reaches 71.08 average accuracy. Strong results. Real-world audio. GitHub: github.com/OpenMOSS/MOSS-… HuggingFace: huggingface.co/collections/Op… MOSI.AI: mosi.cn OpenMOSS: open-moss.com

AlphaCephei@alphacep·24 Apr

@ryu0000000001 Almost nothing even for MSA. Most of the data is automatically annotated and the quality of such annotation is way worse than English/Chinese.

ryu@ryu0000000001·23 Apr

@alphacep Isn't Arabic quite different based on the region? I wonder how much data there is for the standard kind

AlphaCephei@alphacep·23 Apr

Many researchers study under-resourced languages. Even major languages suffer from a lack of high-quality datasets. Arabic, Bengali and many others never had any significant amount of data with high-quality annotations. MACS-Arabic transcripts are from YouTube. Same as Yodas.

AlphaCephei retweeted

DeepL@DeepLcom·16 Apr

In less than an hour, we’ll be live with DeepL Spring Launch — and this is one of the breakthroughs we’re most excited to share with you: real-time, spoken translation. You speak in your preferred language. Everyone else hears you in theirs. Jarek and the DeepL team will be demonstrating voice-to-voice translation live onstage from 4pm CEST / 10am EDT. We’ll see you there! #DeepLVoice #VoiceToVoice #SpringLaunch #LanguageSolved

AlphaCephei@alphacep·19 Apr

betrac.github.io BeTraC is a shared evaluation challenge building end-to-end speech models for clinical dialog analysis.

AlphaCephei retweeted

DailyPapers@HuggingPapers·18 Apr

Tencent & HKUST release Audio-Omni. First unified framework for audio understanding, generation, and editing across sound, music, and speech. Combines a frozen MLLM (Qwen2.5-Omni) with a trainable Diffusion Transformer for high-fidelity synthesis.

AlphaCephei retweeted

Haesung Jeon@jeon_haesung·11 Apr

Updated table, with two additional models:
- FunAudioLLM/SenseVoiceSmall
- RaonSpeech/Raon-Speech-9B
Also changed text normalization to a more Korean-specific approach. RaonSpeech/Raon-Speech-9B is the first model to beat Whisper! Surprisingly, sensevoice-small was quite good and fast.

Haesung Jeon@jeon_haesung

Various ASR models supporting Korean have been released (qwen3-asr, cohere-transcribe, voxtral-realtime). The real question is "Is it better than Whisper?" So I tested them on various Korean datasets. github.com/seastar105/oss… And still, whisper-large-v3 is king.

AlphaCephei retweeted

Firoj Alam@firojalam04·9 Apr

Arabic is spoken by 400M+ people across dozens of dialects — yet most speech AI covers only a handful. The bottleneck? Diverse, high-quality data. We introduce MENASpeechBank: a large-scale, publicly available speech dataset spanning 124 speakers, 18 MENA countries, 5 dialect groups, and 417K persona-conditioned conversations — designed to advance speech understanding in LLMs across the full spectrum of Arabic. 📄 arxiv.org/abs/2602.07036 🤗 Data: huggingface.co/datasets/QCRI/… W/ @shammur_absar Zein, Hunzalah, Rabindra #AudioLLM #ArabicNLP #SpeechAI

AlphaCephei retweeted

Chayenne Zhao@GenAI_is_real·8 Apr

Omni Model Inference: How We Move Tensors Between Stages

A friend once asked me: what's the fundamental difference between serving Omni multimodal models and serving plain LLMs? I thought about it and the simplest way to put it is this: a regular LLM handles a request with a single model in a single process; an Omni model handles a request by relaying it across multiple models.

Take Qwen3 Omni as an example. The lifecycle of a voice conversation request looks roughly like this: the user sends an audio clip, which first passes through an audio encoder that converts the waveform into embeddings, then feeds into the Thinker (a large language model) for inference to generate text tokens, and finally those tokens stream into the Talker (a speech synthesis model) that progressively produces audio waveforms to return to the user. TTS models with a Dual-AR architecture follow a similar pattern: a large AR model generates coarse-grained tokens, a small AR model fills in the fine-grained tokens, and a vocoder synthesizes the final audio.

These models have inherent data dependencies: if the Thinker doesn't produce tokens, the Talker has nothing to consume; if the large AR doesn't emit coarse tokens, the small AR can't fill in fine tokens. But at the same time, they must run in parallel: the Talker neither needs to nor can afford to wait for the Thinker to finish generating everything before it starts. Otherwise, the user would wait several seconds before hearing the first syllable, and the experience would completely fall apart. The dependency between them is streaming: as soon as the upstream produces a small chunk of data, the downstream must consume it immediately.

This is why we split the entire inference pipeline into multiple stages, each running a component model, with intermediate results passed between stages via inter-process communication.
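
The streaming dependency described above can be illustrated with a toy two-stage pipeline. This is only a sketch under loud assumptions: threads and an in-process `queue.Queue` stand in for the separate processes and inter-process communication the article describes, and `thinker`, `talker`, and `run_pipeline` are hypothetical names, not part of any real serving stack.

```python
import queue
import threading

SENTINEL = None  # marks the end of the upstream token stream


def thinker(out_q, tokens):
    """Upstream stage: emits text tokens one at a time (stand-in for an LLM)."""
    for tok in tokens:
        out_q.put(tok)
    out_q.put(SENTINEL)


def talker(in_q, audio_chunks):
    """Downstream stage: consumes each token as soon as it arrives."""
    while True:
        tok = in_q.get()  # blocks until the upstream has produced something
        if tok is SENTINEL:
            break
        audio_chunks.append(f"audio({tok})")  # placeholder for synthesized audio


def run_pipeline(tokens):
    """Run both stages concurrently and return the chunks the Talker produced."""
    q = queue.Queue()
    audio_chunks = []
    stages = [
        threading.Thread(target=thinker, args=(q, tokens)),
        threading.Thread(target=talker, args=(q, audio_chunks)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return audio_chunks
```

The point of the sketch is only that the downstream stage starts working on the first token without waiting for the full upstream output; it says nothing about how tensors are actually moved.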

Each stage is an independent process with its own GPU/hardware management, its own scheduling loop, and its own batch management. The upstream stage produces tensors, the downstream stage consumes tensors: a textbook producer-consumer relationship.

But "passing tensors between processes" actually breaks down into two fundamentally different concerns. The first is signaling: telling the downstream that data is ready. A few dozen bytes, demanding low latency at the microsecond level. The second is data transfer: moving tens of megabytes of tensor data from one process to another, demanding throughput, ideally with zero copy.

ZMQ is naturally suited for signaling (lightweight and low-latency), but asking it to transfer a 64MB tensor means serialization overhead that blows up latency. Shared memory + CUDA IPC is naturally suited for moving large blocks of data with near-zero copy, but it has no built-in event notification mechanism; you'd have to resort to polling or bolt on external signaling to notify the downstream.

So our design is straightforward: separate Control Plane from Data Plane. ZMQ handles only notifications (lightweight messages like DataReadyMessage), while the Relay handles only data (tensor transfer via shared memory / NCCL / CUDA IPC), each doing what it does best. Once this separation was established, many downstream architectural decisions fell into place naturally.

With the Control Plane and Data Plane separated, the next natural question is: how exactly does the Data Plane move data? The most intuitive approach is serialization: the upstream serializes the tensor into a byte stream, sends it to the downstream via socket, and the downstream deserializes it back. Logically clean, but the cost is prohibitive: a 64MB tensor going through serialization, memory copy, the network stack, and deserialization every time. The latency and CPU overhead are simply unacceptable in a streaming inference scenario.
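
A minimal sketch of the Control Plane / Data Plane split, using only the Python standard library. Everything here is an assumption for illustration: a `queue.Queue` stands in for ZMQ, `multiprocessing.shared_memory` stands in for the Relay (no CUDA IPC or NCCL), raw bytes stand in for tensors, and the fields of `DataReadyMessage` are hypothetical since the article names the message but not its layout. Both sides run in one process to keep the example short.

```python
import queue
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass
class DataReadyMessage:
    """Control-plane message: tiny, says only where the payload lives."""
    shm_name: str  # hypothetical field: name of the shared memory block
    nbytes: int    # hypothetical field: payload size in bytes


def produce(control_q, payload):
    """Data plane: write the payload to shared memory. Control plane: notify."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    control_q.put(DataReadyMessage(shm.name, len(payload)))
    return shm  # the producer keeps the handle so the block stays alive


def consume(control_q):
    """Wait for a notification, then read the payload from shared memory."""
    msg = control_q.get()
    shm = shared_memory.SharedMemory(name=msg.shm_name)
    try:
        # Copying out is for the demo only; a real consumer would read in place.
        return bytes(shm.buf[:msg.nbytes])
    finally:
        shm.close()


def demo():
    """Round-trip a small payload and report whether it arrived intact."""
    control_q = queue.Queue()
    payload = b"\x01\x02\x03\x04" * 4
    shm = produce(control_q, payload)
    try:
        return consume(control_q) == payload
    finally:
        shm.close()
        shm.unlink()
```

Note that the message on the control channel stays a few dozen bytes no matter how large the payload is, which is the property the article relies on.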

Since upstream and downstream stages run in different processes on the same machine, a more natural approach is shared memory: the upstream writes the tensor directly into a memory region accessible to both processes, and the downstream reads from the same address. No serialization needed, no copy needed, and with CUDA IPC, even GPU tensors can be accessed directly across processes: zero copy in the truest sense.

But shared memory is no free lunch. The biggest question is: who manages the read-write cadence of this memory? Upstream and downstream speeds don't necessarily match: the Thinker might suddenly slow down due to a long context, or the Talker might fall behind because of heavy vocoder computation. If the upstream writes faster than the downstream can consume, the shared memory will eventually be exhausted. This calls for a flow control mechanism.

We chose a credit mechanism, which is essentially a classic semaphore. A fixed number of shared memory slots are pre-allocated between upstream and downstream (say 10 slots, each 64MB), and the credit represents the number of currently available empty slots. Before writing data, the upstream acquires one credit; after writing, it sends a notification via ZMQ. Once the downstream finishes reading, it releases the credit, and the upstream can reuse that slot. When credits are exhausted, the upstream blocks, naturally forming backpressure. The pipeline's throughput automatically degrades to the speed of the slowest stage rather than blowing up memory. This is also why the downstream must consume as quickly as possible after receiving a notification: releasing credits lets the upstream keep pushing forward; otherwise, the entire pipeline stalls.

This approach looks simple at first glance, but it's worth comparing against several common alternatives:

Ring buffer: a fixed-size circular buffer maintaining read and write pointers, blocking when write catches up to read. However, our stages run on different GPUs, and cross-GPU tensor transfer goes through CUDA IPC. CUDA IPC is per-allocation: each slot is an independent cudaMalloc, corresponding to an independent IPC handle, and the downstream's mapped address is determined by the driver, so slots are not contiguous in address space. The "single contiguous memory block" assumption that ring buffers rely on simply doesn't hold here. If you force it, you're just rotating a slot index with modulo, which is logically equivalent to credit counting but adds an unnecessary layer of abstraction.

Dynamic allocation: no pre-allocated fixed slots; malloc new memory each time and free it when done. Maximum flexibility, but in a shared memory context, cross-process shm allocation and deallocation is inherently heavy, and fragmentation accumulates relentlessly in long-running inference services. For a scenario where slot sizes are fixed and quantities are bounded, dynamic allocation is using a cannon to kill a mosquito.

Unbounded queue: unlimited capacity, upstream writes freely. Simplest to implement but provides zero flow control. If the downstream can't keep up, it's OOM. Unacceptable in production.

Drop without backpressure: when the upstream fills up, discard or overwrite. Works for real-time streaming media scenarios where dropping a few video frames goes unnoticed, but in an inference pipeline every token carries semantic meaning, and dropping one means getting it wrong.

Comparing horizontally across these options, for the specific set of constraints we face (cross-process shared memory for large tensors, mismatched upstream/downstream speeds, and long-running operation), pre-allocated fixed slots + semaphore counting is almost the most natural choice: zero fragmentation, bounded memory, built-in backpressure.
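
The credit mechanism can be sketched as a fixed slot pool guarded by a counting semaphore. This is an illustrative sketch, not the authors' implementation: threads stand in for processes, a Python list stands in for the pre-allocated shared memory slots, an in-process queue stands in for the ZMQ notification, and `CreditedSlots` and `run_demo` are hypothetical names.

```python
import queue
import threading


class CreditedSlots:
    """Fixed pool of slots; a counting semaphore tracks the empty ones (credits)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots                # stand-ins for pre-allocated buffers
        self.credits = threading.Semaphore(num_slots)  # number of currently empty slots
        self.free = queue.Queue()                      # indices of empty slots
        for i in range(num_slots):
            self.free.put(i)
        self.ready = queue.Queue()                     # notifications: "slot i is ready"

    def write(self, item):
        """Upstream side: take a credit, fill a slot, notify the downstream."""
        self.credits.acquire()        # blocks when no credits remain: backpressure
        idx = self.free.get()
        self.slots[idx] = item
        self.ready.put(idx)

    def read(self):
        """Downstream side: wait for a notification, read, give the credit back."""
        idx = self.ready.get()
        item = self.slots[idx]
        self.slots[idx] = None
        self.free.put(idx)            # return the slot before releasing the credit
        self.credits.release()
        return item


def run_demo(num_items=20, num_slots=3):
    """Push num_items through a pool of num_slots and return them in arrival order."""
    pool = CreditedSlots(num_slots)
    received = []

    def upstream():
        for i in range(num_items):
            pool.write(i)

    def downstream():
        for _ in range(num_items):
            received.append(pool.read())

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

With the article's example figures (10 slots of 64MB each), a pool like this caps slot memory at 10 × 64MB = 640MB no matter how far the upstream gets ahead; the upstream simply blocks in `write` once all credits are taken.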

The more I work on systems design, the more I feel this: the hard part isn't coming up with a clever solution; it's recognizing, among a pile of solutions that all "seem to work," the one whose constraint alignment is the tightest. The credit mechanism is exactly this: at first glance it seems too textbook, but a textbook solution running stably in production means the problem was modeled correctly in the first place.

AlphaCephei@alphacep·9 Apr

@ayousanz Well, it is probably a mix. They use a style model. Not sure about the license.

ようさん@ayousanz·9 Apr

Isn't it based on Bert-VITS2 rather than StyleTTS?? If it is Bert-VITS2-based, I wonder whether it violates the license.

AlphaCephei@alphacep

Interesting tiny 1.6M params TTS engine, based on StyleTTS github.com/tronghieuit/ti…

AlphaCephei@alphacep·9 Apr

Interesting tiny 1.6M params TTS engine, based on StyleTTS github.com/tronghieuit/ti…

AlphaCephei retweeted

OpenBMB@OpenBMB·6 Apr

🚀 VoxCPM 2 is live! 🎉 Another open-source AI #TTS model from China, and one that stands shoulder to shoulder with Qwen3-TTS, while bringing everything into a single unified model.
After rapid iterations from V1 (zero-shot cloning) to V1.5 (long-form + fine-tuning), #VoxCPM has consistently pushed quality and usability forward. Now, VoxCPM 2 takes it further:
🔹 30+ languages: truly global, truly local.
🔹 Infinite voice design: type it, hear it, control it. From a whisper to a booming cinematic voice.
🔹 Studio-grade audio: 48kHz ultra-high fidelity with emotional depth
🔹 Diffusion-Autoregressive cloning: preserves more acoustic and emotional detail than token-based models like Qwen3-TTS
💡 Big shoutout to @grok: used your multi-image video magic for our launch demo. It’s scarily good at keeping visuals consistent across shots. Elon @elonmusk, this one’s for you. 😉
Check the demo & start cloning your dream voice:
🌐 Hugging Face Space: huggingface.co/spaces/openbmb…
🤗 Hugging Face Model: huggingface.openbmb.com/model/openbmb/…
🤖 ModelScope Model: modelscope.cn/models/OpenBMB…
💻 GitHub: github.com/OpenBMB/VoxCPM/
#TTS #AI #VoiceCloning #GrokImagine #ElonMusk #OpenBMB #VoxCPM