.@KugelAudio builds multilingual voice AI you can run in your own Kubernetes cluster. It handles 30+ languages and dialects naturally, including phone numbers, emails, and mixed-language text, all fully on-prem.
ycombinator.com/launches/QXA-k…
KugelAudio is launching today!
Excited to share our first frontier voice model with the world: real-time multilingual TTS. Super low latency and over 30 languages. And you can even clone your voice. Try it yourself 👇
Hyper (@heyhyperai) is building the self-driving brain for companies.
Hyper's agents synthesize millions of emails, docs, and Slack messages and make everyone's AI tools instantly smarter, with no manual work from anyone.
Congrats on the launch, @kanyesthaker and @_shalinshah_!
heyhyper.ai
Launch Week Day 1. Today, Consent.io becomes inth.com.
We've evolved from just a cookie consent banner.
We're Inth: consent and privacy infrastructure, built for performance and full developer control.
We're not doing this alone. We've raised a $1.2M pre-seed and joined @ycombinator's YC P26.
Over the last year, @c15tdev has reached 1.7k GitHub stars and 1.6M+ downloads.
This week, we're announcing the first steps of what Inth is building.
ICYMI: KugelAudio is an open source TTS model that should get way more attention
> fine-tuned from VibeVoice 7B
> trained on 200K hours across 23 languages
> state-of-the-art performance 🔥
We finally have a 7B-parameter Transformer for Text-to-Speech. 📉
**KugelAudio-0-Open** just dropped, and the architecture is fascinating.
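Many recent TTS systems (like F5-TTS) are non-autoregressive diffusion or flow-matching models.
KugelAudio takes a "Hybrid" approach that pairs an autoregressive LLM backbone with a diffusion stage and leans heavily on the LLM's reasoning.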
**The Engineering Stack:**
1. **The Brain (Qwen2.5-7B):** 🧠
Instead of a tiny text encoder, it uses a full 7B LLM (Qwen2.5) to process the input.
*Why it matters:* It understands that "The wind needs to *wind* down" uses two different pronunciations of "wind" based on semantic context.
2. **The Voice (VibeVoice Base):**
It builds on Microsoft's VibeVoice architecture (AR + Diffusion).
It predicts semantic latents first (what to say), then uses diffusion to refine the acoustic details (how to say it).
3. **Voice Cloning (Zero-Shot):**
You can feed it a 10-second reference clip (e.g., "Angry Captain"), and because of the Semantic Encoder, it clones not just the timbre but also the speaking style: the *vibe*.
4. **The Cost:**
It needs **~19GB VRAM** (FP16).
This is strictly for the 24 GB RTX 3090/4090 crowd or A100 server deployments. It is not a "run on your laptop" model (yet).
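A quick back-of-the-envelope check on that VRAM figure. This is a rough sketch, not a measurement: it assumes "7B" means exactly 7 billion parameters (the real count is probably somewhat higher once the tokenizers and diffusion head are included), and the helper names `weight_memory_gb` and `fits_on_gpu` are hypothetical, made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_gpu(gpu_vram_gb: float, required_gb: float) -> bool:
    """True if the required memory fits within the card's VRAM (ignores driver overhead)."""
    return required_gb <= gpu_vram_gb


PARAMS = 7e9            # assumption: "7B" taken as exactly 7 billion parameters
FP16_BYTES = 2          # FP16 stores each parameter in 2 bytes
REPORTED_VRAM_GB = 19   # the ~19GB figure quoted in the post

weights_gb = weight_memory_gb(PARAMS, FP16_BYTES)
remainder_gb = REPORTED_VRAM_GB - weights_gb

print(f"FP16 weights alone: {weights_gb:.1f} GB")
print(f"Left for everything else: {remainder_gb:.1f} GB")
print(f"Fits on a 24 GB card: {fits_on_gpu(24, REPORTED_VRAM_GB)}")
print(f"Fits on a 16 GB card: {fits_on_gpu(16, REPORTED_VRAM_GB)}")
```

Under these assumptions the weights account for about 14 GB, and the remaining ~5 GB would plausibly go to activations, the KV cache, and the non-LLM components. That split is a guess, not something the release notes confirm.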
**Benchmarks:**
It claims state-of-the-art performance on European languages (German, French, Spanish, Polish), specifically outperforming commercial APIs in blind preference tests.
**GitHub:** lnkd.in/gM5-TPUf
**Weights:** lnkd.in/gtRPDVNR
➕ **Follow me, Pasha S,** for more Engineering Deep Dives.