乌鱼小子 (@mintisan)
Step out of your own comfort zone, then wander into other people's comfort zones for a look.
Shenzhen, China · Joined March 2014

I coded a Speech-to-Text model from scratch.
Here is the blog for the same:
blogs.mayankpratapsingh.in/chapters/speec…
No APIs. No pre-trained models. Just PyTorch, an A100 GPU, and hours of debugging.
This started months ago. I wanted to understand how machines hear. Not surface-level understanding. I wanted to build the whole thing myself.
So I built it piece by piece: autoencoders, VAEs, VQ-VAEs, Residual Vector Quantization, and CTC loss. Each one took days to get right.
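The residual chain mentioned above can be sketched in a few lines of plain Python. This is a toy illustration of the Residual Vector Quantization idea, not the author's actual model; the codebooks and vectors are made up for the example.

```python
# Toy Residual Vector Quantization: each stage quantizes the residual
# left over by the previous stage, so later codes refine the approximation.

def nearest(codebook, v):
    """Index of the codebook entry closest (squared L2) to v."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def rvq_encode(codebooks, v):
    """Return one code per stage plus the running reconstruction."""
    residual = list(v)
    recon = [0.0] * len(v)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        chosen = cb[idx]
        recon = [r + c for r, c in zip(recon, chosen)]
        residual = [a - b for a, b in zip(residual, chosen)]
    return codes, recon

# Two stages: a coarse codebook, then a finer one for the leftover error.
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.2, -0.1]]
codes, recon = rvq_encode([coarse, fine], [1.2, 0.9])
```

The point the example makes: a single small codebook would stop at [1.0, 1.0], while the second stage's code closes the remaining gap.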
Trained for 3 hours on 13,100 audio clips. Got complete garbage. Changed the tokenizer from BPE to character-level. Rechecked everything.
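The character-level tokenizer swapped in above is tiny to sketch. The vocabulary and blank index below are my own illustrative choices, not the ones used in the post's model, though reserving id 0 for the CTC blank is conventional.

```python
# Minimal character-level tokenizer: one id per character, plus a
# reserved blank id (0) of the kind CTC training typically requires.

CHARS = " abcdefghijklmnopqrstuvwxyz'"
BLANK = 0  # CTC blank; real characters start at id 1
stoi = {c: i + 1 for i, c in enumerate(CHARS)}
itos = {i + 1: c for i, c in enumerate(CHARS)}

def encode(text):
    """Map lowercased text to integer ids (characters outside the vocab are dropped)."""
    return [stoi[c] for c in text.lower() if c in stoi]

def decode(ids):
    """Map ids back to text, skipping the blank id."""
    return "".join(itos[i] for i in ids if i != BLANK)
```

Unlike BPE, there is no merge table to learn and no subword boundary for CTC alignments to fight with, which is one reason character vocabularies are a common first choice for small STT experiments.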
Asked @neural_avb, who has built STT models before.
His answer: these models are tricky to train and need days of compute, not hours.
Cut the dataset to 200 clips. After 2 hours, actual words appeared. Overfitted? Absolutely. But watching noise turn into recognizable English was satisfying.
I've written a blog post covering all of this and my process, so you can learn it the same way:
- Audio fundamentals and waveform representation
- Why attention breaks on raw audio
- Convolutional downsampling
- Transformer encoder with positional encoding
- Vector Quantization, straight-through estimator, and RVQ
- CTC loss and greedy decoding
- Full training loop with VQ loss warmup
- What went wrong and what finally worked
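The greedy CTC decoding listed above boils down to two steps: collapse consecutive repeats, then drop blanks. A minimal sketch, assuming blank id 0 and integer ids standing in for characters:

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decode: collapse consecutive repeats, then remove blanks.
    frame_ids is the per-frame argmax over the model's output logits."""
    collapsed = [k for k, _ in groupby(frame_ids)]
    return [i for i in collapsed if i != blank]

# Frames: h h _ e _ l l _ l o  ->  h e l l o
out = ctc_greedy_decode([8, 8, 0, 5, 0, 12, 12, 0, 12, 15])
```

Note the order matters: the blank between the two runs of 12 is what lets a doubled letter like "ll" survive the repeat-collapsing step.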
Resources:
- Blog:
blogs.mayankpratapsingh.in/chapters/speec…
- Code:
github.com/Mayankpratapsi…
More resources:
CTC loss
distill.pub/2017/ctc/
@neural_avb videos
youtube.com/@avb_fj
SoundStream Paper
arxiv.org/abs/2107.03312
LJ speech dataset
keithito.com/LJ-Speech-Data…
wav2vec paper
arxiv.org/abs/2006.11477
RVQ blog
drscotthawley.github.io/blog/posts/202…
Next up: I've already trained two TTS architectures from scratch. Video post about those coming soon. But first, I'm dropping a visual breakdown of Vision Transformers, covering how they work and how to fine-tune them.
Follow me @Mayank_022 if you're into audio deep learning. Repost so others can find this.

The first MiniMax M2.7 benchmarks look great, confirming the good vibes!
Thanks @v2fffvxhyz for sharing this!


Introducing: OpenGranola 🔥
I built an open source meeting copilot for macOS.
It transcribes both sides of your call on-device, searches your own notes in real time, and hands you talking points right when the conversation needs them. No audio leaves your Mac.
Point it at a folder of markdown files, pick any LLM through OpenRouter (Claude, GPT-4o, Gemini, Llama), and it just works. It's invisible to screen share too — nobody knows you have it.
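The "searches your own notes" piece can be as simple as keyword scoring over that markdown folder. A minimal sketch of the idea; the function name and scoring are my own illustration, not OpenGranola's actual retrieval code:

```python
from pathlib import Path

def search_notes(folder, query, top_k=3):
    """Rank markdown files by how often they mention the query's words.
    A toy stand-in for real retrieval over a notes folder."""
    words = set(query.lower().split())
    scored = []
    for path in Path(folder).rglob("*.md"):
        text = path.read_text(encoding="utf-8").lower()
        score = sum(text.count(w) for w in words)
        if score:
            scored.append((score, path.name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]
```

A real copilot would feed the top hits into the LLM prompt alongside the live transcript; keyword counting is just the simplest on-device baseline before reaching for embeddings.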
The whole thing is open source.
Link below

Happy to share 🌍Omnilingual Machine Translation🌍
In this work at @AIatMeta we explore translation systems supporting 1,600+ languages. We show how our models (1B to 8B) can outperform baselines of up to 70B parameters while offering much broader language coverage.
📄:ai.meta.com/research/publi…


Introducing Unsloth Studio ✨
A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: github.com/unslothai/unsl…
Blog and Guide: unsloth.ai/docs/new/studio
Available now on Hugging Face, NVIDIA, Docker and Colab.

I've been using jj github.com/jj-vcs/jj for a while now and wrote a post recommending it. Git is the standard for collaboration, but for working locally alongside AI agents, jj's mental model turns out to be a noticeably better fit: simple, efficient, and non-interrupting, so the prompts you give an agent can finally talk only about the work itself, not the workflow. The post includes real-scenario comparisons and a companion agent skill; feel free to grab it.
onevcat.com/2026/03/jj-for…


I stand corrected: Tencent and Tencent Cloud have both become Featured sponsors, listed right alongside OpenAI.

酱紫表 @pengchujin
Seeing claims that Tencent sponsors OpenClaw: you can indeed find Tencent Cloud among the GitHub sponsors, but that's an ordinary GitHub sponsorship anyone can get for a few dollars or a few dozen dollars, and the list even includes a personal friend of mine from X. Don't read too much into it; only the featured sponsors above actually paid real money.
