

Satvik Dixit
@SatvikDixit9
Audio understanding and generation | Prev @CarnegieMellon @IITDelhi



we show for the first time ever that sub-billion audio models can reason. we introduce mellow, a small audio-language model (167M) that gets SoTA on different audio reasoning tasks. by using our method and data, you can train an alm within 24 hrs on academic resources (1/n 🧵)

Congratulations to Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi on winning the IEEE Best Paper Award for "SoundStream: An End-to-End Neural Audio Codec"! arxiv.org/abs/2107.03312 #SPSAwards #IEEEAwards

Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images — naturally, and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to voice-based access to visual information. Try it out 👉 vis.moshi.chat Blog post 👉 kyutai.org/moshivis


Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧. Hibiki produces spoken and text translations of the input speech in real time, preserving the speaker's voice and adapting its pace to the semantic content of the source speech. In objective and human evaluations, Hibiki outperforms previous systems in quality, naturalness, and speaker similarity, and approaches human interpreters. 🧵

📢Join us at @ieeeICASSP 2025 for a workshop on all aspects of Speech and Audio Language Models, including synthetic data, training methods, evaluation metrics, and benchmarks. We'd love to see your work! 🎤📚 Submission deadline: November 1st, 2024 (salmaworkshop.github.io)

