

Satvik Dixit
@SatvikDixit9
Audio understanding and generation | Prev @CarnegieMellon @IITDelhi



we show for the first time ever that sub-billion audio models can reason. we introduce mellow, a small audio-language model (167M) that gets SoTA on different audio reasoning tasks. by using our method and data, you can train an alm within 24 hrs on academic resources (1/n 🧵)

Congratulations to Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi on winning the IEEE Best Paper Award for "SoundStream: An End-to-End Neural Audio Codec"! arxiv.org/abs/2107.03312 #SPSAwards #IEEEAwards

Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images — naturally, and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to voice-based access to visual information. Try it out 👉 vis.moshi.chat Blog post 👉 kyutai.org/moshivis


Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧. Hibiki produces spoken and text translations of the input speech in real time, preserving the speaker's voice and adapting its pace to the semantic content of the source speech. In objective and human evaluations, Hibiki outperforms previous systems in quality, naturalness, and speaker similarity, and approaches human interpreters. 🧵

📢Join us at @ieeeICASSP 2025 for a workshop on all aspects of Speech and Audio Language Models, including synthetic data, training methods, evaluation metrics, and benchmarks. We'd love to see your work! 🎤📚 Submission deadline: November 1st, 2024 (salmaworkshop.github.io)

