Rithesh Kumar

399 posts

@ritheshkumar_

audio @openai

San Francisco, CA · Joined November 2015

575 Following · 924 Followers

Pinned Tweet

Rithesh Kumar@ritheshkumar_·Jun 13

✨ Super excited to share our work on neural audio quantizers. It’s especially very timely considering the interest in AudioLMs, MusicLM and MusicGen! Fully open sourced training + inference code and model weights with MIT license 🎉 arxiv.org/abs/2306.06546

AK@_akhaliq

High-Fidelity Audio Compression with Improved RVQGAN paper page: huggingface.co/papers/2306.06… Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
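
The "~90x" figure in the abstract above can be sanity-checked with simple bitrate arithmetic. This is only a rough sketch: the abstract does not state the uncompressed reference format, so 16-bit mono PCM is an assumption here, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "~90x compression" claim.
# ASSUMPTION (not stated in the abstract): the uncompressed reference
# is 16-bit, single-channel PCM audio.
SAMPLE_RATE_HZ = 44_100    # "44.1 KHz" from the abstract
BITS_PER_SAMPLE = 16       # assumed
CHANNELS = 1               # assumed
CODEC_BITRATE_BPS = 8_000  # "8kbps" from the abstract


def compression_ratio(sample_rate_hz, bits_per_sample, channels, codec_bitrate_bps):
    """Ratio of raw PCM bitrate to codec bitrate."""
    raw_bitrate_bps = sample_rate_hz * bits_per_sample * channels
    return raw_bitrate_bps / codec_bitrate_bps


raw_bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
ratio = compression_ratio(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS, CODEC_BITRATE_BPS)
print(f"raw: {raw_bitrate_bps / 1000:.1f} kbps, ratio: {ratio:.1f}x")
```

Under those assumptions the raw stream is 705.6 kbps, which gives a ratio of about 88x, consistent with the rounded "~90x".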

252

103.7K

Rithesh Kumar retweeted

Krea@krea_ai·Jul 31

if you want to learn about how we trained KREA Flux, we prepared a detailed blog in the link below: krea.ai/blog/flux-krea…

107

26.2K

Rithesh Kumar retweeted

Mistral AI@MistralAI·Jul 23

In our continued commitment to open-science, we are releasing the Voxtral Technical Report: arxiv.org/abs/2507.13264 The report covers details on pre-training, post-training, alignment and evaluations. We also present analysis on selecting the optimal model architecture, which pre-training format to use, and the benefits of DPO.

190

1.3K

75.5K

Rithesh Kumar retweeted

Jiaming Song@baaadas·Mar 11

As one of the people who popularized the field of diffusion models, I am excited to share something that might be the “beginning of the end” of it. IMM has a single stable training stage, a single objective, and a single network — all are what make diffusion so popular today.

Luma@LumaLabsAI

Today, we release Inductive Moment Matching (IMM): a new pre-training paradigm breaking the algorithmic ceiling of diffusion models. Higher sample quality. 10x more efficient. Single-stage, single network, stable training. Read more: lumalabs.ai/news/imm

103

906

155K

Rithesh Kumar retweeted

Sander Dieleman@sedielem·Jan 25

Nice paper on the trade-off between decoding quality and modelability in 2-stage generative models. I disagree with this framing though: the trade-off is quite clear from an information-theoretic perspective. Do most people really believe this? Maybe it's time for a blog post🤔

Vivek Ramanujan@RamanujanVivek

Happy to (belatedly) share our recent work introducing Causally Regularized Tokenization 📺, matching LlamaGen-3B generation performance with 0.5x the number of tokens/image (256 vs 576) and 0.25x the number of params (770M vs 3B) on ImageNet. arxiv.org/pdf/2412.16326 1/n

190

25.5K

Rithesh Kumar retweeted

Justin Salamon@justin_salamon·Dec 9

📢 Audio AI Job opportunity at Adobe! The Sound Design AI Group (SODA) is looking for an exceptional research engineer to join us in building the future of AI-assisted audio and video creation. Strong ML background, GenAI experience a plus. Details: adobe.wd5.myworkdayjobs.com/external_exper…

4.2K

Rithesh Kumar retweeted

Ruiqi Gao@RuiqiGao·Dec 2

A common question nowadays: Which is better, diffusion or flow matching? 🤔 Our answer: They’re two sides of the same coin. We wrote a blog post to show how diffusion models and Gaussian flow matching are equivalent. That’s great: It means you can use them interchangeably.

199

945

172.6K

Rithesh Kumar retweeted

Ziyang Chen@CzyangChen·Nov 27

🎥 Introducing MultiFoley, a video-aware audio generation method with multimodal controls! 🔊 We can ⌨️Make a typewriter sound like a piano 🎹 🐱Make a cat meow like a lion roars! 🦁 ⏱️Perfectly time existing SFX 💥 to a video

213

41.9K

Rithesh Kumar retweeted

Scott H. Hawley@drscotthawley·Nov 13

New tutorial! I spent 3 weeks realizing flow-matching/rectified flows can be viewed in a simple way that end-runs the usual pages of math: "Basic physics provides a 'straight, fast' way to get up to speed with flow-based generative models" Colab included! drscotthawley.github.io/blog/posts/Flo…

449

52.7K

Rithesh Kumar retweeted

Emiel Hoogeboom@emiel_hoogeboom·Oct 28

Is pixel diffusion passé? In 'Simpler Diffusion' (arxiv.org/abs/2410.19324) , we achieve 1.5 FID on ImageNet512, and SOTA on 128x128 and 256x256. We ablated out a lot of complexity, making it truly 'simpler'. w/ @tejmensink @JonathanHeek @KayLamerigts @RuiqiGao @TimSalimans

367

54.5K

Rithesh Kumar retweeted

Justin Salamon@justin_salamon·Oct 16

What a thrill to present on the big stage! So excited to reveal our Sound Effects GenAI tech in #ProjectSuperSonic #AdobeMAX Text-to-SFX and *VOICE*-to-SFX for expressive control! Huge kudos to @urinieto @pseetharaman @hugggof and our collaborators in design & prototyping!

scott belsky@scottbelsky

using your voice as an “audio sketch” to generate sound effects, part of the #ProjectSuperSonic sneak from our labs.

8.5K

Rithesh Kumar retweeted

Justin Salamon@justin_salamon·Oct 14

Adobe just announced Generative Extend for Premiere Pro (beta) at #AdobeMAX! Use GenAI to extend your video clip *including the audio* @pseetharaman @urinieto and me in the Sound Design AI Group at @AdobeResearch worked on the audio part and we're so excited to see it go out!

9.4K

Rithesh Kumar retweeted

Zachary Novack@zacknovack·Oct 8

Ultra-fast text-to-music generation w/o degrading quality? Introducing Presto! Distilling Steps and Layers for Accelerating Music Generation 🎹: buff.ly/4dC3rpl 📖: buff.ly/3TZBiBU w/@__gzhu__ @CasebeerJonah @BergKirkpatrick @McAuleyLabUCSD @NicholasJBryan 🧵

17.7K

Rithesh Kumar retweeted

Jordi Pons@jordiponsdotme·Jul 26

ICML in Vienna is coming to a close! 🇦🇹 Here are the top-10 general (and audio) trends from ICML 2024. A thread 🧵 1. Open vs. Closed AI: The debate was very present, notable in @soumithchintala's keynote or by the release of Llama 3.1 (among others). icml.cc/virtual/2024/p…

2.1K

Rithesh Kumar@ritheshkumar_·Jul 10

@RafaelValleArt This is awesome! The cuda code for the alias-free activation is a huge contribution since this won't blow up the memory 2x. Been waiting for this since BigVGAN originally came out. Thank you :)

274

Rafael Valle@RafaelValleArt·Jul 10

Do you work on audio synthesis and need state of the art vocoders? BigVGAN v2 is out! BigVGAN v2 is the state-of-the-art in quality, faster and has commercial friendly checkpoints in 44, 24 and 22khz! By the way, it tops again the vocoding leaderboard! paperswithcode.com/sota/speech-sy…

8.9K

Rithesh Kumar@ritheshkumar_·Jun 17

@erogol Do you particularly observe a difference compared to training standard diffusion objectives? like EDM or v-prediction

192

erogol@erogol·Jun 17

I'm playing with flow-matching models and I started to hate the time I wasted with GANs. FM trains way faster..

1.6K

Rithesh Kumar retweeted

Jason Weston@jaseweston·May 30

🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. - CoPE solves counting & copy tasks that standard transformers cannot. - Better PPL on language modeling + coding tasks. arxiv.org/abs/2405.18719 🧵(1/5)

290

1.7K

1.5M

Rithesh Kumar retweeted

David Braun (dbraun.bsky.social)@DoItRealTime·May 19

Happy to release "DAC-JAX: A JAX Implementation of the Descript Audio Codec." This can reuse PyTorch weights of all model sizes, and it includes a device-parallel training script. It uses the standard JAX libraries: Flax, Optax, Orbax, and CLU. github.com/DBraun/DAC-JAX

2.1K

Rithesh Kumar retweeted

Jing Yu Koh@kohjingyu·May 16

Last week when presenting Parti (parti.research.google) at ICLR, I explained at least 20 times how I felt about autoregressive text-to-image generation models vs. diffusion models. So this is my take: The major benefit of autoregressive image generation models is that they just predict image tokens, which makes it super easy to integrate into your LLM pretraining stack. Tokens in, tokens out: everything becomes just seq2seq! This also works for audio (google-research.github.io/seanet/audiolm…) input and output. All modalities become sequences of discrete tokens, so it's easy to train once you learn the first stage to quantize image/audio. For frontier companies like Google/OpenAI, this is advantageous from a systems perspective, because existing infrastructure is often already hyperoptimized for training transformer models on next-token prediction. In my experience, training these models is also a lot more stable than diffusion models or GANs. Another major benefit is that your model is a standard transformer, and you can use all the wonderful LLM bag-of-tricks that other people have developed: FlashAttention, speculative decoding, and other MLsys goodies. One thing that others have pointed out is that these models seem really good at text rendering (if GPT-4o is such a model). This makes a lot of intuitive sense since it's generating discrete patches one by one (which can be thought of as patches of individual characters). I was also impressed by this ability of Parti to render text well in 2022. So why doesn't everyone train such a model? One major downside is that you need to learn some kind of VQ-VAE (arxiv.org/abs/1711.00937) to compress images/audio into discrete tokens. This means that your overall generation quality is upper bounded by how good this quantizer is. If you mess up this first stage, it can be very hard to generate high quality images even if your second (transformer) stage is strong. 
Another downside for training VLMs this way is that you potentially use much more compute by being a multi-modal model from the beginning (as opposed to training a LM on text only data, a vision encoder on images, and stapling them together at the end with some multimodal data).
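
The first-stage quantization described in the thread above can be illustrated with a toy nearest-neighbour codebook lookup. This is a minimal sketch, not the tokenizer of Parti, GPT-4o, or any specific model: the codebook values, the latent vectors, and the function names are all made up for illustration, and a real VQ-VAE learns its encoder and codebook jointly.

```python
# Toy vector quantization: continuous latents in, discrete tokens out.
# ASSUMPTION: the codebook and latents below are hand-picked, not learned.

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Map token indices back to their codebook vectors (lossy reconstruction)."""
    return [codebook[t] for t in tokens]


codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]              # hypothetical

tokens = quantize(latents, codebook)   # the sequence a transformer would model
recon = dequantize(tokens, codebook)   # what a decoder would start from
print(tokens, recon)
```

The reconstruction never exactly matches the original latents, which is the sense in which the quantizer caps overall generation quality: the second-stage model can at best predict the right tokens.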

Greg Brockman@gdb

A GPT-4o generated image — so much to explore with GPT-4o's image generation capabilities alone. Team is working hard to bring those to the world.

116

802

227.2K

Rithesh Kumar retweeted

Ge Zhu@__gzhu__·Mar 18

MusicHiFi: Fast High-Fidelity Stereo Vocoding. Fast, high-fidelity stereophonic vocoding for music generation. 📝: arxiv.org/abs/2403.10493 🎵: musichifi.github.io/web/ w/ @j_p_caceres @ZhiyaoDuan @NicholasJBryan

24.2K

Rithesh Kumar@ritheshkumar_·Mar 15

@Carankt Not yet, but hopefully soon :)

Karan Thakkar@Carankt·Mar 15

@ritheshkumar_ Amazing work, excited to read more about details of the model! Any technical reports on this ?

Rithesh Kumar@ritheshkumar_·Mar 14

Super happy to share this preview into what we've been building with the Speech AI team @ Adobe! Please reach out if you're interested in building large-scale audio models like this..

Adobe@Adobe

What if you could easily translate dialogue into different languages? Check out this sneak preview from Adobe Research. 👀 Dubbing & Lip Sync explores how to quickly translate videos. As with all Adobe AI, when we do release this we will do so thoughtfully and responsibly.

3.2K
