Rithesh Kumar

399 posts

@ritheshkumar_

audio @openai

San Francisco, CA · Joined November 2015
575 Following · 924 Followers
Pinned Tweet
Rithesh Kumar
Rithesh Kumar@ritheshkumar_·
✨ Super excited to share our work on neural audio quantizers. It’s especially timely given the interest in AudioLMs, MusicLM and MusicGen! Fully open-sourced training + inference code and model weights with MIT license 🎉 arxiv.org/abs/2306.06546
AK@_akhaliq

High-Fidelity Audio Compression with Improved RVQGAN paper page: huggingface.co/papers/2306.06… Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
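As a rough sanity check on the "~90x" figure quoted above, the arithmetic can be sketched as follows. The abstract does not state the reference format, so 16-bit mono PCM is an assumption here, and the helper names are hypothetical.

```python
# Back-of-the-envelope check of the "~90x" compression claim.
# ASSUMPTION: the uncompressed reference is 16-bit mono PCM; the abstract
# only gives the sample rate (44.1 kHz) and the token bitrate (8 kbps).

SAMPLE_RATE_HZ = 44_100      # from the abstract
BIT_DEPTH = 16               # assumed, not stated in the abstract
CHANNELS = 1                 # assumed, not stated in the abstract
TOKEN_BITRATE_BPS = 8_000    # from the abstract (8 kbps)


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bit_depth * channels


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the raw stream."""
    return raw_bps / coded_bps


raw_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BIT_DEPTH, CHANNELS)
ratio = compression_ratio(raw_bps, TOKEN_BITRATE_BPS)
print(f"raw PCM: {raw_bps / 1000:.1f} kbps, ratio: {ratio:.1f}x")
# prints "raw PCM: 705.6 kbps, ratio: 88.2x"
```

Under these assumptions the ratio comes out at about 88x, which is consistent with the rounded "~90x" in the abstract.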

15
49
252
103.7K
Rithesh Kumar retweeted
Krea
Krea@krea_ai·
if you want to learn about how we trained KREA Flux, we prepared a detailed blog in the link below: krea.ai/blog/flux-krea…
1
19
107
26.2K
Rithesh Kumar retweeted
Mistral AI
Mistral AI@MistralAI·
In our continued commitment to open-science, we are releasing the Voxtral Technical Report: arxiv.org/abs/2507.13264 The report covers details on pre-training, post-training, alignment and evaluations. We also present analysis on selecting the optimal model architecture, which pre-training format to use, and the benefits of DPO.
37
190
1.3K
75.5K
Rithesh Kumar retweeted
Jiaming Song
Jiaming Song@baaadas·
As one of the people who popularized the field of diffusion models, I am excited to share something that might be the “beginning of the end” of it. IMM has a single stable training stage, a single objective, and a single network — all are what make diffusion so popular today.
Luma@LumaLabsAI

Today, we release Inductive Moment Matching (IMM): a new pre-training paradigm breaking the algorithmic ceiling of diffusion models. Higher sample quality. 10x more efficient. Single-stage, single network, stable training. Read more: lumalabs.ai/news/imm

21
103
906
155K
Rithesh Kumar retweeted
Sander Dieleman
Sander Dieleman@sedielem·
Nice paper on the trade-off between decoding quality and modelability in 2-stage generative models. I disagree with this framing though: the trade-off is quite clear from an information-theoretic perspective. Do most people really believe this? Maybe it's time for a blog post🤔
Vivek Ramanujan@RamanujanVivek

Happy to (belatedly) share our recent work introducing Causally Regularized Tokenization 📺, matching LlamaGen-3B generation performance with 0.5x the number of tokens/image (256 vs 576) and 0.25x the number of params (770M vs 3B) on ImageNet. arxiv.org/pdf/2412.16326 1/n

16
15
190
25.5K
Rithesh Kumar retweeted
Justin Salamon
Justin Salamon@justin_salamon·
📢 Audio AI Job opportunity at Adobe! The Sound Design AI Group (SODA) is looking for an exceptional research engineer to join us in building the future of AI-assisted audio and video creation. Strong ML background, GenAI experience a plus. Details: adobe.wd5.myworkdayjobs.com/external_exper…
2
8
35
4.2K
Rithesh Kumar retweeted
Ruiqi Gao
Ruiqi Gao@RuiqiGao·
A common question nowadays: Which is better, diffusion or flow matching? 🤔 Our answer: They’re two sides of the same coin. We wrote a blog post to show how diffusion models and Gaussian flow matching are equivalent. That’s great: It means you can use them interchangeably.
16
199
945
172.6K
Rithesh Kumar retweeted
Ziyang Chen
Ziyang Chen@CzyangChen·
🎥 Introducing MultiFoley, a video-aware audio generation method with multimodal controls! 🔊 We can ⌨️Make a typewriter sound like a piano 🎹 🐱Make a cat meow like a lion roars! 🦁 ⏱️Perfectly time existing SFX 💥 to a video
11
41
213
41.9K
Rithesh Kumar retweeted
Scott H. Hawley
Scott H. Hawley@drscotthawley·
New tutorial! I spent 3 weeks realizing flow-matching/rectified flows can be viewed in a simple way that end-runs the usual pages of math: "Basic physics provides a 'straight, fast' way to get up to speed with flow-based generative models" Colab included! drscotthawley.github.io/blog/posts/Flo…
15
70
449
52.7K
Rithesh Kumar retweeted
Justin Salamon
Justin Salamon@justin_salamon·
What a thrill to present on the big stage! So excited to reveal our Sound Effects GenAI tech in #ProjectSuperSonic #AdobeMAX Text-to-SFX and *VOICE*-to-SFX for expressive control! Huge kudos to @urinieto @pseetharaman @hugggof and our collaborators in design & prototyping!
scott belsky@scottbelsky

using your voice as an “audio sketch” to generate sound effects, part of the #ProjectSuperSonic sneak from our labs.

6
8
75
8.5K
Rithesh Kumar retweeted
Justin Salamon
Justin Salamon@justin_salamon·
Adobe just announced Generative Extend for Premiere Pro (beta) at #AdobeMAX! Use GenAI to extend your video clip *including the audio* @pseetharaman @urinieto and me in the Sound Design AI Group at @AdobeResearch worked on the audio part and we're so excited to see it go out!
5
18
86
9.4K
Rithesh Kumar retweeted
Jordi Pons
Jordi Pons@jordiponsdotme·
ICML in Vienna is coming to a close! 🇦🇹 Here are the top-10 general (and audio) trends from ICML 2024. A thread 🧵 1. Open vs. Closed AI: The debate was very present, notable in @soumithchintala's keynote or by the release of Llama 3.1 (among others). icml.cc/virtual/2024/p…
1
7
37
2.1K
Rithesh Kumar
Rithesh Kumar@ritheshkumar_·
@RafaelValleArt This is awesome! The CUDA code for the alias-free activation is a huge contribution since this won't blow up the memory 2x. Been waiting for this since BigVGAN originally came out. Thank you :)
0
0
3
274
Rafael Valle
Rafael Valle@RafaelValleArt·
Do you work on audio synthesis and need state-of-the-art vocoders? BigVGAN v2 is out! BigVGAN v2 is state of the art in quality, faster, and has commercial-friendly checkpoints at 44, 24 and 22 kHz! By the way, it again tops the vocoding leaderboard! paperswithcode.com/sota/speech-sy…
2
26
93
8.9K
Rithesh Kumar
Rithesh Kumar@ritheshkumar_·
@erogol Do you observe a particular difference compared to training with standard diffusion objectives, like EDM or v-prediction?
1
0
0
192
erogol
erogol@erogol·
I'm playing with flow-matching models and I started to hate the time I wasted with GANs. FM trains way faster..
4
0
19
1.6K
Rithesh Kumar retweeted
Jason Weston
Jason Weston@jaseweston·
🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. - CoPE solves counting & copy tasks that standard transformers cannot. - Better PPL on language modeling + coding tasks. arxiv.org/abs/2405.18719 🧵(1/5)
1
290
1.7K
1.5M
Rithesh Kumar retweeted
David Braun (dbraun.bsky.social)
Happy to release "DAC-JAX: A JAX Implementation of the Descript Audio Codec." This can reuse PyTorch weights of all model sizes, and it includes a device-parallel training script. It uses the standard JAX libraries: Flax, Optax, Orbax, and CLU. github.com/DBraun/DAC-JAX
1
6
29
2.1K
Rithesh Kumar retweeted
Jing Yu Koh
Jing Yu Koh@kohjingyu·
Last week when presenting Parti (parti.research.google) at ICLR, I explained at least 20 times how I felt about autoregressive text-to-image generation models vs. diffusion models. So this is my take:
The major benefit of autoregressive image generation models is that they just predict image tokens, which makes it super easy to integrate into your LLM pretraining stack. Tokens in, tokens out: everything becomes just seq2seq! This also works for audio (google-research.github.io/seanet/audiolm…) input and output. All modalities become sequences of discrete tokens, so it's easy to train once you learn the first stage to quantize image/audio.
For frontier companies like Google/OpenAI, this is advantageous from a systems perspective, because existing infrastructure is often already hyperoptimized for training transformer models on next-token prediction. In my experience, training these models is also a lot more stable than diffusion models or GANs.
Another major benefit is that your model is a standard transformer, and you can use all the wonderful LLM bag-of-tricks that other people have developed: FlashAttention, speculative decoding, and other MLsys goodies.
One thing that others have pointed out is that these models seem really good at text rendering (if GPT-4o is such a model). This makes a lot of intuitive sense since it's generating discrete patches one by one (which can be thought of as patches of individual characters). I was also impressed by this ability of Parti to render text well in 2022.
So why doesn't everyone train such a model? One major downside is that you need to learn some kind of VQ-VAE (arxiv.org/abs/1711.00937) to compress images/audio into discrete tokens. This means that your overall generation quality is upper bounded by how good this quantizer is. If you mess up this first stage, it can be very hard to generate high quality images even if your second (transformer) stage is strong.
Another downside for training VLMs this way is that you potentially use much more compute by being a multi-modal model from the beginning (as opposed to training a LM on text only data, a vision encoder on images, and stapling them together at the end with some multimodal data).
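To make the "first stage" in the take above concrete, here is a minimal sketch of the nearest-neighbor codebook lookup at the heart of vector quantization. It is illustrative only: the codebook is a hand-written toy rather than a learned one, the function names are hypothetical, and a real VQ-VAE also trains an encoder and decoder around this step.

```python
# Toy vector quantization: map each continuous latent vector to the index
# of its nearest codebook entry (the "token"), then map tokens back.
# ASSUMPTION: the codebook is fixed and hand-written for illustration;
# in a VQ-VAE it is learned jointly with an encoder and a decoder.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Return, for each vector, the index of the closest codebook entry."""
    tokens = []
    for v in vectors:
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(v, codebook[i]))
        tokens.append(nearest)
    return tokens


def dequantize(tokens, codebook):
    """Replace each token with its codebook vector (lossy reconstruction)."""
    return [codebook[t] for t in tokens]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, codebook)
reconstruction = dequantize(tokens, codebook)
print(tokens)          # prints [0, 3, 2]
print(reconstruction)  # prints [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The reconstruction differs from the input latents, which is the quality ceiling described above: whatever the second-stage transformer predicts, it can only select among what the codebook can represent.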
Greg Brockman@gdb

A GPT-4o generated image — so much to explore with GPT-4o's image generation capabilities alone. Team is working hard to bring those to the world.

25
116
802
227.2K
Karan Thakkar
Karan Thakkar@Carankt·
@ritheshkumar_ Amazing work, excited to read more about the details of the model! Any technical reports on this?
1
0
0
69
Rithesh Kumar
Rithesh Kumar@ritheshkumar_·
Super happy to share this preview of what we've been building with the Speech AI team @ Adobe! Please reach out if you're interested in building large-scale audio models like this.
Adobe@Adobe

What if you could easily translate dialogue into different languages? Check out this sneak preview from Adobe Research. 👀 Dubbing & Lip Sync explores how to quickly translate videos. As with all Adobe AI, when we do release this we will do so thoughtfully and responsibly.

3
1
35
3.2K