Karn Tiwari

106 posts

@TiwariKarn

ML Researcher

India · Joined December 2019

2.4K Following · 76 Followers

Karn Tiwari retweeted

Alec Helbling@alec_helbling·17 Jan

Flow-based generative models trained with flow matching tend to learn curved trajectories, which are challenging to approximate in a few steps. Rectified flows aim to learn straight trajectories, which are easier to simulate with less computation.
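A toy illustration of why straight trajectories are cheaper to simulate. This is a minimal sketch under stated assumptions, not code from the post or from any rectified-flow implementation: the two hand-written velocity fields and the helpers `euler_integrate` and `endpoint_error` are hypothetical stand-ins for a learned model and its sampler. Both fields carry the point (1, 0) to (0, 1) over unit time, one along a straight line and one along a quarter circle.

```python
import math


def euler_integrate(velocity, start, n_steps):
    """Integrate dx/dt = velocity(x) from t=0 to t=1 using n_steps Euler steps."""
    x, y = start
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        vx, vy = velocity((x, y))
        x, y = x + dt * vx, y + dt * vy
    return (x, y)


def straight_velocity(_point):
    # Constant velocity: the path from (1, 0) to (0, 1) is a straight line.
    return (-1.0, 1.0)


def curved_velocity(point):
    # Rotation field: the exact path from (1, 0) to (0, 1) is a quarter circle.
    x, y = point
    return (-0.5 * math.pi * y, 0.5 * math.pi * x)


def endpoint_error(velocity, n_steps, start=(1.0, 0.0), target=(0.0, 1.0)):
    """Distance between the Euler endpoint and the true endpoint."""
    x, y = euler_integrate(velocity, start, n_steps)
    return math.hypot(x - target[0], y - target[1])
```

With a constant velocity a single Euler step already lands on the target, while the curved field needs many small steps to get close; that gap is the computational saving the post refers to.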

Karn Tiwari retweeted

Grad@Grad62304977·7 Nov

43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed. Value residual learning (recently proposed in arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper proposes setting `v = 0.5 * v + 0.5 * v1` in each block, where `v1` is the value computed in the first block. When I implemented this method, it reduced the loss of the previous record, run for 3200 steps (the final duration of the new record), from 3.327 to 3.315. I then found that making the coefficients learnable gives further improvements. Specifically, I changed the formula to `v = lambda * v + (1 - lambda) * v1`, where `lambda` is a learnable parameter. With that change, the loss drops further from 3.315 to 3.3065, which corresponds to a 1.7x greater performance improvement than vanilla value residual. The attached graph shows that the later blocks have a smaller lambda, which means they make greater use of the shortcut to see v1. After I got this result, @kellerjordan0 also implemented a similar shortcut allowing later blocks to directly access the embedding, which resulted in further gains. I have also tried adding data dependence to lambda. This seems to have a big benefit only when training for a longer token duration (actually the ideal situation), but it is a definite area for potential improvement. Excited to see people try this in their models and see if this works at scale and across different setups!
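The two mixing rules quoted in the post can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the NanoGPT code: `mix_values` is a hypothetical helper, and in a real model `lam` would be a trainable per-block scalar updated by the optimizer instead of a fixed float. The last lines recompute the "1.7x" figure from the three loss values stated in the post.

```python
def mix_values(v, v1, lam):
    """Blend a block's own values v with the first block's values v1.

    lam = 0.5 reproduces the fixed rule v = 0.5 * v + 0.5 * v1;
    any other lam stands in for the learnable variant.
    """
    return [lam * a + (1.0 - lam) * b for a, b in zip(v, v1)]


# Toy value vectors (illustrative numbers only).
v_block = [2.0, 4.0]
v_first = [0.0, 0.0]

fixed = mix_values(v_block, v_first, 0.5)      # paper's fixed coefficients
late_block = mix_values(v_block, v_first, 0.2)  # smaller lam leans more on v1

# Loss values quoted in the post.
baseline_loss = 3.327
fixed_mix_loss = 3.315
learned_mix_loss = 3.3065

# Improvement of the learnable variant relative to the fixed variant.
improvement_ratio = (baseline_loss - learned_mix_loss) / (baseline_loss - fixed_mix_loss)
```

A smaller `lam` pulls the result toward `v1`, which matches the post's reading that later blocks with small lambda rely more on the shortcut.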

Karn Tiwari retweeted

Bidipta Sarkar@bidiptas13·21 Nov

Introducing 🥚EGGROLL 🥚(Evolution Guided General Optimization via Low-rank Learning)! 🚀 Scaling backprop-free Evolution Strategies (ES) for billion-parameter models at large population sizes ⚡100x Training Throughput 🎯Fast Convergence 🔢Pure Int8 Pretraining of RNN LLMs

Karn Tiwari retweeted

Thinking Machines@thinkymachines·29 Sep

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

Karn Tiwari retweeted

Ricardo Buitrago@rbuit_·7 Jul

Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix which enables length generalization to sequences of up to 256k, with no need to change the architectures!

Karn Tiwari retweeted

Rain@rainnekoneko·17 Jun

We're working on a new LM architecture that does not use any variant of multi-head attention or recurrence, and it works well with long context lengths. We're calling it "Avey". Everything is open-sourced under an Apache-2.0 license. Paper: arxiv.org/abs/2506.11305 Demo Models: huggingface.co/collections/av… GitHub: github.avey.ai/avey-dpa (feel free to dm me with questions, suggestions, etc o7) Work on the next version of the architecture with many improvements is already underway. The currently released models are only pre-trained for 100 billion tokens, but there are plans to train larger LLMs using this architecture on a much larger dataset in the near future. Here's a demo of Avey 1.5B generating a completion with an input of 45K tokens on my 4060 laptop while using less than 4GB of VRAM (with bf16).

Karn Tiwari retweeted

Ruben Hassid@rubenhassid·7 Jun

BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)

Karn Tiwari retweeted

tyler hogge@thogge·6 Jun

one of Steve Jobs' best lines. simplicity is so hard.

Karn Tiwari retweeted

Yoni Slutzky@YoniSlutzky·6 Jun

Do neural nets really need gradient descent to generalize?🚨 We dive into matrix factorization and find a sharp split: wide nets rely on GD, while deep nets can thrive with any low-training-error weights! arxiv.org/abs/2506.03931 🧵

Karn Tiwari retweeted

Lucky Iyinbor@Luckyballa·29 May

New day, new paper: «e𝑓unc: An Efficient Function Representation Without Neural Networks». It looks math-heavy, but it’s not that complicated. The core idea is to swap an MLP for a bunch of small analytic patches, store them in a grid, and then perform global weighted interpolation. Every grid vertex (32³) stores a polynomial, each with 13 learnable parameters, then all results are RBF-blended and soft-maxed. That’s about it. They use “query,” “key,” and “value” terminology, but it’s not like in attention, which is quite confusing. Query: the point we sample; key: the grid-vertex position; value: the polynomial function. There are some similarities with attention: it also has quadratic complexity, but it can be optimized with some smart parallel reduction techniques. Additionally, you can give each vertex an optional offset to better align it with the surface, which costs a few extra floats per vertex but improves the accuracy. Overall, it has better compression than a global MLP or NGP, but it’s slower than NGP to train and evaluate. arxiv.org/pdf/2505.21319…
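The query/key/value description above can be made concrete with a one-dimensional toy. This is a rough sketch of the general idea as summarized in the post, not the paper's formulation: the 1D grid, the polynomial degree, the `sharpness` parameter, and the function names `eval_poly`, `blend_weights`, and `evaluate` are all assumptions made for illustration. Each vertex (key) owns a small polynomial (value), and a sample point (query) is answered by a softmax-normalized, RBF-weighted blend of every patch.

```python
import math


def eval_poly(coeffs, dx):
    """Evaluate a local polynomial sum(c_k * dx**k) at offset dx from its vertex."""
    return sum(c * dx ** k for k, c in enumerate(coeffs))


def blend_weights(query, keys, sharpness=10.0):
    """Softmax over negative squared distances, i.e. normalized Gaussian RBF weights."""
    logits = [-sharpness * (query - key) ** 2 for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(query, keys, values, sharpness=10.0):
    """Blend every patch's prediction at the query point.

    keys:   vertex positions on the grid
    values: one coefficient list per vertex (the local polynomial)
    """
    weights = blend_weights(query, keys, sharpness)
    return sum(
        w * eval_poly(coeffs, query - key)
        for w, key, coeffs in zip(weights, keys, values)
    )


# Three vertices whose patches all describe the same line f(x) = 3x + 1,
# written in local coordinates around each vertex.
grid_keys = [0.0, 0.5, 1.0]
grid_values = [[3.0 * key + 1.0, 3.0] for key in grid_keys]
```

Because every query visits every vertex, the cost scales with queries times vertices, which is the quadratic behaviour the post compares to attention.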

Karn Tiwari@TiwariKarn·22 May

Honored to be recognized as a notable reviewer for ICLR 2025! 🎉 Huge thanks to the organizers and the entire community for the opportunity to contribute to such an impactful conference. #ICLR2025 #PeerReview #MachineLearning

ICLR@iclr_conf

2025 had a record breaking number of submissions, and we are grateful for the contributions of all reviewers & ACs. We wish to acknowledge the following notable reviewers who went above and beyond, reviewing 4 or more papers. Thank you for your service. iclr.cc/Conferences/20…

Karn Tiwari retweeted

Sakana AI@SakanaAILabs·12 May

Introducing Continuous Thought Machines. New Blog: sakana.ai/ctm/ Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence. Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations. Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity. Learn more about our approach. Interactive Report: pub.sakana.ai/ctm/ Full Paper: arxiv.org/abs/2505.05522 GitHub: github.com/SakanaAI/conti…
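The two ingredients named in the announcement can be sketched in a heavily simplified form. This is an illustrative toy under stated assumptions, not Sakana AI's implementation: here each neuron's private model is just a weight vector over a short input window followed by `tanh`, and synchronization is taken to be the inner product between two neurons' activation traces over time. The names `neuron_update`, `synchronization`, and `run`, and the zero-padding at the start of the history, are hypothetical choices.

```python
import math


def neuron_update(history, weights, bias):
    """One neuron's private model: weighted sum over its recent inputs, then tanh."""
    return math.tanh(sum(w * h for w, h in zip(weights, history)) + bias)


def synchronization(traces):
    """Pairwise inner products between neurons' activation traces over time."""
    n = len(traces)
    return [
        [sum(a * b for a, b in zip(traces[i], traces[j])) for j in range(n)]
        for i in range(n)
    ]


def run(inputs, neuron_weights, neuron_biases, window):
    """Process inputs[t][i], the signal reaching neuron i at internal tick t.

    Each neuron i has its own weights (length `window`) and bias.
    Returns the per-neuron activation traces and their synchronization matrix.
    """
    n = len(neuron_weights)
    traces = [[] for _ in range(n)]
    for t in range(len(inputs)):
        for i in range(n):
            # Last `window` signals for neuron i, zero-padded before the first tick.
            history = [
                inputs[s][i] if s >= 0 else 0.0
                for s in range(t - window + 1, t + 1)
            ]
            traces[i].append(neuron_update(history, neuron_weights[i], neuron_biases[i]))
    return traces, synchronization(traces)


# Two neurons, four ticks, a history window of three ticks (illustrative numbers).
demo_inputs = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.6]]
demo_weights = [[0.5, -0.3, 0.8], [0.2, 0.7, -0.4]]
demo_biases = [0.0, 0.1]
demo_traces, demo_sync = run(demo_inputs, demo_weights, demo_biases, window=3)
```

In this toy the synchronization matrix depends only on how activations co-vary across ticks, which is the sense in which information is carried by timing and not by a single snapshot of activations.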

Karn Tiwari@TiwariKarn·2 May

@backpropogator @KinjawlB Congratulations 🎉👏.

Piyush Tiwary@backpropogator·1 May

Happy to inform my first first-authored A* paper at #icml2025 : "LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation" with hardworking @KinjawlB & my guide, Prathosh A.P. Preprint soon!

Karn Tiwari retweeted

Ashish Vaswani@ashVaswani·8 Apr

Reinforcement learning has shown success in eliciting reflection from LLMs, but what if this capability actually manifests earlier in pre-training? We investigated this question and our results are surprising 👇 [1/4]

Karn Tiwari retweeted

Pedro Domingos@pmddomingos·29 Mar

Neural networks don’t have to be distilled into other neural networks. They can be distilled into decision trees or sets of rules. And then interpretability becomes dramatically easier.

Karn Tiwari retweeted

Andy Keller@t_andy_keller·10 Mar

In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network? Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time" 1/14

Karn Tiwari retweeted

Jeremy Bernstein@jxbz·7 Mar

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)

Karn Tiwari retweeted

Peyman Milanfar@docmilanfar·5 Mar

With today's Turing, it's worth highlighting Barto & Sutton's view of "related work": "In this book, we consider all of the work in optimal control also to be, in a sense, work in reinforcement learning" It's an example of reframing & renaming in CS that's somehow both myopic & maximalist. 1/2

Karn Tiwari retweeted

mgostIH@mgostIH·2 Mar

This paper is pretty cool: The Belief State Transformer. Very simple technique and fast to train, makes transformers (or other seq models) better at modelling state and can additionally condition on the end! I wonder what this is like for RL, we might condition on high end reward!

Karn Tiwari retweeted

Aran Komatsuzaki@arankomatsuzaki·17 Feb

Large Language Diffusion Models. Presents LLaDA, an 8B diffusion LM trained entirely from scratch, rivaling LLaMA3 8B in performance despite being trained on 7x fewer tokens (2T tokens).
