Karn Tiwari

106 posts

@TiwariKarn

ML Researcher

India · Joined December 2019
2.4K Following · 76 Followers
Karn Tiwari retweeted
Alec Helbling (@alec_helbling)
Flow-based generative models trained with flow matching tend to learn curved trajectories, which are challenging to approximate in a few steps. Rectified flows aim to learn straight trajectories, which are easier to simulate with less computation.
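To make the few-step argument concrete, here is a minimal sketch (not taken from the tweet) that integrates two hand-written velocity fields with fixed-step Euler. Both fields, the start point `(1, 0)` and the target `(0, 1)` are illustrative assumptions; no model is trained here.

```python
import math

def euler(velocity, start, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, y = start
    dt = 1.0 / steps
    for i in range(steps):
        vx, vy = velocity((x, y), i * dt)
        x, y = x + dt * vx, y + dt * vy
    return (x, y)

def straight(point, t):
    # Assumed straight path: constant velocity from (1, 0) to (0, 1).
    return (-1.0, 1.0)

def curved(point, t):
    # Assumed curved path: quarter-circle rotation, also from (1, 0) to (0, 1).
    x, y = point
    w = math.pi / 2
    return (-w * y, w * x)

TARGET = (0.0, 1.0)

def error(point):
    """Distance between the simulated endpoint and the true endpoint."""
    return math.hypot(point[0] - TARGET[0], point[1] - TARGET[1])

straight_err_1 = error(euler(straight, (1.0, 0.0), 1))
curved_err_1 = error(euler(curved, (1.0, 0.0), 1))
curved_err_100 = error(euler(curved, (1.0, 0.0), 100))
print(straight_err_1, curved_err_1, curved_err_100)
```

In this toy setup a single Euler step lands exactly on the target for the straight path, while the curved path needs many steps to get close.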
9
42
430
35.1K
Karn Tiwari retweeted
Grad (@Grad62304977)
43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed.
Value residual learning (recently proposed by arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper proposes to set `v = 0.5 * v + 0.5 * v1` in each block, where `v1` is the value computed in the first block. After I implemented this method, it reduced the loss of the previous record, when run for 3200 steps (the final duration of the new record), from 3.327 to 3.315.
I then found that if we make the coefficients learnable, we can get further improvements. Specifically, I changed the formula to `v = lambda * v + (1 - lambda) * v1`, where lambda is a learnable parameter. After that, the loss reduces further from 3.315 to 3.3065, which corresponds to a 1.7x greater performance improvement than vanilla value residual. The graph attached shows that the later blocks have a smaller lambda, which means they make greater use of the shortcut to see v1.
After I got this result, @kellerjordan0 also implemented a similar shortcut allowing later blocks to directly access the embedding, which resulted in further gains.
Also, I have tried adding data dependence to lambda; however, this seems to have a big benefit only when training for longer token duration (actually the ideal situation), but this is a definite area for potential improvements.
Excited to see people try this in their models and see if this works at scale and across different setups!
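The two mixing formulas quoted in the tweet can be sketched in a few lines of plain Python. This is only an illustration: the vectors, the toy squared-error loss and the learning rate are assumptions made for the example, and the real NanoGPT run trains lambda by backpropagation inside the transformer.

```python
def mix_values(v, v1, lam):
    """Value residual mix: lam * v + (1 - lam) * v1, applied elementwise."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(v, v1)]

# Hypothetical value vectors: v1 from the first block, v from a later block.
v1 = [1.0, 2.0, 3.0]
v = [3.0, 0.0, -1.0]

# Fixed coefficients from the paper: v = 0.5 * v + 0.5 * v1.
fixed = mix_values(v, v1, 0.5)

# Learnable variant: treat lam as a scalar parameter and update it by gradient
# descent. The loss 0.5 * ||mixed - target||^2 with target = v1 is an assumed
# toy objective, chosen so that the shortcut to v1 is the better choice.
lam = 0.5
learning_rate = 0.01
target = v1
for _ in range(20):
    mixed = mix_values(v, v1, lam)
    # d(mixed)/d(lam) = v - v1, so the chain rule gives:
    grad = sum((m - t) * (a - b) for m, t, a, b in zip(mixed, target, v, v1))
    lam -= learning_rate * grad

print(fixed, lam)
```

Under this toy objective lambda shrinks toward zero, which mirrors the tweet's observation that a smaller lambda means heavier use of the shortcut to `v1`.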
[2 images attached]
18
53
513
95.8K
Karn Tiwari retweeted
Bidipta Sarkar (@bidiptas13)
Introducing 🥚EGGROLL 🥚(Evolution Guided General Optimization via Low-rank Learning)! 🚀 Scaling backprop-free Evolution Strategies (ES) for billion-parameter models at large population sizes ⚡100x Training Throughput 🎯Fast Convergence 🔢Pure Int8 Pretraining of RNN LLMs
[Image attached]
20
155
1K
307.5K
Karn Tiwari retweeted
Thinking Machines (@thinkymachines)
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/
[Image attached]
82
560
3.5K
1.4M
Karn Tiwari retweeted
Ricardo Buitrago (@rbuit_)
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix which enables length generalization in up to 256k sequences, with no need to change the architectures!
[Image attached]
6
34
197
42.3K
Karn Tiwari retweeted
Rain (@rainnekoneko)
We're working on a new LM architecture that does not use any variant of multi-head attention or recurrence, and it works well with long context lengths. We're calling it "Avey". Everything is open-sourced under an Apache-2.0 license.
Paper: arxiv.org/abs/2506.11305
Demo Models: huggingface.co/collections/av…
GitHub: github.avey.ai/avey-dpa
(feel free to dm me with questions, suggestions, etc o7)
Work on the next version of the architecture with many improvements is already underway. The currently released models are only pre-trained for 100 billion tokens, but there are plans to train larger LLMs using this architecture on a much larger dataset in the near future.
Here's a demo of Avey 1.5B generating a completion with an input of 45K tokens on my 4060 laptop while using less than 4GB of VRAM (with bf16).
20
55
439
46.9K
Karn Tiwari retweeted
Ruben Hassid (@rubenhassid)
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)
[Image attached]
2.6K
9.1K
62.8K
14.2M
Karn Tiwari retweeted
tyler hogge (@thogge)
one of Steve Jobs' best lines. simplicity is so hard.
[Image attached]
115
1.4K
15K
1.2M
Karn Tiwari retweeted
Yoni Slutzky (@YoniSlutzky)
Do neural nets really need gradient descent to generalize?🚨 We dive into matrix factorization and find a sharp split: wide nets rely on GD, while deep nets can thrive with any low-training-error weights! arxiv.org/abs/2506.03931 🧵
[Image attached]
1
13
49
4.5K
Karn Tiwari retweeted
Lucky Iyinbor (@Luckyballa)
New day, new paper: «e𝑓unc: An Efficient Function Representation Without Neural Networks»
It looks math-heavy, but it’s not that complicated.
Core idea is to swap an MLP with a bunch of small analytic patches, store them in a grid, and then perform global weighted interpolation.
Every grid vertex (32³) stores a polynomial, each with 13 learnable parameters, then all results are RBF-blended and soft-maxed.
That’s about it.
They use “query,” “key,” and “value” terminology, but it’s not like in attention, which is quite confusing.
Query - point we sample, key - grid-vertex position, value - polynomial function.
Some similarities with attention - it also has quadratic complexity, but it can be optimized with some smart parallel reduction techniques.
Additionally, you can give each vertex an optional offset to better align it with the surface, which costs a few extra floats per vertex but improves the accuracy.
Overall, it has better compression than a global MLP or NGP, but it’s slower than NGP to train and evaluate.
arxiv.org/pdf/2505.21319…
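The query/key/value description above can be sketched roughly as follows. This is a 1D toy reading of the tweet, not the paper's method: the grid spacing, the three-coefficient quadratic patches (instead of 13 parameters on a 32³ grid), the `sharpness` value and the use of a softmax over negative squared distances as the "RBF blend" are all assumptions, and the coefficients are set by hand from a Taylor expansion of `sin` rather than learned.

```python
import math

def blend(query, keys, coeffs, sharpness=50.0):
    """Softmax-weighted blend of local polynomial patches stored at grid vertices."""
    # Query = sample point, key = vertex position: score each vertex by distance.
    scores = [-sharpness * (query - k) ** 2 for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Value = a polynomial per vertex, evaluated in coordinates local to that vertex.
    values = [
        c0 + c1 * (query - k) + c2 * (query - k) ** 2
        for k, (c0, c1, c2) in zip(keys, coeffs)
    ]
    # Every query visits every vertex, hence the quadratic cost noted in the tweet.
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical grid of 9 vertices on [0, 2]; patch coefficients approximate sin.
keys = [0.25 * i for i in range(9)]
coeffs = [(math.sin(k), math.cos(k), -0.5 * math.sin(k)) for k in keys]

approx = blend(0.6, keys, coeffs)
print(approx, math.sin(0.6))
```

Because the weights decay quickly with distance, the nearest patches dominate the result even though all vertices take part in the sum.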
[4 images attached]
5
52
446
27.9K
Karn Tiwari (@TiwariKarn)
Honored to be recognized as a notable reviewer for ICLR 2025! 🎉 Huge thanks to the organizers and the entire community for the opportunity to contribute to such an impactful conference. #ICLR2025 #PeerReview #MachineLearning
Quoting ICLR (@iclr_conf):

2025 had a record breaking number of submissions, and we are grateful for the contributions of all reviewers & ACs. We wish to acknowledge the following notable reviewers who went above and beyond, reviewing 4 or more papers. Thank you for your service. iclr.cc/Conferences/20…

0
0
6
166
Karn Tiwari retweeted
Sakana AI (@SakanaAILabs)
Introducing Continuous Thought Machines
New Blog: sakana.ai/ctm/
Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence.
Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations.
Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity.
Learn more about our approach:
Interactive Report: pub.sakana.ai/ctm/
Full Paper: arxiv.org/abs/2505.05522
GitHub: github.com/SakanaAI/conti…
36
283
1.3K
289.6K
Piyush Tiwary (@backpropogator)
Happy to inform my first first-authored A* paper at #icml2025 : "LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation" with hardworking @KinjawlB & my guide, Prathosh A.P. Preprint soon!
5
2
33
2.8K
Karn Tiwari retweeted
Ashish Vaswani (@ashVaswani)
Reinforcement learning has shown success in eliciting reflection from LLMs, but what if this capability actually manifests earlier in pre-training? We investigated this question and our results are surprising 👇 [1/4]
[Image attached]
13
100
806
137.8K
Karn Tiwari retweeted
Pedro Domingos (@pmddomingos)
Neural networks don’t have to be distilled into other neural networks. They can be distilled into decision trees or sets of rules. And then interpretability becomes dramatically easier.
49
98
1.3K
146.2K
Karn Tiwari retweeted
Andy Keller (@t_andy_keller)
In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network? Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time" 1/14
[GIF attached]
143
882
7K
760.2K
Karn Tiwari retweeted
Jeremy Bernstein (@jxbz)
I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
[Image attached]
13
138
1K
125.6K
Karn Tiwari retweeted
Peyman Milanfar (@docmilanfar)
With today's Turing, it's worth highlighting Barto & Sutton's view of "related work": "In this book, we consider all of the work in optimal control also to be, in a sense, work in reinforcement learning" It's an example of reframing & renaming in CS that's somehow both myopic & maximalist. 1/2
1
5
68
20.7K
Karn Tiwari retweeted
mgostIH (@mgostIH)
This paper is pretty cool: The Belief State Transformer
Very simple technique and fast to train, makes transformers (or other seq models) better at modelling state and can additionally condition on the end!
I wonder what this is like for RL, we might condition on high end reward!
[Image attached]
15
94
634
80.3K
Karn Tiwari retweeted
Aran Komatsuzaki (@arankomatsuzaki)
Large Language Diffusion Models
Presents LLaDA, an 8B diffusion LM, trained entirely from scratch, rivaling LLaMA3 8B in performance despite being trained on 7x fewer tokens (2T tokens).
[Image attached]
3
30
158
35.8K