Karn Tiwari
106 posts

Karn Tiwari retweeted

43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed.
Value residual learning (recently proposed by arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper proposes to set `v = 0.5 * v + 0.5 * v1` in each block, where `v1` is the value computed in the first block.
Implementing this method reduces the loss of the previous record, when run for 3200 steps (the final duration of the new record), from 3.327 to 3.315.
I then found that if we make the coefficients learnable, we can get further improvements. Specifically I changed the formula to `v = lambda * v + (1 - lambda) * v1` where lambda is a learnable parameter.
With learnable coefficients, the loss drops further from 3.315 to 3.3065, which makes the total improvement over the previous record about 1.7x that of vanilla value residual.
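The 1.7x figure can be checked from the three losses quoted in this thread. This is only the arithmetic, and the variable names are chosen for illustration:

```python
baseline = 3.327       # previous record run for 3200 steps
fixed_mix = 3.315      # v = 0.5 * v + 0.5 * v1
learned_mix = 3.3065   # v = lambda * v + (1 - lambda) * v1

gain_fixed = baseline - fixed_mix      # improvement from vanilla value residual
gain_learned = baseline - learned_mix  # improvement with learnable lambda
ratio = gain_learned / gain_fixed      # about 1.71

print(f"fixed: {gain_fixed:.4f}  learned: {gain_learned:.4f}  ratio: {ratio:.2f}x")
```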
The graph attached shows that the later blocks have a smaller lambda, which means they make greater use of the shortcut to see v1.
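As a rough illustration of the mixing rule described in this thread, here is a minimal NumPy sketch. The function name `mix_values`, the array shapes, and the value 0.2 for a late block's lambda are assumptions made up for the example. In an actual training run lambda would be a learnable parameter updated by the optimizer, not a fixed float.

```python
import numpy as np

def mix_values(v, v1, lam):
    """Blend a block's own values v with the first block's values v1."""
    return lam * v + (1.0 - lam) * v1

rng = np.random.default_rng(0)
v1 = rng.standard_normal((4, 8))  # values from the first block (tokens x head_dim)
v = rng.standard_normal((4, 8))   # values computed in a later block

paper_mix = mix_values(v, v1, 0.5)  # fixed coefficients from the paper
late_mix = mix_values(v, v1, 0.2)   # hypothetical small lambda: leans on v1

# Distance from v1 shrinks as lambda shrinks.
print(np.abs(paper_mix - v1).mean(), np.abs(late_mix - v1).mean())
```

With `lam = 1.0` the block ignores the shortcut and keeps its own values; with `lam = 0.0` it uses only `v1`. A smaller lambda in later blocks therefore means heavier use of the first block's values.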
After I got this result, @kellerjordan0 also implemented a similar shortcut allowing later blocks to directly access the embedding, which resulted in further gains.
I have also tried making lambda data-dependent. This seems to have a big benefit only when training on more tokens (which is actually the ideal situation), but it is a definite area for potential improvement.
Excited to see people try this in their models and see if this works at scale and across different setups!


Karn Tiwari retweeted

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
thinkingmachines.ai/blog/lora/

Karn Tiwari retweeted

We're working on a new LM architecture that does not use any variant of multi-head attention or recurrence, and it works well with long context lengths. We're calling it "Avey". Everything is open-sourced under an Apache-2.0 license.
Paper: arxiv.org/abs/2506.11305
Demo Models: huggingface.co/collections/av…
GitHub: github.avey.ai/avey-dpa
(feel free to dm me with questions, suggestions, etc o7)
Work on the next version of the architecture with many improvements is already underway.
The currently released models are pre-trained on only 100 billion tokens, but there are plans to train larger LLMs using this architecture on a much larger dataset in the near future.
Here's a demo of Avey 1.5B generating a completion with an input of 45K tokens on my 4060 laptop while using less than 4GB of VRAM (with bf16).
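For a sense of scale, here is a back-of-the-envelope estimate of the weight memory alone. It assumes exactly 1.5e9 parameters stored at 2 bytes each (bf16) and ignores activations and any other runtime buffers, so it is a lower bound and not a measurement of the demo:

```python
params = 1.5e9        # assumed parameter count for "Avey 1.5B"
bytes_per_param = 2   # bf16 stores each value in 16 bits

weight_bytes = params * bytes_per_param
weight_gb = weight_bytes / 1e9      # decimal gigabytes
weight_gib = weight_bytes / 2**30   # binary gibibytes

print(f"{weight_gb:.2f} GB ({weight_gib:.2f} GiB) for weights alone")
```

Under these assumptions the weights take about 3 GB, which leaves under 1 GB for everything else if the total stays below 4GB.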
Karn Tiwari retweeted

Do neural nets really need gradient descent to generalize?🚨
We dive into matrix factorization and find a sharp split: wide nets rely on GD, while deep nets can thrive with any low-training-error weights!
arxiv.org/abs/2506.03931
🧵

Karn Tiwari retweeted

New day, new paper
«e𝑓unc: An Efficient Function Representation Without Neural Networks»
It looks math-heavy, but it’s not that complicated
The core idea is to replace an MLP with a bunch of small analytic patches, store them in a grid, and then perform global weighted interpolation
Every grid vertex (32³) stores a polynomial, each with 13 learnable parameters, then all results are RBF-blended and soft-maxed
That’s about it
They use “query,” “key,” and “value” terminology, but it’s not like in attention, which is quite confusing
Query - point we sample, key - grid-vertex position, value - polynomial function
Some similarities with attention - it also has quadratic complexity, but it can be optimized with some smart parallel reduction techniques
Additionally, you can give each vertex an optional offset to better align it with the surface, which costs a few extra floats per vertex but improves the accuracy
Overall, it has better compression than a global MLP or NGP, but it’s slower than NGP to train and evaluate
arxiv.org/pdf/2505.21319…
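To make the query/key/value description concrete, here is a loose NumPy sketch of the blending step as summarized in this post. It is not the paper's implementation: the helper names `make_grid` and `evaluate`, the Gaussian width `sigma`, the tiny 4³ demo grid, and the 4-coefficient linear patch (standing in for the 13-parameter polynomial, whose exact form is not given here) are all assumptions.

```python
import numpy as np

def make_grid(n):
    """Keys: positions of the n^3 vertices of a regular grid in [0, 1]^3."""
    axis = np.linspace(0.0, 1.0, n)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def evaluate(queries, keys, coeffs, sigma=0.2):
    """Blend per-vertex patches with softmax-normalized Gaussian RBF weights.

    queries: (Q, 3) sample points
    keys:    (K, 3) grid-vertex positions
    coeffs:  (K, 4) hypothetical linear patch c0 + c[1:4] . (q - k)
    """
    offsets = queries[:, None, :] - keys[None, :, :]   # (Q, K, 3)
    sq_dist = np.sum(offsets**2, axis=-1)              # (Q, K): the quadratic cost
    logits = -sq_dist / (2.0 * sigma**2)               # log of a Gaussian RBF
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over all vertices
    patch_vals = coeffs[None, :, 0] + np.einsum("qkd,kd->qk", offsets, coeffs[:, 1:])
    return np.sum(weights * patch_vals, axis=1)        # (Q,)

keys = make_grid(4)                      # 64 vertices, small for the demo
coeffs = np.zeros((keys.shape[0], 4))
coeffs[:, 0] = 1.0                       # every patch is the constant 1
queries = np.random.default_rng(0).uniform(size=(5, 3))
out = evaluate(queries, keys, coeffs)    # weights sum to 1, so this is all ones

n_params = 32**3 * 13                    # sizes quoted in the post: 425,984 parameters
print(out, n_params)
```

Every query is compared against every vertex, which is where the Q x K cost mentioned in the post comes from.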





Honored to be recognized as a notable reviewer for ICLR 2025! 🎉 Huge thanks to the organizers and the entire community for the opportunity to contribute to such an impactful conference. #ICLR2025 #PeerReview #MachineLearning
ICLR @iclr_conf
2025 had a record breaking number of submissions, and we are grateful for the contributions of all reviewers & ACs. We wish to acknowledge the following notable reviewers who went above and beyond, reviewing 4 or more papers. Thank you for your service. iclr.cc/Conferences/20…
Karn Tiwari retweeted

Introducing Continuous Thought Machines
New Blog: sakana.ai/ctm/
Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence.
Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations.
Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity.
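A toy NumPy sketch of the two ideas, under heavy assumptions: each neuron gets a single private weight vector over a short history of incoming signals, and synchronization is taken as the inner product of neurons' activation histories. The sizes, the `tanh` nonlinearity, and the random inputs are placeholders for illustration, not details from the CTM release.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, history_len, n_ticks = 6, 4, 10

# (1) Neuron-level temporal processing: one private weight vector per neuron.
private_w = rng.standard_normal((n_neurons, history_len))

pre_hist = np.zeros((n_neurons, history_len))  # rolling window of incoming signals
post_hist = []                                 # activations collected over ticks

for _ in range(n_ticks):
    incoming = rng.standard_normal(n_neurons)  # stand-in for real inputs
    # Drop the oldest signal and append the newest one.
    pre_hist = np.concatenate([pre_hist[:, 1:], incoming[:, None]], axis=1)
    # Each neuron reads only its own history with its own weights.
    post = np.tanh(np.sum(private_w * pre_hist, axis=1))
    post_hist.append(post)

# (2) Neural synchronization: pairwise similarity of activation time series.
Z = np.stack(post_hist, axis=1)  # (n_neurons, n_ticks)
sync = Z @ Z.T                   # (n_neurons, n_neurons), symmetric

print(sync.shape)
```

In this sketch `sync` is the latent representation a downstream readout would consume. It depends on how neurons co-vary over time, not on any single tick's activations.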
Learn more about our approach:
Interactive Report: pub.sakana.ai/ctm/
Full Paper: arxiv.org/abs/2505.05522
GitHub: github.com/SakanaAI/conti…
Karn Tiwari retweeted

In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network?
Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time"
1/14
Karn Tiwari retweeted

With today's Turing, it's worth highlighting Barto & Sutton's view of "related work":
"In this book, we consider all of the work in optimal control also to be, in a sense, work in reinforcement learning"
It's an example of reframing & renaming in CS that's somehow both myopic & maximalist.
1/2