Anand Gopalakrishnan

161 posts

Anand Gopalakrishnan

@agopal42

Postdoc at @Harvard with @du_yilun and @gershbrain. PhD with @SchmidhuberAI. Previously: Apple MLR, AWS AI Lab. 7. Same handle on 🦋

Cambridge, MA · Joined January 2018
523 Following · 734 Followers
Pinned Tweet
Anand Gopalakrishnan @agopal42
Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: arxiv.org/abs/2509.10534
30 replies · 180 reposts · 1.5K likes · 154.1K views
Max Li 李赵硕 @mli0603
I've been debugging RoPE recently and kept getting tripped up by details that most explanations gloss over. So I wrote a deep dive: "Understanding RoPE: From Rotary Embeddings to Context Extension" mli0603.notion.site/Understanding-… The blog covers:
• Full RoPE derivation from rotation matrices
• A clean proof of why RoPE's attention decays with distance (and when it breaks)
• The π boundary (RoPE's Nyquist limit)
• NTK-aware scaling derivation
• Dynamic NTK
• YaRN's frequency ramp + attention scaling
• Reference PyTorch code
Hope it helps! Feedback welcome!
8 replies · 58 reposts · 538 likes · 60.5K views
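The rotation-matrix view covered in the blog above fits in a few lines of plain Python. This is a minimal illustrative sketch (not the blog's reference PyTorch code) showing RoPE's defining property: after rotating queries and keys by position-dependent angles, their dot product depends only on the relative offset.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length.

    Each consecutive pair (vec[2i], vec[2i+1]) is rotated by angle
    pos * theta_i, with theta_i = base ** (-2*i / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # per-pair frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Same query/key pair at two absolute positions with the same offset:
# the attention logit is identical, i.e. RoPE encodes relative position.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))    # offset 2
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))  # offset 2
assert abs(s1 - s2) < 1e-9
```

The same script also makes the blog's decay argument easy to probe empirically: sweep the offset and watch the logit oscillate and shrink.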
Anand Gopalakrishnan retweeted
Kazuki Irie @kzkirie
Back in 2019, I reduced transformer LM KV-cache size by: (1) setting K=V (storing only K), (2) deeper FF blocks & fewer self-attn layers overall. Published at ICASSP 2020. To my knowledge, the first publication on KV-cache reduction--lmk if you know anything older!
[image]
3 replies · 10 reposts · 87 likes · 4.8K views
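The K=V trick in the tweet above can be shown with a toy single-head attention in plain Python. This is a sketch of the general idea, not the ICASSP 2020 paper's exact formulation: by tying values to keys, the cache stores one tensor instead of two, halving KV-cache memory.

```python
import math

def attention_shared_kv(queries, keys):
    """Single-head attention with V = K, so only K is cached.

    `queries` and `keys` are lists of equal-length float vectors.
    Setting V = K halves KV-cache memory at the cost of tying the
    value space to the key space.
    """
    outputs = []
    for q in queries:
        # Scaled dot-product scores against every cached key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        # Numerically stable softmax
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # V = K: reuse the cached keys as the values being mixed
        outputs.append([sum(w * k[j] for w, k in zip(weights, keys))
                        for j in range(len(keys[0]))])
    return outputs
```

With a single cached key, the softmax weight is 1 and the output is the key itself, which makes the value/key tying easy to see.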
Anand Gopalakrishnan retweeted
Jonas @LoosJonas
Can we replace RoPE with PoPE (Polar Coordinate Positional Embeddings) in pretrained language models? Turns out we can! Using small pythia models, after a small recalibration (~2% of pretrain), we get significantly better length generalization. 1/3 x.com/agopal42/statu…
[image]
Quoted tweet from Anand Gopalakrishnan @agopal42:

Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: arxiv.org/abs/2509.10534

1 reply · 4 reposts · 14 likes · 1.1K views
Anand Gopalakrishnan retweeted
Hansen Lillemark @hansenlillemark
State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️
17 replies · 104 reposts · 755 likes · 88.9K views
Anand Gopalakrishnan retweeted
Andy Keller @t_andy_keller
When you're crossing the street and turn your head, you typically remember whether or not a car is coming from the other direction - so why can't today's world models? Introducing Flow Equivariant World Models flowequivariantworldmodels.github.io Led by @hansenlillemark & @huskydogewoof🧵👇
Quoted tweet from Hansen Lillemark @hansenlillemark:

State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️

1 reply · 10 reposts · 49 likes · 3.5K views
Anand Gopalakrishnan @agopal42
@sasuke___420 @jm_alexia No we don't. By partial RoPE you mean applying rotations on a subset of all channels/features? Since that's an orthogonal design choice and can be done on both RoPE and PoPE we decided to compare the simplest versions.
0 replies · 0 reposts · 1 like · 106 views
sasuke⚡420 @sasuke___420
@agopal42 @jm_alexia hello, do you have ablations with partial RoPE? the full RoPE baseline is sort of unrepresentative of what people actually do
1 reply · 0 reposts · 0 likes · 106 views
Alexia Jolicoeur-Martineau @jm_alexia
Llama4 tried to use NoPE (no positional information) and it was a huge failure. My expectation is that this will fail in practice and lead to weird behaviors. But I would be happy to be wrong, since RoPE is limiting long-context generalization. Time will tell.
Quoted tweet from Sakana AI @SakanaAILabs:

Introducing DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings pub.sakana.ai/DroPE/

We are releasing a new method called DroPE to extend the context length of pretrained LLMs without the massive compute costs usually associated with long-context fine-tuning.

The core insight of this work challenges a fundamental assumption in Transformer architecture. We discovered that explicit positional embeddings like RoPE are critical for training convergence but eventually become the primary bottleneck preventing models from generalizing to longer sequences. Our solution is radically simple: we treat positional embeddings as a temporary training scaffold rather than a permanent architectural necessity.

Real-world workflows like reviewing massive code diffs or analyzing legal contracts require context windows that break standard pretrained models. While models without positional embeddings (NoPE) generalize better to these unseen lengths, they are notoriously unstable to train from scratch. Here, we achieve the best of both worlds by using embeddings to ensure stability during pretraining and then dropping them to unlock length extrapolation during inference.

Our approach unlocks seamless zero-shot context extension without any expensive long-context training. We demonstrated this on a range of off-the-shelf open-source LLMs. In our tests, recalibrating any model with DroPE requires less than 1% of the original pretraining budget, yet it significantly outperforms established methods on challenging benchmarks like LongBench and RULER.

We have released the code and the full paper to encourage the community to rethink the role of positional encodings in modern LLMs.
Paper: arxiv.org/abs/2512.12167
Code: github.com/SakanaAI/DroPE

24 replies · 27 reposts · 427 likes · 48.5K views
Anand Gopalakrishnan retweeted
Kazuki Irie @kzkirie
Humans can't write programs that classify cats vs dogs. Deep learning lets GD write that program. Continual learning is the same: it's too hard for us to design good CL algorithms. Let GD write that algorithm too. That’s the idea of metalearning CL algos: arxiv.org/abs/2312.00276
0 replies · 2 reposts · 13 likes · 836 views
François Fleuret @francoisfleuret
Coding keeps you humble.
[image]
3 replies · 1 repost · 25 likes · 3.7K views
Anand Gopalakrishnan @agopal42
@deaton_jon Yes that's true, but the eqns (7-10) were presented in a feature/channel-wise manner. So we wrote an extra multiplication (per channel). Thanks for your response!
0 replies · 0 reposts · 1 like · 216 views
Anand Gopalakrishnan @agopal42
@eshear By log-polar do you mean log spaced frequencies (thetas) used in RoPE? Or something else?
0 replies · 0 reposts · 1 like · 59 views
Emmett Shear @eshear
@agopal42 Did you consider doing log-polar (what the eye uses, effectively CP^1 geometry) vs linear-polar? V curious if you tested it and if it mattered!
1 reply · 0 reposts · 2 likes · 95 views
Mayank Chaturvedi @imayank42
@agopal42 Love the decoupling of the q,k vectors and their locations. Is the code available on GitHub yet? I’d love to take a closer look at the implementation.
1 reply · 0 reposts · 1 like · 1.2K views
Anand Gopalakrishnan @agopal42
11/ Key takeaway: The what-where entanglement in RoPE hurts sequence modelling performance and length generalization. PoPE's disentanglement provides a powerful inductive bias that solves these issues. Huge thanks to my co-authors @robert_csordas , @SchmidhuberAI, @mc_mozer !
1 reply · 0 reposts · 28 likes · 2.7K views
Anand Gopalakrishnan @agopal42
10/ Length extrapolation is where things get very interesting. We evaluate pre-trained models with 1024 tokens, then test on sequences up to 10,240 tokens. PoPE maintains stable performance without any fine-tuning or frequency interpolation. Even beats YaRN, which uses both!
[image]
1 reply · 1 repost · 39 likes · 3.3K views
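For context on the "frequency interpolation" baselines this tweet compares against (NTK-aware scaling, which YaRN builds on): they rescale RoPE's frequency base so a pretrained model can read longer sequences. A minimal sketch of the common open-source recipe follows; the helper name is illustrative, not from the paper.

```python
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware RoPE scaling: raise the frequency base so the lowest
    rotary frequency is stretched by `scale` while the highest stays
    almost unchanged (common open-source recipe, illustrative helper).
    """
    return base * scale ** (dim / (dim - 2))

# With the scaled base b' = base * scale**(d/(d-2)), the slowest
# frequency theta_min = b' ** (-(d-2)/d) shrinks by exactly `scale`,
# so the longest-wavelength channel covers `scale` times more positions.
```

PoPE's claim here is that it avoids this machinery entirely: no base rescaling or fine-tuning is needed for the reported extrapolation from 1,024 to 10,240 tokens.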
Anand Gopalakrishnan @agopal42
9/ Zero-shot downstream tasks (LAMBADA, BLiMP, CBT, HellaSwag, PIQA, ARC): PoPE consistently outperforms RoPE across model sizes. At 774M params, we see improvements on every single task.
1 reply · 0 reposts · 19 likes · 3.1K views