Anand Gopalakrishnan

161 posts

Anand Gopalakrishnan

@agopal42

Postdoc at @Harvard with @du_yilun and @gershbrain. PhD with @SchmidhuberAI. Previously: Apple MLR, AWS AI Lab. 7. Same handle on 🦋

Cambridge, MA · Joined January 2018
523 Following · 734 Followers
Pinned Tweet
Anand Gopalakrishnan @agopal42
Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: arxiv.org/abs/2509.10534
30 replies · 180 reposts · 1.5K likes · 154.1K views
Max Li 李赵硕 @mli0603
I've been debugging RoPE recently and kept getting tripped up by details that most explanations gloss over. So I wrote a deep dive: "Understanding RoPE: From Rotary Embeddings to Context Extension" mli0603.notion.site/Understanding-… The blog covers:
• Full RoPE derivation from rotation matrices
• A clean proof of why RoPE's attention decays with distance (and when it breaks)
• The π boundary (RoPE's Nyquist limit)
• NTK-aware scaling derivation
• Dynamic NTK
• YaRN's frequency ramp + attention scaling
• Reference PyTorch code
Hope it helps! Feedback welcome!
8 replies · 58 reposts · 538 likes · 60.5K views
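The rotation-matrix view covered in the blog above fits in a few lines of plain Python. This is a minimal illustrative sketch (not the blog's reference PyTorch code) showing RoPE's defining property: after rotating queries and keys by position-dependent angles, their dot product depends only on the relative offset.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length.

    Each consecutive pair (vec[2i], vec[2i+1]) is rotated by angle
    pos * theta_i, with theta_i = base ** (-2*i / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # per-pair frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Same query/key pair at two absolute positions with the same offset:
# the attention logit is identical, i.e. RoPE encodes relative position.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))    # offset 2
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))  # offset 2
assert abs(s1 - s2) < 1e-9
```

The same script also makes the blog's decay argument easy to probe empirically: sweep the offset and watch the logit oscillate and shrink.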
Anand Gopalakrishnan retweeted
Kazuki Irie @kzkirie
Back in 2019, I reduced transformer LM KV-cache size by: (1) setting K=V (storing only K), (2) deeper FF blocks & fewer self-attn layers overall. Published at ICASSP 2020. To my knowledge, the first publication on KV-cache reduction--lmk if you know anything older!
[image]
3 replies · 10 reposts · 87 likes · 4.8K views
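The K=V trick in the tweet above can be shown with a toy single-head attention in plain Python. This is a sketch of the general idea, not the ICASSP 2020 paper's exact formulation: by tying values to keys, the cache stores one tensor instead of two, halving KV-cache memory.

```python
import math

def attention_shared_kv(queries, keys):
    """Single-head attention with V = K, so only K is cached.

    `queries` and `keys` are lists of equal-length float vectors.
    Setting V = K halves KV-cache memory at the cost of tying the
    value space to the key space.
    """
    outputs = []
    for q in queries:
        # Scaled dot-product scores against every cached key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        # Numerically stable softmax
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # V = K: reuse the cached keys as the values being mixed
        outputs.append([sum(w * k[j] for w, k in zip(weights, keys))
                        for j in range(len(keys[0]))])
    return outputs
```

With a single cached key, the softmax weight is 1 and the output is the key itself, which makes the value/key tying easy to see.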
Anand Gopalakrishnan retweeted
Jonas @LoosJonas
Can we replace RoPE with PoPE (Polar Coordinate Positional Embeddings) in pretrained language models? Turns out we can! Using small pythia models, after a small recalibration (~2% of pretrain), we get significantly better length generalization. 1/3 x.com/agopal42/statu…
[image]
Quoted tweet from Anand Gopalakrishnan @agopal42:

Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: arxiv.org/abs/2509.10534

1 reply · 4 reposts · 14 likes · 1.1K views
Anand Gopalakrishnan retweeted
Hansen Lillemark @hansenlillemark
State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️
17 replies · 104 reposts · 755 likes · 88.9K views
Anand Gopalakrishnan retweeted
Andy Keller @t_andy_keller
When you're crossing the street and turn your head, you typically remember whether or not a car is coming from the other direction - so why can't today's world models? Introducing Flow Equivariant World Models flowequivariantworldmodels.github.io Led by @hansenlillemark & @huskydogewoof🧵👇
Quoted tweet from Hansen Lillemark @hansenlillemark:

State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️

1 reply · 10 reposts · 49 likes · 3.5K views
Anand Gopalakrishnan @agopal42
@sasuke___420 @jm_alexia No we don't. By partial RoPE you mean applying rotations on a subset of all channels/features? Since that's an orthogonal design choice and can be done on both RoPE and PoPE we decided to compare the simplest versions.
0 replies · 0 reposts · 1 like · 106 views
sasuke⚡420 @sasuke___420
@agopal42 @jm_alexia hello, do you have ablations with partial RoPE? the full RoPE baseline is sort of unrepresentative of what people actually do
1 reply · 0 reposts · 0 likes · 106 views
Alexia Jolicoeur-Martineau @jm_alexia
Llama4 tried to use NoPE (no positional information) and it was a huge failure. My expectation is that this will fail in practice and lead to weird behaviors. But I would be happy to be wrong, since RoPE is limiting long-context generalization. Time will tell.
Quoted tweet from Sakana AI @SakanaAILabs:

Introducing DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings pub.sakana.ai/DroPE/

We are releasing a new method called DroPE to extend the context length of pretrained LLMs without the massive compute costs usually associated with long-context fine-tuning.

The core insight of this work challenges a fundamental assumption in Transformer architecture. We discovered that explicit positional embeddings like RoPE are critical for training convergence but eventually become the primary bottleneck preventing models from generalizing to longer sequences. Our solution is radically simple: we treat positional embeddings as a temporary training scaffold rather than a permanent architectural necessity.

Real-world workflows like reviewing massive code diffs or analyzing legal contracts require context windows that break standard pretrained models. While models without positional embeddings (NoPE) generalize better to these unseen lengths, they are notoriously unstable to train from scratch. Here, we achieve the best of both worlds by using embeddings to ensure stability during pretraining and then dropping them to unlock length extrapolation during inference.

Our approach unlocks seamless zero-shot context extension without any expensive long-context training. We demonstrated this on a range of off-the-shelf open-source LLMs. In our tests, recalibrating any model with DroPE requires less than 1% of the original pretraining budget, yet it significantly outperforms established methods on challenging benchmarks like LongBench and RULER.

We have released the code and the full paper to encourage the community to rethink the role of positional encodings in modern LLMs.
Paper: arxiv.org/abs/2512.12167
Code: github.com/SakanaAI/DroPE

24 replies · 27 reposts · 427 likes · 48.5K views
Anand Gopalakrishnan retweeted
Kazuki Irie @kzkirie
Humans can't write programs that classify cats vs dogs. Deep learning lets GD write that program. Continual learning is the same: it's too hard for us to design good CL algorithms. Let GD write that algorithm too. That’s the idea of metalearning CL algos: arxiv.org/abs/2312.00276
0 replies · 2 reposts · 13 likes · 836 views
François Fleuret @francoisfleuret
Coding keeps you humble.
[image]
3 replies · 1 repost · 25 likes · 3.7K views
Anand Gopalakrishnan @agopal42
@deaton_jon Yes that's true, but the eqns (7-10) were presented in a feature/channel-wise manner. So we wrote an extra multiplication (per channel). Thanks for your response!
0 replies · 0 reposts · 1 like · 216 views
Anand Gopalakrishnan @agopal42
@eshear By log-polar do you mean log spaced frequencies (thetas) used in RoPE? Or something else?
0 replies · 0 reposts · 1 like · 59 views
Emmett Shear @eshear
@agopal42 Did you consider doing log-polar (what the eye uses, effectively CP^1 geometry) vs linear-polar? V curious if you tested it and if it mattered!
1 reply · 0 reposts · 2 likes · 95 views
Mayank Chaturvedi @imayank42
@agopal42 Love the decoupling of the q,k vectors and their locations. Is the code available on GitHub yet? I’d love to take a closer look at the implementation.
1 reply · 0 reposts · 1 like · 1.2K views
Anand Gopalakrishnan @agopal42
11/ Key takeaway: The what-where entanglement in RoPE hurts sequence modelling performance and length generalization. PoPE's disentanglement provides a powerful inductive bias that solves these issues. Huge thanks to my co-authors @robert_csordas , @SchmidhuberAI, @mc_mozer !
1 reply · 0 reposts · 28 likes · 2.7K views
Anand Gopalakrishnan @agopal42
10/ Length extrapolation is where things get very interesting. We evaluate pre-trained models with 1024 tokens, then test on sequences up to 10,240 tokens. PoPE maintains stable performance without any fine-tuning or frequency interpolation. Even beats YaRN, which uses both!
[image]
1 reply · 1 repost · 39 likes · 3.3K views
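For context on the "frequency interpolation" baselines this tweet compares against (NTK-aware scaling, which YaRN builds on): they rescale RoPE's frequency base so a pretrained model can read longer sequences. A minimal sketch of the common open-source recipe follows; the helper name is illustrative, not from the paper.

```python
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware RoPE scaling: raise the frequency base so the lowest
    rotary frequency is stretched by `scale` while the highest stays
    almost unchanged (common open-source recipe, illustrative helper).
    """
    return base * scale ** (dim / (dim - 2))

# With the scaled base b' = base * scale**(d/(d-2)), the slowest
# frequency theta_min = b' ** (-(d-2)/d) shrinks by exactly `scale`,
# so the longest-wavelength channel covers `scale` times more positions.
```

PoPE's claim here is that it avoids this machinery entirely: no base rescaling or fine-tuning is needed for the reported extrapolation from 1,024 to 10,240 tokens.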
Anand Gopalakrishnan @agopal42
9/ Zero-shot downstream tasks (LAMBADA, BLiMP, CBT, HellaSwag, PIQA, ARC): PoPE consistently outperforms RoPE across model sizes. At 774M params, we see improvements on every single task.
1 reply · 0 reposts · 19 likes · 3.1K views