Marcel Rød

@marcelroed

PhD Student at Stanford working with Tatsu Hashimoto and Jure Leskovec. Previously MIT, Oxford, CERN

Stanford, CA · Joined March 2012

242 Following · 490 Followers

Marcel Rød retweeted

Arnuv Tandon @arnuvtandon · Dec 29

Excited to release a new paper today: “End-to-End Test-Time Training for Long Context”. Our method, TTT-E2E, enables models to continue learning at test time via next-token prediction on the given context, compressing context into model weights. For our main result, we extend 3B parameter models from 8K to 128K. TTT-E2E scales with context length like full attention without maintaining keys and values for every token in the sequence. With linear complexity, TTT-E2E is 2.7x faster than full attention at 128K tokens while achieving better performance. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…
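
To make the idea of "learning at test time via next-token prediction on the given context" concrete, here is a toy sketch. It is not the TTT-E2E method from the paper: it assumes a tiny bigram model (a table of logits), plain SGD, and hypothetical names such as `adapt_to_context`. It only illustrates how information in a context can end up stored in weights.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def context_loss(weights, context):
    # Average next-token cross-entropy of the bigram model over the context.
    pairs = list(zip(context, context[1:]))
    return -sum(math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)

def adapt_to_context(weights, context, lr=0.5, steps=20):
    # Hypothetical helper: copy the weights, then take SGD steps on the
    # next-token loss of the context itself (the "test-time" update).
    weights = [row[:] for row in weights]
    for _ in range(steps):
        for prev, nxt in zip(context, context[1:]):
            probs = softmax(weights[prev])
            for j in range(len(probs)):
                # Gradient of cross-entropy w.r.t. logits: probs - one_hot(nxt).
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                weights[prev][j] -= lr * grad
    return weights

vocab_size = 4
base = [[0.0] * vocab_size for _ in range(vocab_size)]  # untrained: uniform predictions
context = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]                # toy token ids (made up)
adapted = adapt_to_context(base, context)

print("loss before:", context_loss(base, context))
print("loss after: ", context_loss(adapted, context))
```

After adaptation, the pattern in the context lives in the weight table, so the context itself no longer needs to be kept around to predict its continuation. The real method operates on a full language model and differs in many details.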

Marcel Rød retweeted

Karan Dalal @karansdalal · Dec 29

Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test time, our model continues learning from the given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…

Marcel Rød @marcelroed · Jul 1

@simonguozirui @josancamon19 @neilbband We just ordered a new (small) batch!

Simon Guo @simonguozirui · Jul 1

@josancamon19 @neilbband @marcelroed do we have extras? or should we produce a second batch

Joan Cabezas @josancamon19 · Jul 1

BPE is easy in principle: 1) regex pre-tokenize, 2) convert pre-tokens into bytes, 3) count pair frequencies, 4) merge the highest-frequency pair, then repeat 3 and 4 until the target vocab size is reached. It took me about 2 hours; the optimized version took me more than 27!
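
The four steps can be sketched roughly as follows. This is a minimal, unoptimized version under stated assumptions: the pre-tokenization regex is a simplified stand-in (not the pattern any particular tokenizer uses), the tie-breaking rule is an arbitrary choice, and `train_bpe` is a hypothetical name.

```python
import re
from collections import Counter

# Simplified pre-tokenization pattern (an assumption for illustration).
PRETOKEN_PATTERN = re.compile(r"\s?\w+|\s?[^\w\s]+|\s+")

def train_bpe(text, vocab_size):
    # 1) Regex pre-tokenize, 2) turn each pre-token into a tuple of single bytes.
    words = Counter(
        tuple(bytes([b]) for b in m.group().encode("utf-8"))
        for m in PRETOKEN_PATTERN.finditer(text)
    )
    vocab = [bytes([i]) for i in range(256)]
    merges = []
    while len(vocab) < vocab_size:
        # 3) Count adjacent pair frequencies, weighted by pre-token count.
        pairs = Counter()
        for word, count in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break  # nothing left to merge
        # 4) Merge the most frequent pair; ties go to the lexicographically
        #    larger pair (an arbitrary but deterministic choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        vocab.append(merged)
        new_words = Counter()
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += count
        words = new_words
    return vocab, merges

vocab, merges = train_bpe("low low low lower lowest", 258)
print(merges)
```

Recounting every pair from scratch on each iteration is what makes this version slow on large corpora. Optimized implementations typically update pair counts incrementally after each merge, which is likely where most of the extra effort goes.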

Simon Guo @simonguozirui

Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi

Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch:
- Build and Train a Tokenizer 🔤
- Write Triton kernels for Attention ⚡️
- Construct Scaling Laws 📉
- Implement GRPO 🐙

Marcel Rød @marcelroed · Aug 26

@james_r_lucas @alfcnz Can you post the code too? I wanna get some idea of how much you had to write for that result. Looks great!

James Lucas @james_r_lucas · Aug 23

I discovered the manim library this weekend. I guess my talk slides are getting an upgrade...
