Marcel Rød

4 posts

@marcelroed

PhD Student at Stanford working with Tatsu Hashimoto and Jure Leskovec. Previously MIT, Oxford, CERN

Stanford, CA · Joined March 2012
242 Following · 490 Followers
Marcel Rød retweeted
Arnuv Tandon @arnuvtandon
Excited to release a new paper today: “End-to-End Test-Time Training for Long Context”. Our method, TTT-E2E, enables models to continue learning at test time via next-token prediction on the given context, compressing the context into model weights. For our main result, we extend 3B-parameter models from 8K to 128K. TTT-E2E scales with context length like full attention without maintaining keys and values for every token in the sequence. With linear complexity, TTT-E2E is 2.7x faster than full attention at 128K tokens while achieving better performance. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…
Marcel Rød retweeted
Karan Dalal @karansdalal
Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…
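The core idea in the tweet above — at test time, keep running the same next-token prediction objective on the given context so the context is absorbed into the weights — can be illustrated with a toy sketch. This is not the paper's architecture or training recipe (see the linked paper and code for that); it is a minimal bigram model with hand-written SGD, where `ttt_adapt` is a hypothetical name standing in for the inner test-time loop:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ttt_adapt(W, context, lr=0.5, steps=3):
    """Test-time training sketch: SGD on next-token prediction over the context.

    W[prev] holds the logits for the token following `prev` (a toy bigram LM).
    """
    W = W.copy()
    for _ in range(steps):
        for prev, nxt in zip(context[:-1], context[1:]):
            p = softmax(W[prev])
            grad = p.copy()
            grad[nxt] -= 1.0      # d(cross-entropy)/d(logits) = softmax - one-hot
            W[prev] -= lr * grad
    return W

V = 4
W0 = np.zeros((V, V))             # uniform prior over next tokens
context = [0, 1, 0, 1, 0, 1, 0]   # a pattern the model should pick up at test time
W = ttt_adapt(W0, context)
p_before = softmax(W0[0])[1]      # P(next=1 | prev=0) before adaptation
p_after = softmax(W[0])[1]        # ... after adapting on the context
```

After adaptation the model assigns much higher probability to the pattern seen in the context, which is the "compress context into weights" effect in miniature; the real method does this with a full language model and an end-to-end objective.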
Joan Cabezas @josancamon19
BPE is easy in principle: 1) regex pre-tokenize, 2) split pre-tokens into bytes, 3) count pair frequencies, 4) merge the highest-frequency pair; repeat 3 and 4 until the target vocab size. Took me about 2 hours; the optimized version took me more than 27!
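The four steps above can be sketched directly. This is the naive version, not the optimized one the tweet mentions, and the regex is a simplified stand-in for a real pre-tokenizer pattern like GPT-2's:

```python
import re
from collections import Counter

def train_bpe(text, target_vocab=300):
    # 1) regex pre-tokenize (simplified pattern: words or single non-space chars)
    pretokens = re.findall(r"\w+|\S", text)
    # 2) turn each pre-token into a sequence of byte ids (0..255)
    words = [tuple(w.encode("utf-8")) for w in pretokens]
    vocab_size, next_id, merges = 256, 256, []
    while vocab_size < target_vocab:
        # 3) count adjacent-pair frequencies across all pre-tokens
        pairs = Counter()
        for w in words:
            for pair in zip(w, w[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        # 4) merge the most frequent pair into a new token id
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(tuple(out))
        words = merged
        vocab_size += 1
        next_id += 1
    return merges

merges = train_bpe("aaab aaab aaab", target_vocab=258)
```

Recounting every pair from scratch on each merge is what makes this slow; the optimized versions incrementally update pair counts only around the positions a merge touched.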
Simon Guo @simonguozirui
Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch:
- Build and Train a Tokenizer 🔤
- Write Triton kernels for Attention ⚡️
- Construct Scaling Laws 📉
- Implement GRPO 🐙

Marcel Rød @marcelroed
@james_r_lucas @alfcnz Can you post the code too? I wanna get some idea of how much you had to write for that result. Looks great!
James Lucas @james_r_lucas
I discovered the manim library this weekend. I guess my talk slides are getting an upgrade...