Ryan Ehrlich

56 posts

@ryansehrlich

AI Research Scientist at Meta

Joined August 2021
108 Following · 322 Followers
Pinned Tweet
Ryan Ehrlich@ryansehrlich·
Giving LLMs very large amounts of context can be really useful, but it can also be slow and expensive. Could scaling inference-time compute help? In our latest work, we show that allowing models to spend test-time compute to “self-study” a large corpus can increase decode throughput by more than 20x while maintaining downstream task performance. Our approach is simple:
1. Use the LLM to sample synthetic conversations about the corpus.
2. Using gradient descent, train a small adapter (we term it a Cartridge) on these synthetic conversations to “burn” the corpus into the adapter weights.
Surprisingly, parameterizing this adapter as a KV cache rather than a LoRA led to both better in-domain task performance and less forgetting of unrelated facts. There were a bunch of other interesting results like this: take a look at @EyubogluSabri's thread and the paper for more details about our methodology and results. Joint work with @EyubogluSabri, @simran_s_arora, @NeelGuha, @dylan_zinsley, @james_y_zou, @Azaliamirh, @HazyResearch & others!
Sabri Eyuboglu@EyubogluSabri

When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call Cartridges, can be trained once and reused across different user requests! GitHub: HazyResearch/cartridges

0 replies · 7 retweets · 34 likes · 8.1K views
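The "self-study" data-generation step described in the pinned tweet can be sketched in a few lines. Everything here is a hedged stand-in, not the paper's actual pipeline: `toy_llm` is a placeholder for a real LLM call, and the seed prompts are illustrative guesses at the kind of quiz/summarize prompts one might use.

```python
import random

# Minimal sketch of the "self-study" pipeline: sample synthetic
# conversations grounded in chunks of the corpus. `toy_llm` is a
# stand-in for a real LLM call (assumption, not the paper's code).
def toy_llm(prompt):
    # Placeholder: a real implementation would sample from the model.
    return f"(synthetic answer to: {prompt[:40]}...)"

# Hypothetical seed prompts; the actual recipe's prompts may differ.
SEED_PROMPTS = [
    "Summarize this passage:",
    "Ask and answer a question about this passage:",
    "Explain how this passage relates to the rest of the document:",
]

def self_study(corpus_chunks, num_conversations, rng):
    """Step 1: sample synthetic conversations about the corpus."""
    conversations = []
    for _ in range(num_conversations):
        chunk = rng.choice(corpus_chunks)
        seed = rng.choice(SEED_PROMPTS)
        prompt = f"{seed}\n\n{chunk}"
        conversations.append({"prompt": prompt, "response": toy_llm(prompt)})
    return conversations

# Step 2 (not shown): train a small KV-cache adapter ("Cartridge") on
# these conversations with gradient descent, then discard the raw corpus.
rng = random.Random(0)
data = self_study(["chunk A ...", "chunk B ..."], num_conversations=4, rng=rng)
```

The point of the sketch is the shape of the training data: once these synthetic conversations exist, the corpus itself is no longer needed at serving time.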
Ryan Ehrlich retweeted
Anne Ouyang@anneouyang·
KernelBench v0.1 is out, featuring:
- A guideline on analyzing the validity of results and ruling out physically impossible performance claims.
- Support for randomized testing beyond normal distributions.
- Fixed problem sizes and improved numerics.
8 replies · 32 retweets · 198 likes · 30.6K views
Ryan Ehrlich retweeted
Charles 🎉 Frye@charles_irl·
Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!
4 replies · 9 retweets · 78 likes · 9.7K views
Ryan Ehrlich retweeted
Jordan Juravsky@jordanjuravsky·
Check out Tokasaurus on Modal to make Llama-1B brrr! This repeated sampling example shows off two engine features that are important for serving small models: very low CPU overhead and automatic shared prefix exploitation with Hydragen.
Charles 🎉 Frye@charles_irl

Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!

1 reply · 8 retweets · 34 likes · 3.3K views
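The "automatic shared prefix exploitation" mentioned above (via Hydragen) rests on a simple observation: when sampling many completions from the same prompt, the prompt's KV cache only needs to be computed once. The following toy illustrates the idea only; `encode_prefix` and `sample_completion` are invented stand-ins for real prefill and decode.

```python
# Toy illustration of shared-prefix reuse (the idea behind Hydragen):
# sampling N completions from one prompt should trigger one prefill,
# not N. The functions below are stand-ins, not real engine code.

calls = {"prefill": 0}

def encode_prefix(prompt):
    # Stand-in for prefill; real engines build the prompt's KV cache here.
    calls["prefill"] += 1
    return [ord(c) % 7 for c in prompt]  # fake "KV cache"

def sample_completion(prefix_kv, sample_id):
    # Stand-in for decoding; real engines attend over prefix_kv here.
    return sum(prefix_kv) + sample_id

def repeated_sampling(prompt, n):
    prefix_kv = encode_prefix(prompt)  # computed once, shared by all samples
    return [sample_completion(prefix_kv, i) for i in range(n)]

outs = repeated_sampling("Solve: 2+2=?", n=8)
```

For small models, where prefill can dominate, amortizing it across many samples is a large share of the throughput win.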
Ryan Ehrlich@ryansehrlich·
@gwuah_ Very happy to hear - we tried hard to keep the paper simple + understandable.
1 reply · 0 retweets · 4 likes · 91 views
Ryan Ehrlich@ryansehrlich·
Totally fair point about the cost of creating a Cartridge. We spent almost no effort trying to make Cartridge construction efficient -- I'd bet one could make it 10-1000x faster, either through a meta network or a better training objective :) I think it's an exciting direction for future work!
0 replies · 0 retweets · 1 like · 54 views
Aryan Agal@aryanagxl·
Oh dang. This is exactly what everyone thought we would do, but no one thought you could just train a KV cache (at least I didn't, tbh). We have been RAGging our way through documents... so it gives a fair alternative. Only issue with this: updating Cartridges is expensive, and you may end up spending tons of compute with every delta in the underlying text. Great for classic textbooks though!
will brown@willccbb

can't stop thinking about this one. insanely elegant, seems insanely powerful

1 reply · 0 retweets · 4 likes · 270 views
will brown@willccbb·
can't stop thinking about this one. insanely elegant, seems insanely powerful
26 replies · 52 retweets · 841 likes · 102.1K views
Ryan Ehrlich@ryansehrlich·
Thank you for the kind words -- we can't either! We're really excited about models learning new things and remembering their experience, and we think Cartridges is a step towards that future. My co-author @EyubogluSabri will be giving a talk on Cartridges at @ESFoMo at ICML: if you're interested in long context or online learning, I'd recommend going! None of this work would have been possible without @PrimeIntellect, @modal_labs, @togethercompute, and @VoltagePark sponsoring the compute :)
will brown@willccbb

can't stop thinking about this one. insanely elegant, seems insanely powerful

2 replies · 6 retweets · 56 likes · 5.5K views
Ryan Ehrlich retweeted
Jacky Kwok@jackyk02·
✨ Test-Time Scaling for Robotics ✨ Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs! 🧵(1 / N)
🌐 Website: robomonkey-vla.github.io
💻 Code: github.com/robomonkey-vla…
🗂️ Datasets and Models: huggingface.co/robomonkey-vla
🚀 Serving Engine: github.com/robomonkey-vla…
📄 Paper: arxiv.org/abs/2506.17811
2 replies · 14 retweets · 71 likes · 180.2K views
Ryan Ehrlich retweeted
Jerry Liu@jerrywliu·
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
13 replies · 119 retweets · 634 likes · 88.3K views
Ryan Ehrlich retweeted
Jon Saad-Falcon@JonSaadFalcon·
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)
11 replies · 59 retweets · 223 likes · 76.4K views
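The core move in the Weaver tweet above, combining multiple weak verifiers to select among candidate answers, can be sketched as a weighted score aggregation. The verifiers and weights below are toy stand-ins invented for illustration, not the paper's actual reward models, judges, or learned weighting.

```python
# Hedged sketch of weak-verifier combination: score each candidate
# answer with every verifier, take a weighted sum, and return the
# argmax. Verifiers/weights here are toy stand-ins, not Weaver's.

def combine_scores(candidates, verifiers, weights):
    """Return the candidate maximizing the weighted verifier score."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(w * v(cand) for v, w in zip(verifiers, weights))
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy verifiers: each alone is only weakly correlated with correctness.
length_ok = lambda a: 1.0 if len(a["text"]) < 20 else 0.0
has_answer = lambda a: 1.0 if "42" in a["text"] else 0.0

candidates = [{"text": "maybe 41"}, {"text": "the answer is 42"}]
best = combine_scores(candidates, [length_ok, has_answer], [0.3, 0.7])
```

The interesting part in practice is choosing the weights; a uniform or hand-set weighting is the naive baseline that a framework like this would improve on.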
Ryan Ehrlich retweeted
Silas Alberti@silasalberti·
I recently had a stint as a systems engineer. We built our own VM hypervisor, otterlink, for 10x faster startup and 200x faster snapshots (compared to EC2). One cool part was building a custom file format, Blockdiff, for VM disk snapshots. Sharing more in our blog post below:
Cognition@cognition

We needed instant VM snapshots for Devin but EC2 took 30+ minutes. So, @silasalberti built blockdiff—a new file format that makes snapshots 200x faster. Today, we’re open-sourcing blockdiff & sharing how it works 🔗👇

22 replies · 44 retweets · 402 likes · 75.6K views
Ryan Ehrlich retweeted
Geoffrey Angus@GeoffreyAngus·
Struggling with context management? Wish you could just stick it all in your model? We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high throughput 🧵
1 reply · 11 retweets · 45 likes · 6.8K views
Ryan Ehrlich retweeted
Sabri Eyuboglu@EyubogluSabri·
An advantage of training a cache/prefix (as opposed to a LoRA adapter) is that we can serve per-user Cartridges using the same optimizations and kernels that inference engines already use for per-user KV caches. @GeoffreyAngus just integrated Cartridges into Tokasaurus (a high-throughput vLLM/SGLang alternative). The interface is super clean -- users can add any Cartridge to their chat completion requests without a significant reduction in tok/s
Geoffrey Angus@GeoffreyAngus

Struggling with context management? Wish you could just stick it all in your model? We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high throughput 🧵

1 reply · 5 retweets · 18 likes · 2.2K views
Ryan Ehrlich@ryansehrlich·
@simon_jegou @EyubogluSabri Ah, got it. We had tried sampling q vectors directly; it'll be interesting to try sampling from the distribution of hidden states!
0 replies · 0 retweets · 0 likes · 37 views
Simon Jegou@simon_jegou·
@ryansehrlich @EyubogluSabri Hidden states follow a "Gaussian" distribution (this is not really well known), so you can sample a hidden state and then apply Wq and RoPE. Check the code, it's open source :)
1 reply · 0 retweets · 0 likes · 59 views
Sabri Eyuboglu@EyubogluSabri·
When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call Cartridges, can be trained once and reused across different user requests! GitHub: HazyResearch/cartridges
17 replies · 72 retweets · 347 likes · 96.6K views
Ryan Ehrlich@ryansehrlich·
I really like this idea! Some of our initial experiments explored generating synthetic query vectors and applying a per-layer distillation loss, pushing the output of the trainable attention to be close to the output of the reference model. We couldn't get this to work well and transitioned to the approach in the paper. I think there were two problems: picking the distribution to sample the q vectors from, and the optimization task being too specific (maybe applying a per-layer loss is too constraining vs. just applying a KL on the logits?). What distribution do you sample the synthetic queries from?
1 reply · 0 retweets · 1 like · 53 views
Simon Jegou@simon_jegou·
@EyubogluSabri Very interesting! Do you think we could replace synthetic conversation generation with synthetic queries to make it faster? I quickly experimented with this approach last year (see x.com/simon_jegou/st… and x.com/simon_jegou/st…)
Simon Jegou@simon_jegou

I created a DistillationPress that distills the (K,V) cache into a compressed (Kc,Vc) cache by minimizing ||A(q,K,V) - A(q,Kc,Vc)||^2. Check out my notebook here: github.com/NVIDIA/kvpress…. More work needs to be done; it's just a first step (3/3)

1 reply · 0 retweets · 1 like · 243 views
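The distillation objective quoted above, ||A(q,K,V) - A(q,Kc,Vc)||^2, can be made concrete with a tiny single-head attention. The dimensions, values, and the choice of "compression" (simply dropping the last cache entry) are toy assumptions for illustration; the real DistillationPress optimizes Kc, Vc by gradient descent.

```python
import math

# Numeric sketch of the KV-distillation objective:
#   L = || A(q, K, V) - A(q, Kc, Vc) ||^2
# where A is single-head scaled dot-product attention.
# All values below are toy stand-ins.

def attention(q, K, V):
    """Scaled dot-product attention for one query over a KV cache."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    w = [e / Z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

def distill_loss(q, K, V, Kc, Vc):
    """Squared error between full-cache and compressed-cache outputs."""
    a, ac = attention(q, K, V), attention(q, Kc, Vc)
    return sum((x - y) ** 2 for x, y in zip(a, ac))

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # full cache: 3 entries
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Kc, Vc = K[:2], V[:2]                       # "compressed" cache: 2 entries
loss = distill_loss(q, K, V, Kc, Vc)
```

Minimizing this loss over Kc, Vc (for queries q drawn from some distribution, the crux discussed in the thread) yields a smaller cache whose attention outputs approximate the full cache's.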