Ryan Ehrlich

56 posts

@ryansehrlich

AI Research Scientist at Meta

Joined August 2021
108 Following · 322 Followers
Pinned Tweet
Ryan Ehrlich@ryansehrlich·
Giving LLMs very large amounts of context can be really useful, but it can also be slow and expensive. Could scaling inference-time compute help? In our latest work, we show that allowing models to spend test-time compute to “self-study” a large corpus can increase decode throughput by more than 20x while maintaining downstream task performance. Our approach is simple:
1. Use the LLM to sample synthetic conversations about the corpus.
2. Using gradient descent, train a small adapter (we term it a Cartridge) on these synthetic conversations to “burn” the corpus into the adapter weights.
Surprisingly, parameterizing this adapter as a KV cache rather than a LoRA led to both better in-domain task performance and less forgetting of unrelated facts. There were a bunch of other interesting results like this: take a look at @EyubogluSabri's thread and the paper for more details about our methodology and results. Joint work with @EyubogluSabri, @simran_s_arora, @NeelGuha, @dylan_zinsley, @james_y_zou, @Azaliamirh, @HazyResearch & others!
Sabri Eyuboglu@EyubogluSabri

When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call Cartridges, can be trained once and reused across different user requests! GitHub: HazyResearch/cartridges

0 replies · 7 retweets · 34 likes · 8.1K views
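The "self-study" data-generation step described in the pinned tweet can be sketched in a few lines. Everything here is a hedged stand-in, not the paper's actual pipeline: `toy_llm` is a placeholder for a real LLM call, and the seed prompts are illustrative guesses at the kind of quiz/summarize prompts one might use.

```python
import random

# Minimal sketch of the "self-study" pipeline: sample synthetic
# conversations grounded in chunks of the corpus. `toy_llm` is a
# stand-in for a real LLM call (assumption, not the paper's code).
def toy_llm(prompt):
    # Placeholder: a real implementation would sample from the model.
    return f"(synthetic answer to: {prompt[:40]}...)"

# Hypothetical seed prompts; the actual recipe's prompts may differ.
SEED_PROMPTS = [
    "Summarize this passage:",
    "Ask and answer a question about this passage:",
    "Explain how this passage relates to the rest of the document:",
]

def self_study(corpus_chunks, num_conversations, rng):
    """Step 1: sample synthetic conversations about the corpus."""
    conversations = []
    for _ in range(num_conversations):
        chunk = rng.choice(corpus_chunks)
        seed = rng.choice(SEED_PROMPTS)
        prompt = f"{seed}\n\n{chunk}"
        conversations.append({"prompt": prompt, "response": toy_llm(prompt)})
    return conversations

# Step 2 (not shown): train a small KV-cache adapter ("Cartridge") on
# these conversations with gradient descent, then discard the raw corpus.
rng = random.Random(0)
data = self_study(["chunk A ...", "chunk B ..."], num_conversations=4, rng=rng)
```

The point of the sketch is the shape of the training data: once these synthetic conversations exist, the corpus itself is no longer needed at serving time.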
Ryan Ehrlich retweeted
Anne Ouyang@anneouyang·
KernelBench v0.1 is out, featuring:
- A guideline on analyzing the validity of results and ruling out physically impossible performance claims.
- Support for randomized testing beyond normal distributions.
- Fixed problem sizes and improved numerics.
8 replies · 32 retweets · 198 likes · 30.6K views
Ryan Ehrlich retweeted
Charles 🎉 Frye@charles_irl·
Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!
4 replies · 9 retweets · 78 likes · 9.7K views
Ryan Ehrlich retweeted
Jordan Juravsky@jordanjuravsky·
Check out Tokasaurus on Modal to make Llama-1B brrr! This repeated sampling example shows off two engine features that are important for serving small models: very low CPU overhead and automatic shared prefix exploitation with Hydragen.
Charles 🎉 Frye@charles_irl

Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!

1 reply · 8 retweets · 34 likes · 3.3K views
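The "automatic shared prefix exploitation" mentioned above (via Hydragen) rests on a simple observation: when sampling many completions from the same prompt, the prompt's KV cache only needs to be computed once. The following toy illustrates the idea only; `encode_prefix` and `sample_completion` are invented stand-ins for real prefill and decode.

```python
# Toy illustration of shared-prefix reuse (the idea behind Hydragen):
# sampling N completions from one prompt should trigger one prefill,
# not N. The functions below are stand-ins, not real engine code.

calls = {"prefill": 0}

def encode_prefix(prompt):
    # Stand-in for prefill; real engines build the prompt's KV cache here.
    calls["prefill"] += 1
    return [ord(c) % 7 for c in prompt]  # fake "KV cache"

def sample_completion(prefix_kv, sample_id):
    # Stand-in for decoding; real engines attend over prefix_kv here.
    return sum(prefix_kv) + sample_id

def repeated_sampling(prompt, n):
    prefix_kv = encode_prefix(prompt)  # computed once, shared by all samples
    return [sample_completion(prefix_kv, i) for i in range(n)]

outs = repeated_sampling("Solve: 2+2=?", n=8)
```

For small models, where prefill can dominate, amortizing it across many samples is a large share of the throughput win.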
Ryan Ehrlich@ryansehrlich·
@gwuah_ Very happy to hear - we tried hard to keep the paper simple + understandable.
1 reply · 0 retweets · 4 likes · 91 views
Ryan Ehrlich@ryansehrlich·
Totally fair point about the cost of creating a Cartridge. We spent almost no effort trying to make Cartridge construction efficient -- I'd bet one could make it 10-1000x faster, either through a meta network or a better training objective :) I think it's an exciting direction for future work!
0 replies · 0 retweets · 1 like · 54 views
Aryan Agal@aryanagxl·
Oh dang. This is exactly what everyone thought we would do, but no one thought you could just train a KV cache (at least I didn't, tbh). We have been RAGging our way through documents... so it gives a fair alternative. Only issue with this: updating Cartridges is expensive, and you may end up spending tons of compute with every delta in the underlying text. Great for classic textbooks though!
will brown@willccbb

can't stop thinking about this one. insanely elegant, seems insanely powerful

1 reply · 0 retweets · 4 likes · 270 views
will brown@willccbb·
can't stop thinking about this one. insanely elegant, seems insanely powerful
26 replies · 52 retweets · 841 likes · 102.1K views
Ryan Ehrlich@ryansehrlich·
Thank you for the kind words -- we can't either! We're really excited about models learning new things and remembering their experience, and we think Cartridges is a step towards that future. My co-author @EyubogluSabri will be giving a talk on Cartridges at @ESFoMo at ICML: if you're interested in long context or online learning, I'd recommend going! None of this work would have been possible without @PrimeIntellect, @modal_labs, @togethercompute, and @VoltagePark sponsoring the compute :)
will brown@willccbb

can't stop thinking about this one. insanely elegant, seems insanely powerful

2 replies · 6 retweets · 56 likes · 5.5K views
Ryan Ehrlich retweeted
Jacky Kwok@jackyk02·
✨ Test-Time Scaling for Robotics ✨ Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs! 🧵(1 / N)
🌐 Website: robomonkey-vla.github.io
💻 Code: github.com/robomonkey-vla…
🗂️ Datasets and Models: huggingface.co/robomonkey-vla
🚀 Serving Engine: github.com/robomonkey-vla…
📄 Paper: arxiv.org/abs/2506.17811
2 replies · 14 retweets · 71 likes · 180.2K views
Ryan Ehrlich retweeted
Jerry Liu@jerrywliu·
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
13 replies · 119 retweets · 634 likes · 88.3K views
Ryan Ehrlich retweeted
Jon Saad-Falcon@JonSaadFalcon·
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)
11 replies · 59 retweets · 223 likes · 76.4K views
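The core move in the Weaver tweet above, combining multiple weak verifiers to select among candidate answers, can be sketched as a weighted score aggregation. The verifiers and weights below are toy stand-ins invented for illustration, not the paper's actual reward models, judges, or learned weighting.

```python
# Hedged sketch of weak-verifier combination: score each candidate
# answer with every verifier, take a weighted sum, and return the
# argmax. Verifiers/weights here are toy stand-ins, not Weaver's.

def combine_scores(candidates, verifiers, weights):
    """Return the candidate maximizing the weighted verifier score."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(w * v(cand) for v, w in zip(verifiers, weights))
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy verifiers: each alone is only weakly correlated with correctness.
length_ok = lambda a: 1.0 if len(a["text"]) < 20 else 0.0
has_answer = lambda a: 1.0 if "42" in a["text"] else 0.0

candidates = [{"text": "maybe 41"}, {"text": "the answer is 42"}]
best = combine_scores(candidates, [length_ok, has_answer], [0.3, 0.7])
```

The interesting part in practice is choosing the weights; a uniform or hand-set weighting is the naive baseline that a framework like this would improve on.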
Ryan Ehrlich retweeted
Silas Alberti@silasalberti·
I recently had a stint as a systems engineer. We built our own VM hypervisor, otterlink, for 10x faster startup and 200x faster snapshots (compared to EC2). One cool part was building a custom file format, Blockdiff, for VM disk snapshots. Sharing more in our blog post below:
Cognition@cognition

We needed instant VM snapshots for Devin but EC2 took 30+ minutes. So, @silasalberti built blockdiff—a new file format that makes snapshots 200x faster. Today, we’re open-sourcing blockdiff & sharing how it works 🔗👇

22 replies · 44 retweets · 402 likes · 75.6K views
Ryan Ehrlich retweeted
Geoffrey Angus@GeoffreyAngus·
Struggling with context management? Wish you could just stick it all in your model? We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high throughput 🧵
1 reply · 11 retweets · 45 likes · 6.8K views
Ryan Ehrlich retweeted
Sabri Eyuboglu@EyubogluSabri·
An advantage of training a cache/prefix (as opposed to a LoRA adapter) is that we can serve per-user Cartridges using the same optimizations and kernels that inference engines already use for per-user KV caches. @GeoffreyAngus just integrated Cartridges into Tokasaurus (a high-throughput vLLM/SGLang alternative). The interface is super clean -- users can add any Cartridge to their chat completion requests without a significant reduction in tok/s
Geoffrey Angus@GeoffreyAngus

Struggling with context management? Wish you could just stick it all in your model? We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high throughput 🧵

1 reply · 5 retweets · 18 likes · 2.2K views
Ryan Ehrlich@ryansehrlich·
@simon_jegou @EyubogluSabri Ah, got it. We had tried sampling q vectors directly; it'll be interesting to try sampling from the distribution of hidden states!
0 replies · 0 retweets · 0 likes · 37 views
Simon Jegou@simon_jegou·
@ryansehrlich @EyubogluSabri Hidden states follow a "Gaussian" distribution (this is not really well known), so you can sample a hidden state and then apply Wq and RoPE. Check the code, it's open source :)
1 reply · 0 retweets · 0 likes · 59 views
Sabri Eyuboglu@EyubogluSabri·
When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call Cartridges, can be trained once and reused across different user requests! GitHub: HazyResearch/cartridges
17 replies · 72 retweets · 347 likes · 96.6K views
Ryan Ehrlich@ryansehrlich·
I really like this idea! Some of our initial experiments explored generating synthetic query vectors and applying a per-layer distillation loss, pushing the output of the trainable attention to be close to the output of the reference model. We couldn't get this to work well and transitioned to the approach in the paper. I think there were two problems: picking the distribution to sample the q vectors from, and the optimization task being too specific (maybe applying a per-layer loss is too constraining vs. just applying a KL on the logits?). What distribution do you sample the synthetic queries from?
1 reply · 0 retweets · 1 like · 53 views
Simon Jegou@simon_jegou·
@EyubogluSabri Very interesting! Do you think we could replace synthetic conversation generation with synthetic queries to make it faster? I quickly experimented with this approach last year (see x.com/simon_jegou/st… and x.com/simon_jegou/st…)
Simon Jegou@simon_jegou

I created a DistillationPress that distills the (K,V) cache into a compressed (Kc,Vc) cache by minimizing ||A(q,K,V) - A(q,Kc,Vc)||^2. Check out my notebook here: github.com/NVIDIA/kvpress…. More work needs to be done; it's just a first step (3/3)

1 reply · 0 retweets · 1 like · 243 views
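The distillation objective quoted above, ||A(q,K,V) - A(q,Kc,Vc)||^2, can be made concrete with a tiny single-head attention. The dimensions, values, and the choice of "compression" (simply dropping the last cache entry) are toy assumptions for illustration; the real DistillationPress optimizes Kc, Vc by gradient descent.

```python
import math

# Numeric sketch of the KV-distillation objective:
#   L = || A(q, K, V) - A(q, Kc, Vc) ||^2
# where A is single-head scaled dot-product attention.
# All values below are toy stand-ins.

def attention(q, K, V):
    """Scaled dot-product attention for one query over a KV cache."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    w = [e / Z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

def distill_loss(q, K, V, Kc, Vc):
    """Squared error between full-cache and compressed-cache outputs."""
    a, ac = attention(q, K, V), attention(q, Kc, Vc)
    return sum((x - y) ** 2 for x, y in zip(a, ac))

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # full cache: 3 entries
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Kc, Vc = K[:2], V[:2]                       # "compressed" cache: 2 entries
loss = distill_loss(q, K, V, Kc, Vc)
```

Minimizing this loss over Kc, Vc (for queries q drawn from some distribution, the crux discussed in the thread) yields a smaller cache whose attention outputs approximate the full cache's.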