Joshua Gu

46 posts

@astrogu_

CS Phd student @MIT_CSAIL, @MITEECS, @MIT 👨‍💻| Previous: @LMCache Lab, @tensormesh, BS @UChicago. Research on AI Systems

Chicago, IL · Joined December 2023

182 Following · 269 Followers

Pinned Tweet

Joshua Gu@astrogu_·5d

Recent agentic systems (Claude Code, Codex, RLM, etc.) push context out of the prompt and into the environment (e.g., as files). This helps them maintain long-term knowledge about their goals and functionality.

🚨 While this is a good idea, we show a surprising result: systems that use external environments like this perform much better when given a small, fixed-size, in-context, agent-managed cache that "𝘱𝘦𝘦𝘬𝘴 𝘪𝘯𝘵𝘰" these environments.

🚀 Our paper, 𝗣𝗘𝗘𝗞: 𝙖 𝙨𝙮𝙨𝙩𝙚𝙢 𝙛𝙤𝙧 𝙗𝙪𝙞𝙡𝙙𝙞𝙣𝙜 𝙖𝙣𝙙 𝙢𝙖𝙞𝙣𝙩𝙖𝙞𝙣𝙞𝙣𝙜 𝗮𝗻 𝗼𝗿𝗶𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗰𝗮𝗰𝗵𝗲 𝙛𝙤𝙧 𝙇𝙇𝙈 𝙖𝙜𝙚𝙣𝙩𝙨, introduces this idea.

Compared with strong baselines, including RAG, Compaction Agents, and SOTA prompt-learning frameworks, PEEK dominates the cost–quality Pareto frontier: achieving +6.3–34.0% in quality, with fewer iterations and lower cost.

Paper: arxiv.org/abs/2605.19932
GitHub: github.com/zhuohangu/peek

More in the thread below! (1/N)

343

102.7K

Joshua Gu@astrogu_·2d

@IntuitMachine @Microsoft Thanks for sharing! Small correction: PEEK is from @MIT_CSAIL and @Stanford, not Microsoft :)

149

Carlos E. Perez@IntuitMachine·2d

🧵 PEEK: The 1k-Token Map That Just Killed the Long-Context Tax

Your LLM agent is reading the same 50k-token codebase for the 20th time. It still doesn't know where anything is.

PEEK from @Microsoft just changed that with a 1k-token "context map" that:
• ↑ 34% accuracy
• ↓ 93–145 fewer retries
• 5.8× cheaper than prompt tuning

Here's how: 🧵

Every time you ask GPT-5 a new question about the same repo, it re-discovers:
→ File structure
→ Key classes
→ How modules connect

You're paying for the same orientation work. Again. And again. Industry calls this "the long-context tax."

PEEK's breakthrough: Separate "context understanding" from "task execution." Instead of stuffing everything into the prompt or retrieving blindly, agents now maintain a tiny persistent map, like a cheat-sheet they write once and reuse forever.

The Context Map has 5 sections:
1⃣ Context Roadmap: high-level structure
2⃣ Context Understanding: key entities/relationships
3⃣ Domain Constants (if needed)
4⃣ Parsing Schemas
5⃣ Reusable Results (cached answers)

Budget: exactly 1,024 tokens.

Three modules keep it fresh without bloat:
🔍 Distiller → Extracts only transferable orientation knowledge
✏️ Cartographer → Makes clean, deduplicated edits (ADD/DELETE/REPLACE)
🗑️ Evictor → Drops low-priority items when budget fills

Separation matters: mixed roles = noise + duplication.

Tested on OOLONG + CL-bench (coding benchmarks):

Metric | Gain vs. ACE (SOTA)
Accuracy | +6–34%
Iterations saved | 93–145 fewer
Cost reduction | 1.4–5.8× cheaper

Same base model. Same agent. Just 1k tokens of orientation cache.

Here's the efficiency secret: Freeze the map after 1–4 queries. You get 80%+ of the gains but near-zero maintenance cost after that. Most "learning" systems never stop updating → wasted compute. PEEK learns fast, then locks in.

How PEEK beats the field:
❌ RAG: retrieves fragments, no holistic structure
❌ Summarization: compresses content, not orientation
❌ ACE/prompt tuning: optimizes tasks, not context understanding
✅ PEEK: caches the mental model your agent should have built on day 1

Devil's advocate: PEEK wins when context is structured and queries recur. If you're writing one-off creative fiction or chatting about random PDFs, the map has less to cache. But for repos, enterprise docs, analytics? This is the new baseline.

Traditional stack:
→ Bigger context windows
→ Better retrieval
→ Smarter prompts

New stack:
→ Bigger context windows
→ Better retrieval
→ Persistent orientation caches

Context understanding just became a first-class versioned artifact.

Two multipliers you can stack today:
1⃣ PEEK-style maps (↓ redundant reasoning)
2⃣ KV-cache optimizations (↓ redundant token processing)

Combine them = multiplicative inference savings. The next wave of agent infra will bake both in by default.

If you're building agents that interact with the same long contexts repeatedly:
→ Stop re-engineering prompts every query
→ Start caching orientation knowledge

The 1k-token map is the missing cache layer. Use it. /end 🧵
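The map described in the thread above (five sections, a fixed token budget, ADD/DELETE/REPLACE edits, eviction of low-priority items, and freezing after a few queries) can be sketched roughly as follows. This is an illustrative toy, not the PEEK implementation: the class names, the integer `priority` field, and the whitespace-based `count_tokens` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Section names as listed in the thread.
SECTIONS = (
    "Context Roadmap",
    "Context Understanding",
    "Domain Constants",
    "Parsing Schemas",
    "Reusable Results",
)


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


@dataclass
class Entry:
    section: str
    text: str
    priority: int  # hypothetical score; higher means more worth keeping


@dataclass
class ContextMap:
    budget: int = 1024
    entries: Dict[str, Entry] = field(default_factory=dict)
    frozen: bool = False

    def size(self) -> int:
        return sum(count_tokens(e.text) for e in self.entries.values())

    def apply(self, op: str, key: str, entry: Optional[Entry] = None) -> None:
        """Apply one ADD / REPLACE / DELETE edit, then enforce the budget."""
        if self.frozen:
            return  # a frozen map ignores further edits
        if op in ("ADD", "REPLACE"):
            if entry is None or entry.section not in SECTIONS:
                raise ValueError("ADD/REPLACE needs an entry in a known section")
            if op == "ADD" and key in self.entries:
                return  # deduplicate: never add the same key twice
            if op == "REPLACE" and key not in self.entries:
                return  # nothing to replace
            self.entries[key] = entry
        elif op == "DELETE":
            self.entries.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
        self.evict()

    def evict(self) -> None:
        # Drop the lowest-priority entries until the map fits the budget.
        while self.size() > self.budget and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k].priority)
            del self.entries[victim]

    def render(self) -> str:
        # Text that would be placed in the agent's prompt.
        lines = []
        for section in SECTIONS:
            items = [e.text for e in self.entries.values() if e.section == section]
            if items:
                lines.append(f"[{section}]")
                lines.extend(f"- {t}" for t in items)
        return "\n".join(lines)
```

In this sketch, setting `frozen = True` after the first few queries turns every later `apply` call into a no-op, which mirrors the "learn fast, then lock in" behavior the thread describes.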

153

10.5K

Joshua Gu@astrogu_·2d

@lihanc02 @kobe0938 fooood！

Hanchen Li@lihanc02·2d

Had a bet today with @kobe0938 for a good dinner. I bet NVIDIA reaches 2.5T before 20T. He bets 20T before 2.5T. Who do you think will win?

980

Joshua Gu@astrogu_·4d

@DbrxMosaicAI is this not just an RLM?

Databricks AI Research@DbrxMosaicAI·6d

New research from Databricks: the context window is the only persistent substrate today's LLM agents have, and it floods fast. A single SQL query can return millions of rows that ride along in every subsequent turn, even when only one cell ever mattered.

We hit this constraint every day in the agents we run in production, from Genie to Agent Bricks' Supervisor Agent to KARL.

In a new post from the Databricks research team, we introduce MemEx: a programmable Python scratchpad that lets agents transform, slice, and persist tool outputs as typed objects in a live kernel. Same observe-act loop. Different action space.

Across nine frontier and open-weight models on two enterprise agentic tasks (OfficeQA Pro and Enterprise Structured Retrieval):
• Frontier models (Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro) gain 2 to 5 accuracy points at 25 to 30% lower cost
• Qwen 122B and Qwen 397B nearly double accuracy at 40 to 50% lower cost
• Four of the five points on the OfficeQA Pro cost-accuracy Pareto frontier are MemEx configurations

MemEx extends the code-as-action line (CodeAct, Anthropic Programmatic Tool Calling, Cloudflare Code Mode) with persistent scope across turns, eager spawn_agent for parallel sub-agents that share the parent's namespace, typed submit() for validated returns, and live-object scope injection. Built on aroll, the same Databricks agentic rollouts framework already powering those production systems.

MemEx is rolling out across Databricks first-party agents and Agent Bricks soon. If you build on Databricks agents today, you'll be able to try it.

Full write-up: databricks.com/blog/memex-pro…
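The general idea of keeping large tool outputs in a persistent Python namespace, and returning only a short summary to the model's context, can be sketched as below. This is not the MemEx API: the `Scratchpad` class, its `put` and `run` methods, and the convention of reading a variable named `result` are all hypothetical names chosen for illustration.

```python
class Scratchpad:
    """Toy persistent namespace shared across agent turns (illustrative only)."""

    def __init__(self):
        self.namespace = {}

    def put(self, name, value):
        # Store the full tool output out of context; return a short summary.
        self.namespace[name] = value
        return f"{name}: {type(value).__name__}, len={len(value)}"

    def run(self, code):
        # Execute agent-written code against the persistent namespace.
        # Assumed convention: the code assigns its answer to `result`.
        exec(code, self.namespace)
        return repr(self.namespace.get("result"))[:200]


pad = Scratchpad()

# Turn 1: a tool returns many rows; only the summary enters the context.
rows = [{"id": i, "amount": i * 10} for i in range(1000)]
print(pad.put("rows", rows))

# Turn 2: the agent slices out the one cell it needs.
print(pad.run("result = next(r['amount'] for r in rows if r['id'] == 7)"))

# Turn 3: earlier variables are still in scope.
print(pad.run("result = result + len(rows)"))
```

Running arbitrary code with `exec` is unsafe outside a sandbox; a real system would isolate the kernel, which this sketch does not attempt.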

188

154K

Joshua Gu@astrogu_·4d

@qcyang20xx 🫪🫪🫪

Qingchuan (Tony) Yang@qcyang20xx·12 Mar

𝗣𝗿𝗶𝘃𝗮𝘁𝗲 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝘁𝗲𝘅𝘁 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 has had the same problem for a while: privacy, quality, or efficiency - pick two 😵‍💫 We think 𝐄𝐏𝐒𝐕𝐞𝐜 changes that 🚀 Paper: arxiv.org/abs/2602.21218

6.9K

Joshua Gu@astrogu_·5d

@lihanc02 not really. skills progressively load static, task-facing instructions. PEEK maintains external context state and is not about deciding when to load more files

1.3K

Hanchen Li@lihanc02·5d

@astrogu_ Claude skill does sth similar though? You get to peek the concise versions and then if it is useful you load the full context

687

Joshua Gu@astrogu_·5d

Acknowledgements Special thanks to @lateinteraction (@MIT_CSAIL, @nlp_mit) and @samrmadden (@MIT_CSAIL) for their invaluable mentorship and guidance. I’m also grateful to my amazing collaborator @qizhengz_alex from @Stanford. The main results of PEEK build on the RLM infrastructure by @a1zhang, and I learned a lot from discussions with Alex. I also want to thank the wonderful MIT DSG and MIT OASYS labs, my fellow @MITEECS students, and my family for their support during the first year of my PhD. I also thank my friends at @UChicago, @Stanford, @UCBerkeley, @Harvard, and @BrownUniversity for their insightful discussions and feedback. Without any of these people, this paper would have either never become good or never been completed. (6/N, end of thread)

932

Joshua Gu@astrogu_·5d

Discussion We see this as a first step toward a new solution in a new paradigm for agentic systems: 𝗮 𝗴𝗲𝗻𝘂𝗶𝗻𝗲 𝗮𝗴𝗲𝗻𝘁-𝘀𝗶𝗱𝗲 𝗰𝗮𝗰𝗵𝗲 𝗳𝗼𝗿 𝗟𝗠 𝗮𝗴𝗲𝗻𝘁𝘀. We view PEEK as opening a broader research agenda: how should agents build and maintain a persistent understanding of the external contexts they interact with repeatedly? 👨‍💻 If you build agents that operate over long and recurring external contexts: 👀 Read the paper: arxiv.org/abs/2605.19932 🛠️ Try PEEK: github.com/zhuohangu/peek We’re very excited by all the current interest and future work on 𝗣𝗘𝗘𝗞! (6/N)

1.1K

Joshua Gu@astrogu_·27 Mar

Does anyone know good datasets that have multiple (long) contexts, and each context has multiple tasks querying over that same context? Ideally, it’s natively structured this way and doesn’t require self-construction. A reference is github.com/abertsch72/ool….

130

Joshua Gu retweeted

Hanchen Li@lihanc02·10 Mar

x.com/i/article/2031…

13.3K

Joshua Gu@astrogu_·19 Feb

As a die-hard fan of Nolan’s Batman, this might be my quote of the year…🦇

Omar Khattab@lateinteraction

You either die a state of the art or live long enough to see yourself become the baseline

156

Joshua Gu retweeted

LMCache Lab@lmcache·18 Sep

A deep dive on LMCache x NVIDIA Dynamo: blog.lmcache.ai/2025-09-18-dyn… to learn how LMCache integrates with NVIDIA Dynamo to slash KV-cache bottlenecks and push LLM inference efficiency to the next level. We’re also honored to be featured in NVIDIA’s official blog on Dynamo: developer.nvidia.com/blog/how-to-re… #LMCache #NVIDIA #Dynamo #LLM #Inference #KVcache

6.1K

Joshua Gu retweeted

LMCache Lab@lmcache·6 Aug

LMCache supports gpt-oss (20B/120B) on Day 1! TTFT 1.20s → 0.39s (-67.5%), finish time 15.70s → 7.73s (-50.7%) compared to Vanilla vLLM. Release the true power of GPT-OSS with vllm+LMCache -- full deployment tutorial here: blog.lmcache.ai/2025-08-05-gpt… #LMCache #vLLM #OpenAI #LLM #AIInfra #GPTOSS
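The percentage reductions quoted above follow from the before/after times. A minimal check, using only the rounded values given in the tweet (the function name `percent_change` is ours):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


ttft = percent_change(1.20, 0.39)     # time to first token, seconds
finish = percent_change(15.70, 7.73)  # total finish time, seconds

print(f"TTFT: {ttft:.1f}%")
print(f"Finish time: {finish:.1f}%")
```

From these rounded inputs the first figure matches the quoted -67.5%. The second comes out near -50.8% rather than -50.7%, a small gap that may simply reflect rounding in the reported times.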

1.7K

Joshua Gu retweeted

LMCache Lab@lmcache·3 Aug

🚀 Big news from LMCache Lab!

📝 3 papers accepted at SOSP ’25 & NSDI ’26, pushing the frontier of LLM-inference efficiency:
1️⃣ Cross-agent KV-cache sharing (NSDI) 🔗 arxiv.org/abs/2411.02820
2️⃣ Custom design for LLM prefillers (SOSP) 🔗 arxiv.org/abs/2505.07203
3️⃣ Workload-adaptive RAG serving (SOSP) 🔗 arxiv.org/abs/2412.10543

These results are the research backbone of LMCache, our open-source KV-caching layer for real-world LLM deployment. Repo: github.com/LMCache/LMCache

🎉 Huge congrats to lead authors @YuhanLiu14, @this_will_echo & @siddhantrayyy!

As Ion Stoica says: “LMCache, a project within the vLLM ecosystem, shows how academic research can drive real-world impact.”

If you care about speed, cost, and scale for production LLMs, give LMCache a spin!