Derek Deming

451 posts

Derek Deming

@ddeming_

Building something new; engineering and infra; prev: @MSFTResearch | ex PhD (ABD / dropped out)

Boston, MA Bergabung Şubat 2024

1.8K Mengikuti223 Pengikut

Derek Deming@ddeming_·2d

Cursor may have single handedly ruined the open source model release community… I imagine many companies who release open source models will re-evaluate the licensing and/or open sourcing at all

Harveen Singh Chadha@HarveenChadha

things are about to get interesting from here on

English

Derek Deming me-retweet

Theo - t3.gg@theo·2d

Responses to this are hilarious. Codex implemented this correctly in 15 minutes btw

Theo - t3.gg@theo

Just let Opus go for over an hour on a new feature. When it was done, I asked how I can test it. 20 minutes later, it realized I can't test it because it did the whole thing entirely wrong. Idk how you guys use this model every day for real work 🙃

English

959

143.3K

Derek Deming me-retweet

You Jiacheng@YouJiacheng·2d

HUGE if true. If true, this is probably a larger efficiency gain than ALL publicly available techniques since DeepSeekMoE(Jan 2024) COMBINED. And it can just win modded-nanogpt speedrun. (1e18 is 250s@50%MFU, but the loss is significantly lower than 3.28) cc @classiclarryd

Chen-Hao (Lance) Chao@chenhao_chao

(2/7) 💵 With training costs exceeding $100M for GPT-4, efficient alternatives matter. We show that diffusion LMs unlock a new paradigm for compute-optimal language pre-training.

English

227

48.4K

Derek Deming me-retweet

Kimi.ai@Kimi_Moonshot·6d

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

330

2.1K

13.5K

4.9M

Derek Deming me-retweet

Yulu Gan@yule_gan·13 Mar

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu

English

430

670.2K

Derek Deming@ddeming_·13 Mar

so the next $100B+ hardware cycle will be won by whoever solves the memory wall. The argument here is basically that GPUs were designed for training, not inference. The GPU/TPUs used for inference today are essentially scaled down training chips that are memory-bound and sequential all of the current industry trends (MoE, reasoning, long context, multimodal, etc.) make the memory problem even worse

Chris Laub@ChrisLaubAI

🚨 BREAKING: A Google researcher and a Turing Award winner just published a paper that exposes the real crisis in AI. It's not training. It's inference. And the hardware we're using was never designed for it. The paper is by Xiaoyu Ma and David Patterson. Accepted by IEEE Computer, 2026. No hype. No product launch. Just a cold breakdown of why serving LLMs is fundamentally broken at the hardware level. The core argument is brutal: → GPU FLOPS grew 80X from 2012 to 2022 → Memory bandwidth grew only 17X in that same period → HBM costs per GB are going UP, not down → The Decode phase is memory-bound, not compute-bound → We're building inference on chips designed for training Here's the wildest part: OpenAI lost roughly $5B on $3.7B in revenue. The bottleneck isn't model quality. It's the cost of serving every single token to every single user. Inference is bleeding these companies dry. And five trends are making it worse simultaneously: → MoE models like DeepSeek-V3 with 256 experts exploding memory → Reasoning models generating massive thought chains before answering → Multimodal inputs (image, audio, video) dwarfing text → Long-context windows straining KV caches → RAG pipelines injecting more context per request Their four proposed hardware shifts: → High Bandwidth Flash: 512GB stacks at HBM-level bandwidth, 10X more memory per node → Processing-Near-Memory: logic dies placed next to memory, not on the same chip → 3D Memory-Logic Stacking: vertical connections delivering 2-3X lower power than HBM → Low-Latency Interconnect: fewer hops, in-network compute, SRAM packet buffers Companies that tried SRAM-only chips like Cerebras and Groq already failed and had to add DRAM back. This paper doesn't sell a product. It maps the entire hardware bottleneck and says: the industry is solving the wrong problem. Paper dropped January 2026. Link in the first comment 👇

English

Derek Deming me-retweet

Adam.GPT@TheRealAdamG·10 Mar

Saint Tibo: Giver of Tokens and Reseter of Limits

Tibo@thsottiaux

This codex issue is now fully resolved and stable for the last couple of hours. You have come to expect it, but yes, that means we will be reseting rate limits in a bit. Enjoy.

English

811

32.6K

Derek Deming@ddeming_·10 Mar

Been seeing this error over the last 40 mins and typically happens after a few turns so i assume it has to do w/ compaction(?) @thsottiaux @ajambrosino

English

Derek Deming@ddeming_·8 Mar

Christmas miracle

Peter Gostev@petergostev

Omg it happened! thx @thsottiaux, was down to about 30% which was giving me anxiety

English

Derek Deming me-retweet

Avner May@avnermay·4 Mar

Excited to announce our new LLM inference algorithm, speculative speculative decoding (SSD)! It is fast 🚀 — up to 2x faster than state-of-the-art inference engines (vLLM, SGLang). Working on this with @tanishqkumar07 and @tri_dao was a blast. Details in thread:

Tanishq Kumar@tanishqkumar07

I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.

English

670

57.4K

Derek Deming me-retweet

Michael Truell@mntruell·3 Mar

We believe Cursor discovered a novel solution to Problem Six of the First Proof challenge, a set of math research problems that approximate the work of Stanford, MIT, Berkeley academics. Cursor's solution yields stronger results than the official, human-written solution. Notably, we used the same harness that built a browser from scratch a few weeks ago. It ran fully autonomously, without nudging or hints, for four days. This suggests that our technique for scaling agent coordination might generalize beyond coding.

English

264

512

8.3K

Derek Deming@ddeming_·3 Mar

At it again

Bo Wang@BoWang87

ByteDance just published something I've been waiting for someone to build: CUDA Agent! It trained a model that writes fast CUDA kernels. Not just correct ones — actually optimized ones. It beats torch.compile by 2× on simple/medium kernels, ~92% on complex ones, and even outperforms Claude Opus 4.5 and Gemini 3 Pro by ~40% on the hardest setting. The key idea is simple but kind of brilliant: CUDA performance isn’t about correctness, it’s about hardware. Warps, memory bandwidth, bank conflicts — the stuff you only see in a profiler. So instead of rewarding “did it compile?”, they reward actual GPU speed. Real profiling numbers. RL trained directly on performance. That’s a big shift. Paper: arxiv.org/abs/2602.24286 Project: cuda-agent.github.io

English

Derek Deming@ddeming_·28 Şub

gpt-5.3-codex feels to be @OpenAI equivalent of sonnet 3.5 release where we first started seeing real value accrue at the codegen layer. If this is that inflection point for agentic tasks, the upside from here will compound pretty quickly given how much is already baked in

English

Derek Deming me-retweet

Databricks AI Research@DbrxMosaicAI·27 Şub

New research from @databricks: LLMs Can Learn to Reason via Off-Policy RL Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally — and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes. arxiv.org/pdf/2602.19362

English

10.2K

Derek Deming me-retweet

BuccoCapital Bloke@buccocapital·17 Şub

Required reading. Incredible post

Nicolas Bustamante@nicbstme

x.com/i/article/2023…

English

1.3K

490.2K

Derek Deming me-retweet

Andrew Ambrosino@ajambrosino·13 Şub

New in the Codex app: - 5.3-codex-spark - Forking - Pop-out window - Mark unread - More perf and quality stuff - (a secret fun thing) - Tomorrow: first Windows alpha invites

English

106

1.1K

113K

Derek Deming@ddeming_·12 Şub

👀

Catherine Yeo@catherinehyeo

it's finally happening 🦀 (the most beautiful stack i've ever received)

ART

Derek Deming me-retweet

Maximiliano Firtman@firt·9 Şub

Chrome 146 includes an early preview of WebMCP, accessible via a flag, that lets AI agents query and execute services without browsing the web app like a user. Services can be declared through an imperative navigator.modelContext API or declaratively through a form.

English

119

374

2.8K

1.3M

Derek Deming@ddeming_·5 Şub

if you're not using codex you will be left behind 5.3-codex is FAST!! amazing work by the team

English

Derek Deming me-retweet

Vincent@vvvincent_c·5 Şub

tldr: don't read into the number at all rn! we don't either. it's supposed to calculate (avg wall-clock time per task (min)) * (total number of tasks), but a bug caused it to incorrectly include the time when a run is queued but hasn't started yet. when running the gpt-5.2 eval, we ran into many retry errors that caused many runs to stay in the queue for much longer. this queue time was counted toward their working_time which is why the gpt-5.2 numbers are so skewed. wall-clock time in general is a weird metric and we don't optimize for keeping it comparable. some reasons are - inference speed varies by demand - the total wall-clock time which includes events like waiting for long python programs to run. - gpt-5.2 (high) was run on a more token-hungry scaffold (triframe) with 16M tokens while gemini 3 pro and opus 4.5 were run on a simpler scaffold (react) with 8M tokens.

Lisan al Gaib@scaling01

GPT-5.2-high took 26 TIMES LONGER than Claude 4.5 Opus to complete the METR benchmark suite

English

291

51.1K

Jelajahi

@classiclarryd @thsottiaux @ajambrosino @tanishqkumar07 @tri_dao @OpenAI @databricks @elonmusk