Derek Deming

451 posts

@ddeming_

Building something new; engineering and infra; prev: @MSFTResearch | ex PhD (ABD / dropped out)

Boston, MA · Joined February 2024
1.8K Following · 223 Followers
Derek Deming reposted
You Jiacheng@YouJiacheng·
HUGE if true. If so, this is probably a larger efficiency gain than ALL publicly available techniques since DeepSeekMoE (Jan 2024) COMBINED. And it could just win the modded-nanogpt speedrun. (1e18 is 250s@50%MFU, but the loss is significantly lower than 3.28) cc @classiclarryd
Chen-Hao (Lance) Chao@chenhao_chao

(2/7) 💵 With training costs exceeding $100M for GPT-4, efficient alternatives matter. We show that diffusion LMs unlock a new paradigm for compute-optimal language pre-training.

7 replies · 13 reposts · 227 likes · 48.4K views
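The throughput figure in the quoted claim ("1e18 is 250s@50%MFU") can be sanity-checked with one line of arithmetic. The implied peak throughput below is derived, not stated in the thread:

```python
# Sanity check of "1e18 FLOPs in 250 s at 50% MFU":
# wall time = total_flops / (mfu * peak_flops), so the hardware assumption is
total_flops = 1e18
mfu = 0.5
seconds = 250
implied_peak = total_flops / (mfu * seconds)  # peak FLOP/s of the assumed node
# 8e15 FLOP/s, i.e. an 8 PFLOP/s node, is what makes "250 s" come out
```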
Derek Deming reposted
Kimi.ai@Kimi_Moonshot·
Introducing Attention Residuals: rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
Kimi.ai tweet media
330 replies · 2.1K reposts · 13.5K likes · 4.9M views
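The core mechanism described in the tweet (learned, input-dependent attention over preceding layers in place of a uniform residual sum) can be sketched in a few lines of NumPy. This is a toy single-head sketch, not the released code: `w_q` and `w_k` stand in for hypothetical learned projections, and the Block AttnRes partitioning is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, w_q, w_k):
    """Aggregate preceding layers with learned attention instead of summation.

    layer_outputs: list of (d,) hidden vectors from layers 1..L
    w_q, w_k: (d, d) projection matrices (learned in the real model)
    """
    H = np.stack(layer_outputs)            # (L, d) depth-wise "memory"
    q = w_q @ H[-1]                        # query from the current layer
    k = H @ w_k.T                          # one key per preceding layer
    scores = k @ q / np.sqrt(len(q))       # scaled dot-product over depth
    w = softmax(scores)                    # input-dependent depth weights
    return w @ H                           # selective retrieval, not uniform sum
```

A plain residual stream would use fixed weights of 1 for every layer; here the weights are recomputed per input, which is what lets the network avoid dilution of early representations.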
Derek Deming reposted
Yulu Gan@yule_gan·
Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.

To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs.

What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets.

Paper: arxiv.org/pdf/2603.12228
Code: github.com/sunrainyg/Rand…
Website: thickets.mit.edu
Yulu Gan tweet media
87 replies · 430 reposts · 3K likes · 670.1K views
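The one-step perturb-and-ensemble recipe the tweet describes reduces to very little code. This is a toy sketch on generic parameter dicts, not the released RandOpt implementation; `sigma`, the parameter layout, and the forward function are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(params, sigma):
    # One-step Gaussian perturbation of every weight tensor:
    # no iterations, no learning rate, no gradients.
    return {k: v + sigma * rng.standard_normal(v.shape) for k, v in params.items()}

def ensemble_predict(models, x, forward):
    # Average the predictions of K independently perturbed copies.
    return np.mean([forward(m, x) for m in models], axis=0)
```

The tweet's "Neural Thickets" claim is the reason this can work at all: if task experts are dense in the Gaussian neighborhood of the pretrained weights, random draws plus ensembling will hit some of them.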
Derek Deming@ddeming_·
so the next $100B+ hardware cycle will be won by whoever solves the memory wall. The argument here is basically that GPUs were designed for training, not inference. The GPUs/TPUs used for inference today are essentially scaled-down training chips that are memory-bound and sequential. All of the current industry trends (MoE, reasoning, long context, multimodal, etc.) make the memory problem even worse
Chris Laub@ChrisLaubAI

🚨 BREAKING: A Google researcher and a Turing Award winner just published a paper that exposes the real crisis in AI. It's not training. It's inference. And the hardware we're using was never designed for it.

The paper is by Xiaoyu Ma and David Patterson. Accepted by IEEE Computer, 2026. No hype. No product launch. Just a cold breakdown of why serving LLMs is fundamentally broken at the hardware level.

The core argument is brutal:
→ GPU FLOPS grew 80X from 2012 to 2022
→ Memory bandwidth grew only 17X in that same period
→ HBM costs per GB are going UP, not down
→ The decode phase is memory-bound, not compute-bound
→ We're building inference on chips designed for training

Here's the wildest part: OpenAI lost roughly $5B on $3.7B in revenue. The bottleneck isn't model quality. It's the cost of serving every single token to every single user. Inference is bleeding these companies dry.

And five trends are making it worse simultaneously:
→ MoE models like DeepSeek-V3 with 256 experts exploding memory
→ Reasoning models generating massive thought chains before answering
→ Multimodal inputs (image, audio, video) dwarfing text
→ Long-context windows straining KV caches
→ RAG pipelines injecting more context per request

Their four proposed hardware shifts:
→ High Bandwidth Flash: 512GB stacks at HBM-level bandwidth, 10X more memory per node
→ Processing-Near-Memory: logic dies placed next to memory, not on the same chip
→ 3D Memory-Logic Stacking: vertical connections delivering 2-3X lower power than HBM
→ Low-Latency Interconnect: fewer hops, in-network compute, SRAM packet buffers

Companies that tried SRAM-only chips like Cerebras and Groq already failed and had to add DRAM back.

This paper doesn't sell a product. It maps the entire hardware bottleneck and says: the industry is solving the wrong problem. Paper dropped January 2026. Link in the first comment 👇

1 reply · 0 reposts · 0 likes · 56 views
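The "decode is memory-bound" point above reduces to an arithmetic-intensity calculation. The accelerator numbers below (1 PFLOP/s peak, 3 TB/s of HBM bandwidth, a 70B dense model) are illustrative assumptions, not figures from the paper:

```python
# Decode-phase arithmetic intensity for a dense fp16 model at batch size 1:
params = 70e9                        # illustrative 70B-parameter model
bytes_per_token = 2 * params         # every fp16 weight is read once per token
flops_per_token = 2 * params         # one multiply-accumulate per weight
intensity = flops_per_token / bytes_per_token   # = 1 FLOP per byte moved

# An accelerator with 1e15 FLOP/s and 3e12 B/s only reaches peak compute above
# ~333 FLOP/byte, so single-stream decode sits deep in the memory-bound regime:
machine_balance = 1e15 / 3e12
```

This is why the FLOPS-vs-bandwidth growth gap (80X vs 17X) cited in the quoted tweet matters: every generation of hardware raises the balance point while decode stays stuck near 1 FLOP/byte.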
Derek Deming@ddeming_·
Been seeing this error over the last 40 mins; it typically happens after a few turns, so I assume it has to do w/ compaction(?) @thsottiaux @ajambrosino
Derek Deming tweet media
0 replies · 0 reposts · 0 likes · 80 views
Derek Deming reposted
Avner May@avnermay·
Excited to announce our new LLM inference algorithm, speculative speculative decoding (SSD)! It is fast 🚀 — up to 2x faster than state-of-the-art inference engines (vLLM, SGLang). Working on this with @tanishqkumar07 and @tri_dao was a blast. Details in thread:
Tanishq Kumar@tanishqkumar07

I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.

19 replies · 44 reposts · 670 likes · 57.4K views
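The thread gives no algorithmic details of SSD itself, but methods in this family build on the standard speculative-decoding accept/reject step (Leviathan et al. style): a cheap draft model proposes tokens, the target model verifies them, and rejected proposals are resampled from the residual distribution. A toy sketch with NumPy-only distributions, purely as background:

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_decode_step(draft_probs, target_probs, tokens):
    """One verification pass over a batch of draft-proposed tokens.

    draft_probs / target_probs: per-position probability vectors
    tokens: the tokens the draft model sampled
    Returns the tokens accepted (plus one corrected token on rejection).
    """
    accepted = []
    for t, (p, q) in zip(tokens, zip(target_probs, draft_probs)):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)               # target agrees often enough: keep
        else:
            residual = np.maximum(p - q, 0)  # resample where target exceeds draft
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break                            # later draft tokens are discarded
    return accepted
```

The accept/resample rule preserves the target model's exact output distribution, which is why speculative methods give speedups without quality loss.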
Derek Deming reposted
Michael Truell@mntruell·
We believe Cursor discovered a novel solution to Problem Six of the First Proof challenge, a set of math research problems that approximate the work of Stanford, MIT, and Berkeley academics. Cursor's solution yields stronger results than the official, human-written solution. Notably, we used the same harness that built a browser from scratch a few weeks ago. It ran fully autonomously, without nudging or hints, for four days. This suggests that our technique for scaling agent coordination might generalize beyond coding.
264 replies · 513 reposts · 8.3K likes · 1M views
Derek Deming@ddeming_·
gpt-5.3-codex feels like @OpenAI's equivalent of the sonnet 3.5 release, where we first started seeing real value accrue at the codegen layer. If this is that inflection point for agentic tasks, the upside from here will compound pretty quickly given how much is already baked in
0 replies · 0 reposts · 1 like · 85 views
Derek Deming reposted
Databricks AI Research@DbrxMosaicAI·
New research from @databricks: LLMs Can Learn to Reason via Off-Policy RL

Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don't need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations.

It's a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

arxiv.org/pdf/2602.19362
1 reply · 15 reposts · 82 likes · 10.2K views
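OAPL's specifics are in the linked paper, but the GRPO baseline it is compared against is easy to state: advantages are computed by normalizing rewards within the group of samples drawn for the same prompt, with no learned value function. A minimal sketch:

```python
import numpy as np

def grpo_advantages(rewards):
    # Group Relative Policy Optimization advantage: z-score each sample's
    # reward against the other samples generated for the same prompt.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The "~3× fewer training generations" claim is about exactly this cost: every GRPO update needs a fresh group of on-policy generations per prompt, which an off-policy method with a lagged inference policy can amortize.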
Derek Deming reposted
Andrew Ambrosino@ajambrosino·
New in the Codex app:
- 5.3-codex-spark
- Forking
- Pop-out window
- Mark unread
- More perf and quality stuff
- (a secret fun thing)
- Tomorrow: first Windows alpha invites
106 replies · 32 reposts · 1.1K likes · 113K views
Derek Deming reposted
Maximiliano Firtman@firt·
Chrome 146 includes an early preview of WebMCP, accessible via a flag, that lets AI agents query and execute services without browsing the web app like a user. Services can be declared through an imperative navigator.modelContext API or declaratively through a form.
Maximiliano Firtman tweet media
119 replies · 375 reposts · 2.8K likes · 1.3M views
Derek Deming@ddeming_·
if you're not using codex you will be left behind. 5.3-codex is FAST!! amazing work by the team
0 replies · 0 reposts · 0 likes · 39 views
Derek Deming reposted
Vincent@vvvincent_c·
tldr: don't read into the number at all rn! we don't either.

it's supposed to calculate (avg wall-clock time per task (min)) * (total number of tasks), but a bug caused it to incorrectly include the time when a run is queued but hasn't started yet. when running the gpt-5.2 eval, we ran into many retry errors that caused many runs to stay in the queue for much longer. this queue time was counted toward their working_time, which is why the gpt-5.2 numbers are so skewed.

wall-clock time in general is a weird metric and we don't optimize for keeping it comparable. some reasons:
- inference speed varies by demand
- total wall-clock time includes events like waiting for long python programs to run
- gpt-5.2 (high) was run on a more token-hungry scaffold (triframe) with 16M tokens, while gemini 3 pro and opus 4.5 were run on a simpler scaffold (react) with 8M tokens
Lisan al Gaib@scaling01

GPT-5.2-high took 26 TIMES LONGER than Claude 4.5 Opus to complete the METR benchmark suite

11 replies · 17 reposts · 291 likes · 51.1K views
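The intended metric and the queue-time bug described in the reply can be written down directly. Only the formula comes from the thread; the minute counts in the test below are hypothetical:

```python
def working_time(avg_minutes_per_task, n_tasks):
    # Intended metric: (avg wall-clock time per task, in minutes) * (total tasks).
    return avg_minutes_per_task * n_tasks

def buggy_working_time(avg_run_minutes, avg_queue_minutes, n_tasks):
    # The reported bug: time a run spent queued-but-not-started leaked into the
    # per-task average, so retry-heavy evals report wildly inflated totals.
    return (avg_run_minutes + avg_queue_minutes) * n_tasks
```

With heavy retries, queue time can dwarf actual run time, which is how one model's total can come out an order of magnitude larger without its per-task work changing at all.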