b/acc, context platform engineer

227.1K posts


@AccBalanced

AI Factories. Balanced Accelerationist. WEKA CAIO, CNCF Kubernetes founding board, Post-PKI.

Seat 14D · Joined July 2008
8.1K Following · 8.7K Followers
Pinned Tweet
b/acc, context platform engineer
AI Context Memory/Storage makes Claude Code & OpenClaw work. It's also the fastest-growing segment of the hot trillion-dollar AI infra industry. Jensen calls it the biggest storage market ever. Micron estimates it at 100 exabytes 🤯 @vikramskr and I break it all down on @semidoped 🎤
Semi Doped Podcast@semidoped

Context memory essentially unlocks Agentic AI. Much needed for Opus 4.6's "multi-agent swarms".
In this SemiDoped pod, @vikramskr talks to Val Bercovici from Weka about context storage.
- How token warehouses save inference costs
- A new networking tier? Context Storage Network!
- High Bandwidth Flash for context?
- Weka's Augmented Memory Grid for context storage
- Where this is all headed
The convo is info-packed. Don't miss out on it! @AccBalanced
Chapters:
(00:00) Introduction to Weka and AI Storage Solutions
(05:18) The Evolution of Context Memory in AI
(09:30) Understanding Memory Hierarchies and Their Impact
(16:24) Latency Challenges in Modern Storage Solutions
(21:32) The Role of Networking in AI Storage Efficiency
(29:42) Dynamic Resource Utilization in AI Networks
(30:04) Introducing the Context Memory Network
(31:13) High Bandwidth Flash: A Game Changer
(32:54) Weka's Neural Mesh and Storage Solutions
(35:01) Axon: Transforming GPU Storage into Memory
(39:00) Augmented Memory Grid Explained
(42:00) Pooling DRAM and CXL Innovations
(46:02) Token Warehouses and Inference Economics
(52:10) The Future of Storage Innovations
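The token-warehouse economics the pod discusses come down to a simple trade: reloading a stored KV cache from fast storage versus recomputing it on the GPU. A back-of-envelope sketch; every throughput and model-size figure below is an illustrative assumption, not a number from the episode:

```python
# Rough cost comparison (hypothetical numbers): recomputing a long
# prompt's KV cache via prefill vs. reloading it from storage.
prompt_tokens = 100_000
gpu_prefill_tok_per_s = 10_000     # assumed prefill throughput
storage_gbps = 40                  # assumed storage read bandwidth, GB/s
kv_bytes_per_token = 327_680       # assumed fp16 KV footprint per token

# Time to regenerate the cache by re-running prefill on the GPU.
recompute_s = prompt_tokens / gpu_prefill_tok_per_s

# Time to stream the same cache back in from a storage tier.
reload_s = prompt_tokens * kv_bytes_per_token / (storage_gbps * 1e9)

print(f"recompute: {recompute_s:.1f}s  reload: {reload_s:.2f}s")
```

Even with conservative assumed bandwidth, reloading a long prompt's cache is roughly an order of magnitude faster (and cheaper in GPU-seconds) than re-running prefill, which is the core of the inference-cost argument for token warehouses.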

English
1
2
12
3.3K
b/acc, context platform engineer retweeted
vLLM
vLLM@vllm_project·
vLLM v0.18.0 is out! 445 commits from 213 contributors (61 new). 🎉 What's new: gRPC serving, GPU-less multimodal render, NGram spec decode on GPU, Elastic EP Milestone 2, FlashInfer 0.6.6, Responses API streaming tool calls. Thread 👇
vLLM tweet media
English
4
8
78
3K
b/acc, context platform engineer retweeted
Jordan Nanos
Jordan Nanos@JordanNanos·
Yesterday we open sourced the InferenceX webapp. Hopefully this makes it easier to analyze InferenceX data, and provides a simple way for accelerator startups and alternative runtimes to compare directly to the industry standards and make performance claims + forecasts. It also means we can more easily handle feature requests from the community for the public dashboard. github.com/SemiAnalysisAI…
English
2
9
29
5.1K
b/acc, context platform engineer retweeted
Ksenia_TuringPost
Ksenia_TuringPost@TheTuringPost·
AI is already redesigning chip design itself! And the biggest bottleneck left is validation. Here is Bill Dally describing to @JeffDean how @nvidia uses AI to design chips:

“We’re already using AI across multiple parts of the chip design process, and it’s delivering real gains. Take standard cell design as an example. Every time we move to a new process, we have to port thousands of cells. That used to take a team of eight engineers around 8–10 months. Now we use a system we built called NVCell, and it can do that work overnight. The results match or exceed human designs across key metrics like size, power, and delay. It’s a massive productivity boost and removes a major bottleneck when moving to new process nodes.

We’re also using reinforcement learning in tools like Prefix RL to tackle classic computer design problems, such as optimizing carry chains. This is a problem people have worked on since the 1950s. The RL system explores the space like a game, evaluating its own designs. Instead of maximizing raw speed, it aims to meet timing constraints while minimizing area and power. It often produces unconventional designs that no human would consider, yet they outperform human solutions by 20–30% on those metrics.

At a higher level, we’ve built internal LLMs like ChipNeMo and BugNeMo. These models are trained on our internal design corpus: RTL, architecture specs, documentation. They effectively act as expert assistants. One of the biggest benefits is reducing the load on senior engineers. Instead of repeatedly answering basic questions, junior engineers can query the model and get detailed, interactive explanations. It’s like having an infinitely patient mentor.

We also use these models for debugging. They can summarize bug reports, help with attribution, and suggest which module or engineer should take ownership. That speeds up the whole debugging loop.

On the exploratory side, we’re using generative methods to run large numbers of design experiments. We can explore different architectural directions, map parameter spaces, and quickly evaluate new ideas. The goal is to compress the time between early exploration and final design.

One of the biggest bottlenecks today is design validation. We’d like to use AI to prove correctness much faster. There are also stages where we need to refactor designs, for example when repartitioning logic during floorplanning. Those transitions are complex and error-prone, and they’re good candidates for automation.

Now, the dream would be fully end-to-end automation: you specify a new GPU, go skiing for a few days, and come back to a finished design. We’re nowhere near that yet. But across the pipeline, AI is already making the process significantly faster and more productive.”
Ksenia_TuringPost@TheTuringPost

At this nerdiest of all nerdy sessions 💞, Jeff Dean said he doesn’t think we’re running out of data.

“I think there’s still an enormous amount of data in the world that we haven’t really used yet for training these models. We train on some video data, for example, but there’s a lot more video out there, along with associated audio, that we’re not necessarily making full use of yet. I also think real-world robotics data, and autonomous vehicle data, is going to be fairly plentiful.

And then synthetic data is another resource. If you can generate really interesting, high-quality data, then you can effectively inject more compute and get more training data that way. Now, of course, there’s a reasonable question here: aren’t you eventually just regurgitating the same stuff? If you train on data, then use that model to generate synthetic data, are you just making another version of what you already had? Maybe to some extent. But I still think it can help, especially if the model generating the synthetic data is itself very powerful. At least so far, that does seem to be useful.

And beyond that, there are also a lot of techniques we’re not really using much right now that used to be very common in other domains, like convolutional image models years ago. Things like data augmentation are interesting. That’s one way to think about synthetic data. Techniques to prevent overfitting are also interesting. You can use dropout, distillation, and other forms of regularization. So I think there’s still a lot of opportunity to make models better with more compute and more passes over the data, without necessarily running into overfitting.”

A fascinating conversation between @JeffDean and @BillDally @NVIDIAGTC

English
5
15
75
10K
b/acc, context platform engineer retweeted
Meg McNulty
Meg McNulty@meggmcnulty·
NVIDIA agreed to pay $20 billion to license Groq's technology and hire its team. That tells you everything about where single-workload AI chips are headed. Groq built its architecture around deterministic execution. Every operation is statically scheduled before runtime. All model weights live in on-chip SRAM. For dense transformer inference on a standard model, the latency was extraordinary. But the workloads moved. Mixture-of-experts models like DeepSeek-V3 activate only a fraction of their parameters per token. Inference-time compute scaling means different queries need wildly different amounts of compute. Static scheduling is the opposite of what you want when the workload is variable by design. ASICs take two to three years from architecture freeze to first silicon. Model architectures are now shifting faster than that. A chip designed for dense matrix multiply does not help when the workload is sparse routing across 256 experts. You can read the NVIDIA-Groq deal as a vote of confidence in inference IP. You can also read it as the market telling you that the standalone path for a fixed-workload ASIC was narrowing so fast that selling was the better move. Which reading do you think is right?
Meg McNulty tweet media
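The variable-compute point above is easiest to see in top-k expert routing. A minimal sketch of mixture-of-experts gating (illustrative only, not Groq's or DeepSeek-V3's actual scheme): each token's router activates only a handful of experts, so the set of weights touched changes token to token, which is what defeats a fully static schedule.

```python
# Minimal top-k MoE routing sketch. Expert count and k are
# illustrative; real models add load balancing, shared experts, etc.
import math
import random

random.seed(0)
n_experts, top_k = 256, 8

def route(logits):
    # Softmax over router logits, then keep the k highest-probability
    # experts. Only those experts' weights are read for this token.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sorted(range(n_experts), key=lambda i: -probs[i])[:top_k]

# Each token gets fresh router logits, hence a different expert set.
token_logits = [random.gauss(0, 1) for _ in range(n_experts)]
active = route(token_logits)
print(f"{len(active)} of {n_experts} experts active for this token")
```

With 8 of 256 experts active, only ~3% of the expert parameters are touched per token, and which 3% is data-dependent: exactly the sparse, runtime-determined access pattern that a statically scheduled, all-weights-in-SRAM design was not built for.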
English
0
1
8
388
b/acc, context platform engineer retweeted
Roger Wang
Roger Wang@rogerw0108·
So excited to see more research and breakthroughs in omni-modality! This is exactly why we are building vLLM-Omni github.com/vllm-project/v… for the next generation of Intelligence!🚀
Fuli Luo@_LuoFuli

MiMo-V2-Pro & Omni & TTS is out. Our first full-stack model family built truly for the Agent era.

I call this a quiet ambush — not because we planned it, but because the shift from Chat to Agent paradigm happened so fast, even we barely believed it. Somewhere in between was a process that was thrilling, painful, and fascinating all at once.

The 1T base model started training months ago. The original goal was long-context reasoning efficiency. Hybrid Attention carries real innovation, without overreaching — and it turns out to be exactly the right foundation for the Agent era. 1M context window. MTP inference for ultra-low latency and cost. These architectural decisions weren't trendy. They were a structural advantage we built before we needed it.

What changed everything was experiencing a complex agentic scaffold — what I'd call orchestrated Context — for the first time. I was shocked on day one. I tried to convince the team to use it. That didn't work. So I gave a hard mandate: anyone on MiMo Team with fewer than 100 conversations tomorrow can quit. It worked. Once the team's imagination was ignited by what agentic systems could do, that imagination converted directly into research velocity.

People ask why we move so fast. I saw it firsthand building DeepSeek R1. My honest summary:
— Backbone and Infra research has long cycles. You need strategic conviction a year before it pays off.
— Posttrain agility is a different muscle: product intuition driving evaluation, iteration cycles compressed, paradigm shifts caught early.
— And the constant: curiosity, sharp technical instinct, decisive execution, full commitment — and something that's easy to underestimate: a genuine love for the world you're building for.

We will open-source — when the models are stable enough to deserve it. From Beijing, very late, not quite awake.

English
0
6
17
2.9K
b/acc, context platform engineer retweeted
vLLM
vLLM@vllm_project·
Great to see @AMD select vLLM as one of the designated inference frameworks for the GPU MODE Hackathon. 🎉 The challenge: push Kimi K2.5 1T FP4 end-to-end inference performance on 8× AMD Instinct MI355X — using vLLM or AMD ATOM. Grand prize: $650,000. What makes this different: winning optimizations must be mergeable into AMD ATOM or vLLM upstream. Improvements that land in vLLM benefit the whole community. Phase 1 (kernel optimization) runs through April 6. More details ⬇️
AMD@AMD

Join the GPU MODE Hackathon, sponsored by AMD, and push the boundaries of LLM inference performance on leading open models, optimized for AMD Instinct MI355X GPUs. Finalists will compete for the $1.1M total cash prize pool across two independent tracks, each focused on a specific model and inference stack. Learn more and get registered here: luma.com/cqq4mojz

English
3
19
125
15.3K
b/acc, context platform engineer
@karpathy So you’re telling me, all I have to do, is fundamentally reshape the biggest industry in the world, about four or five times, and you just get one of these?
GIF
English
0
0
0
29
Andrej Karpathy
Andrej Karpathy@karpathy·
Thank you Jensen and NVIDIA! She’s a real beauty! I was told I’d be getting a secret gift, with a hint that it requires 20 amps. (So I knew it had to be good). She’ll make for a beautiful, spacious home for my Dobby the House Elf claw, among lots of other tinkering, thank you!!
NVIDIA AI Developer@NVIDIAAIDev

🙌 Andrej Karpathy’s lab has received the first DGX Station GB300 -- a Dell Pro Max with GB300. 💚 We can't wait to see what you’ll create @karpathy! 🔗 blogs.nvidia.com/blog/gtc-2026-… @DellTech

English
514
828
18.9K
966.4K
b/acc, context platform engineer
@TheTuringPost Networked SRAM has capacity limits, so it remains to be seen how cost-effectively this multi-LPU configuration scales for coding agents and other highly KV-cacheable AI workloads.
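The SRAM-capacity concern is easy to quantify with a rough KV-cache sizing sketch. The dimensions below are assumed values for a generic 70B-class grouped-query-attention model, not any vendor's published figures:

```python
# Back-of-envelope KV cache size for a hypothetical 70B-class model
# with grouped-query attention (all dimensions are assumptions).
layers = 80          # transformer layers
kv_heads = 8         # grouped-query KV heads
head_dim = 128       # dimension per head
bytes_per = 2        # fp16

# K and V each store kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per

context_tokens = 128_000
cache_gb = kv_bytes_per_token * context_tokens / 1e9

print(f"{kv_bytes_per_token} bytes/token, "
      f"{cache_gb:.1f} GB for a {context_tokens:,}-token context")
```

At tens of gigabytes per long-context session, a single coding agent's KV cache dwarfs on-chip SRAM measured in hundreds of megabytes per chip, which is why KV-heavy agent workloads push toward external memory and storage tiers rather than SRAM alone.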
English
0
0
0
51
Ksenia_TuringPost
Ksenia_TuringPost@TheTuringPost·
Straight from NVIDIA GTC: Jensen Huang just unveiled a new vision for AI infrastructure For the first time, Rubin GPUs+Groq LPUs are paired: > 35× higher inference throughput > 10× more revenue from trillion-parameter models Architecture & why it's needed
English
4
9
33
3K
b/acc, context platform engineer retweeted
Logan Kilpatrick
Logan Kilpatrick@OfficialLoganK·
The bottleneck has so quickly moved from code generation to code review that it is actually a bit jarring. None of the current systems / norms are set up for this world yet.
English
380
185
4.1K
515.8K
b/acc, context platform engineer retweeted
OpenAI Developers
OpenAI Developers@OpenAIDevs·
Subagents are now available in Codex. You can accelerate your workflow by spinning up specialized agents to: • Keep your main context window clean • Tackle different parts of a task in parallel • Steer individual agents as work unfolds
English
425
765
8K
1.5M
b/acc, context platform engineer retweeted
Charly Wargnier
Charly Wargnier@DataChaz·
THIS is the wildest open-source project I’ve seen this month.
We were all hyped about @karpathy's autoresearch project automating the experiment loop a few weeks ago. (ICYMI → github.com/karpathy/autor…)
But a bunch of folks just took it ten steps further and automated the entire scientific method end-to-end.
It's called AutoResearchClaw, and it's fully open-source. You pass it a single CLI command with a raw idea, and it completely takes over 🤯
The 23-stage loop they designed is insane:
✦ First, it handles the literature review.
- It searches arXiv and Semantic Scholar for real papers
- Cross-references them against DataCite and CrossRef.
- No fake papers make it through.
✦ Second, it runs the sandbox.
- It generates the code from scratch.
- If the code breaks, it self-heals.
- You don't have to step in.
✦ Finally, it writes the paper.
- It structures 5,000+ words into Introduction, Related Work, Method, and Experiments.
- Formats the math, generates the comparison charts,
- Then wraps the whole thing in official ICML or ICLR LaTeX templates.
You can set it to pause for human approval, or you can just pass the --auto-approve flag and walk away.
What it spits out at the end:
→ Full academic paper draft
→ Conference-grade .tex files
→ Verified, hallucination-free citations
→ All experiment scripts and sandbox results
This is what autonomous AI agents actually look like in 2026. Free and open-source. Link to repo in 🧵 ↓
Charly Wargnier tweet media
English
78
382
2.4K
206.7K
b/acc, context platform engineer retweeted
vLLM
vLLM@vllm_project·
🎉 Congrats to @MistralAI on releasing Mistral Small 4 — a 119B MoE model (6.5B active per token) that unifies instruct, reasoning, and coding in one checkpoint. Multimodal, 256K context. Day-0 support in vLLM — MLA attention backend, tool calling, and configurable reasoning mode, verified on @nvidia GPUs. 🔗 huggingface.co/mistralai/Mist…
vLLM tweet media
Mistral AI for Developers@MistralDevs

🔥 Meet Mistral Small 4: One model to do it all. ⚡ 128 experts, 119B total parameters, 256k context window ⚡ Configurable Reasoning ⚡ Apache 2.0 ⚡ 40% faster, 3x more throughput Our first model to unify the capabilities of our flagship models into a single, versatile model.

English
7
37
384
28.7K
b/acc, context platform engineer retweeted
Alex Woodie
Alex Woodie@alex_woodie·
AI will push $1 trillion in hardware spending through 2027, @Nvidia CEO Jensen Huang says at #GTC26
Alex Woodie tweet media
English
1
1
2
173
b/acc, context platform engineer
“The inference inflection has arrived” -Jensen @ GTC26
English
1
0
2
68
b/acc, context platform engineer retweeted
vLLM
vLLM@vllm_project·
@vllm_project spotted at GTC 2026!🔥
vLLM tweet media
English
4
8
101
4K
b/acc, context platform engineer retweeted
Dylan Patel
Dylan Patel@dylan522p·
Deepseek v4 still not released
Alibaba Qwen going closed
Western open weights models slacking
In these dark times for open source, who will save us? Alliances must be made, brothers must band together! A world of only closed source AI will lead to consolidation of power! Tyranny!
English
110
56
1.2K
139.4K
b/acc, context platform engineer retweeted