arch rock

3.3K posts

@silverhawk_ny

Sunnyvale, CA · Joined April 2009

662 Following · 86 Followers

arch rock@silverhawk_ny·10h

@FireworksAI_HQ very nice deep dive

Fireworks AI@FireworksAI_HQ·23h

x.com/i/article/2048…

133

67.7K

arch rock reposted

Paul@paulWilliamChan·1d

I wrote a matmul kernel on B200 in pure CUDA/PTX that beats cuBLAS by 6% at M=N=K=8192. Inspired by @gaunernst's blog on Blackwell instructions with benchmarking done on @modal. Blog: paulwillchan.com/articles/outpe… Repo: github.com/Better-Call-Pa…
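
To put the "6% at M=N=K=8192" claim above in concrete terms, here is a minimal sketch of how matmul throughput is usually computed. The two timings are hypothetical placeholders chosen only to produce a 6% gap; they are not measurements from the post or its blog.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B, counting each multiply-add as 2 operations."""
    return 2 * m * n * k


def tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for one matmul that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """How much faster the candidate is than the baseline, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


size = 8192
flops = matmul_flops(size, size, size)

# Hypothetical timings (assumptions, not measured values).
baseline_seconds = 1.06e-3
candidate_seconds = 1.00e-3

print(f"FLOPs per matmul: {flops:,}")
print(f"baseline:  {tflops(size, size, size, baseline_seconds):.1f} TFLOP/s")
print(f"candidate: {tflops(size, size, size, candidate_seconds):.1f} TFLOP/s")
print(f"speedup:   {speedup_percent(baseline_seconds, candidate_seconds):.1f}%")
```

At this size a single matmul is about 1.1 trillion floating-point operations, so a 6% gap corresponds to only tens of microseconds per call.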

608

26.8K

arch rock reposted

卡比卡比@jakevin7·18h

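Long Context is everything. Times have changed: long-context efficiency has become the first principle. The latest DeepSeek V4 has dropped MLA, and the paper does not even explain why. I still remember the stir DeepSeek V2 caused when it introduced MLA. V4's key attention design is now a hybrid of CSA + HCA, and compression along the sequence dimension has become the question people care about more. In essence, the center of gravity of attention architecture has shifted: what people focus on now is long-context efficiency. MLA's concern, compressing the per-token cost, is no longer the most important problem. At 1M context, the bottleneck is not only how much is stored per token but that too many tokens take part in the computation. The problem has moved from the width of the KV to the length of the sequence. Model vendors have in fact had a clear internal roadmap for this. DeepSeek's evolution, for example, looks fairly clear: MLA → DSA / DeepSeek Sparse Attention → NSA / Native Sparse Attention → V4's CSA + HCA. DSA was introduced on top of V3.1-Terminus, including the lightning indexer and fine-grained token selection, and at the time it was "DSA instantiated under MLA". V4 now introduces hybrid attention: Compressed Sparse Attention + Heavily Compressed Attention. V4 does still keep MLA-like designs such as low-rank compression, shared KV, and latent query. While researching this I also found a repository that collects long-context papers: github.com/Xnhyacinth/Awe…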
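
The "KV width versus sequence length" point above can be illustrated with a small back-of-the-envelope sketch. Every number here (layer count, bytes per token, compression ratio) is a hypothetical assumption for illustration and does not describe DeepSeek V4 or any real model.

```python
def kv_cache_bytes(seq_len: int, num_layers: int,
                   bytes_per_token_per_layer: int,
                   seq_compression: int = 1) -> int:
    """Cache size when one entry is kept per `seq_compression` token positions."""
    stored_positions = -(-seq_len // seq_compression)  # ceiling division
    return stored_positions * num_layers * bytes_per_token_per_layer


def keys_per_query(seq_len: int, seq_compression: int = 1) -> int:
    """Number of cached entries a single query has to score against."""
    return -(-seq_len // seq_compression)


SEQ_LEN = 1_000_000   # the 1M-context case from the post
LAYERS = 60           # hypothetical
WIDTH = 1024          # hypothetical bytes per token per layer

baseline = kv_cache_bytes(SEQ_LEN, LAYERS, WIDTH)
narrower = kv_cache_bytes(SEQ_LEN, LAYERS, WIDTH // 2)              # shrink width only
shorter = kv_cache_bytes(SEQ_LEN, LAYERS, WIDTH, seq_compression=4)  # shrink length only

print(f"baseline: {baseline / 1e9:.2f} GB, {keys_per_query(SEQ_LEN):,} keys per query")
print(f"narrower: {narrower / 1e9:.2f} GB, {keys_per_query(SEQ_LEN):,} keys per query")
print(f"shorter:  {shorter / 1e9:.2f} GB, {keys_per_query(SEQ_LEN, 4):,} keys per query")
```

Halving the width halves memory but leaves every query scoring against a million entries, while compressing along the sequence reduces both memory and the number of entries each query touches.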

198

31.4K

arch rock@silverhawk_ny·2d

@myanTokenGeek An interesting angle

孟岩-Mike Meng@myanTokenGeek·2d

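Silicon Valley travelogues are usually of little value; they tend to reek of Granny Liu touring the Grand View Garden and then going back to the village to brag. But this one is actually quite good. A few topics deserve a deeper look.

Silicon Valley is a place that gathers the smartest people in the world but has little wisdom. Taken individually, each of these smart people is exceptional, but the field they form together means they can only compete frantically and charge down a single road regardless of everything else. Whether that road leads to heaven or hell, they have no time to consider, and considering it would be pointless anyway; it is out of their hands. The article mentions CEOs starting to turn their homes into fortresses, which shows they already know they are most likely heading into purgatory.

Most of the article is really about one thing: speed gaps. It mentions a phenomenon: production-side efficiency has improved 10x or 100x, but actual revenue growth? A few tens of percent, doubling at most. What this reflects is the gap among five speeds. I believe the gaps among these five speeds will shape the AI industry, and even the global economy, for the next decade or more; this is one of the frameworks I currently use to understand the AI revolution. The five speeds are:
1. The speed at which AI inference applications expand on the production side
2. The speed of capital expansion and reallocation
3. The speed at which AI model intelligence improves
4. The speed of hardware production and infrastructure build-out
5. The speed of market demand growth

These five speeds fall sharply from 1 to 5, producing huge gaps. Something counterintuitive that many people overlook: in this game it is not the fast beating the slow but the slow beating the fast. Faster is not stronger; slower is stronger. The slower the speed, the greater the power. Why does NVIDIA now take 85% of the whole industry's profit? Because hardware innovation cycles are hundreds of times slower than AI applications, which made hardware the industry bottleneck and gave NVIDIA a historic opening for a temporary monopoly. So today vertical AI application builders beg VCs for investment, VCs beg the large-model companies for a chance, the large-model companies beg NVIDIA for GPUs, and Jensen Huang's worst nightmare is market demand failing to keep up and the AI bubble collapsing.

On X, some people are excited every day about their progress and efficiency gains yet never talk about output, and that is why I often respond coldly. I am studying AI very seriously myself, but at my age I really cannot respect people who work hard blindly without seeing the road. Those grinding on production efficiency are the lowest-end, weakest, and least powerful players in this game. As the article notes, even Silicon Valley programmers who burn hundreds or thousands of dollars of tokens a day and trade cutting-edge vibe coding experience the moment it appears face a 90% chance of losing their jobs, so what are the rest of you on the periphery grinding for? To put it especially harshly, it is like sticking a for-sale tag on your own head. You might as well come trade coins with us in crypto; don't look down on gamblers, because the game you are playing has a much lower success rate than trading coins. If you want a game with more certainty, you still have to work on the slowest, fifth-tier track and get close to people.

That's all I'll say. Go read the original yourselves, then come back and consider whether I have a point.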

Colin Wu@colinwu

217

62.1K

arch rock reposted

Ahmad@TheAhmadOsman·2d

Let's dive deeper. Do you know that 75% of Qwen 3.5 27B's layers are DeltaNet (linear attention) and not softmax / full attention? Because of that, FlashAttention is only able to accelerate ~1/4 of the model.
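
The "~1/4" figure follows directly from the 75% layer split. A minimal sketch, assuming a repeating pattern of three linear-attention layers followed by one full-attention layer and a hypothetical depth of 64 layers; neither the depth nor the exact interleaving comes from the post.

```python
def layer_pattern(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Repeat `linear_per_full` linear-attention layers, then one full-attention layer."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


NUM_LAYERS = 64  # hypothetical depth, for illustration only
layers = layer_pattern(NUM_LAYERS)

full_fraction = layers.count("full") / len(layers)
linear_fraction = layers.count("linear") / len(layers)

print(layers[:8])
print(f"linear-attention layers: {linear_fraction:.0%}")
print(f"softmax-attention layers (where FlashAttention applies): {full_fraction:.0%}")
```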

Ahmad@TheAhmadOsman

How to go about learning all of this?

1st: Start with the serving engine view
- vLLM: PagedAttention, continuous batching, prefix caching, CUDA graphs
- SGLang: RadixAttention/prefix reuse, speculative decoding, MoE, structured/agent workloads
- TensorRT-LLM: NVIDIA peak stack, FP8/FP4, Wide-EP, disaggregated serving
- FlashInfer: reusable kernel/operator library for attention/GEMM/MoE/sampling

2nd: Go down the stack
- Triton tutorials → custom fused kernels
- CUTLASS/CuTe → Tensor Core GEMM and Blackwell/Hopper details
- FlashAttention papers → attention algorithm/kernel co-design
- PagedAttention paper → KV-cache memory management
- MoE docs → routing + grouped GEMM + all-to-all
- Nsight profiling → stop guessing

3rd: Do this mini-project sequence
1. Implement RMSNorm in Triton; compare to PyTorch
2. Implement fused SiLU × gate
3. Implement simple FP16 matmul; compare to cuBLAS/rocBLAS
4. Implement paged KV lookup for decode attention
5. Add FP8 KV cache with per-block scales
6. Implement toy top-k sampling on GPU
7. Implement tiny MoE dispatch + grouped GEMM
8. Integrate one custom op into vLLM or SGLang and profile end-to-end
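
For the first mini-project in the list above, it helps to have a slow but obviously correct reference before writing any kernel. The following is a minimal plain-Python sketch of RMSNorm for a single vector; it is not a Triton kernel, and the `eps` default is an assumed value rather than one taken from any particular model.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x by the reciprocal of its root-mean-square, then by a per-element weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
weight = [1.0, 1.0]
y = rms_norm(x, weight)

# With unit weights, the output should have a mean square close to 1.
output_mean_square = sum(v * v for v in y) / len(y)
print(y, output_mean_square)
```

A fused kernel can then be checked against this reference on small random inputs, using a tolerance suited to the kernel's data type.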

384

33K

arch rock@silverhawk_ny·2d

@TheAhmadOsman can you elaborate on how you reached this conclusion?

Ahmad@TheAhmadOsman·3d

DeepSeek V4 Pro, for how massive it is (1.6T parameters), is quite undertrained (32T tokens). Yes, undertrained. It has less intelligence density than V3.2, which is about 1/3rd of its size.

Ahmad@TheAhmadOsman

Kimi has dethroned DeepSeek
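
The "undertrained" claim above rests on the ratio of training tokens to parameters. A minimal sketch of that arithmetic, using only the two figures quoted in the post (1.6T parameters, 32T tokens); treating total parameter count as the denominator is a simplifying assumption, and the helper name is hypothetical.

```python
def tokens_per_parameter(training_tokens: int, parameters: int) -> float:
    """Crude training-intensity measure: tokens seen per model parameter."""
    return training_tokens / parameters


V4_PRO_PARAMETERS = 1_600_000_000_000   # 1.6T, as quoted in the post
V4_PRO_TOKENS = 32_000_000_000_000      # 32T, as quoted in the post

ratio = tokens_per_parameter(V4_PRO_TOKENS, V4_PRO_PARAMETERS)
print(f"tokens per parameter: {ratio:.1f}")
```

The post does not give the figures behind its comparison with V3.2, so no second ratio is computed here.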

282

84K

arch rock@silverhawk_ny·2d

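@techeconomyana The point is that once this architecture is a suboptimal choice for scaling, then even if it is much cheaper, the gap to top models that keep progressing under optimal scaling may keep widening. In that sense, it is not good that DS was saddled so early with reputation and tasks it should not have had to carry.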

高级分析师@techeconomyana·2d

@silverhawk_ny the bitter lesson

高级分析师@techeconomyana·2d

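My assessment: DeepSeek's iteration cycle stretching from 2 months to 5 months is purely a technical-strategy misjudgment and has nothing to do with Huawei. Too many innovations were stacked at once, which slowed iteration. On attention alone: sliding-window attention (SWA) + Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Plus mHC residual connections. So many fancy tricks. Add NVIDIA Blackwell instability and multiple rollbacks during training, and it was bound to be slow.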

104

27.4K

arch rock reposted

Kaiyue Wen@wen_kaiyue·4d

I won't be at ICLR this year, but @xingyudang will help present Fantastic Optimizers arxiv.org/abs/2509.02046! Stop by Pavilion 4 P4 5309 this afternoon to see what we found in extensive sweeps and, more importantly, what we learned after the paper that led to Hyperball!

108

11.3K

arch rock@silverhawk_ny·3d

@howlemont Makes a lot of sense

皓樂芒@howlemont·3d

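The American mathematician Richard Hamming was a Turing Award winner, a computing pioneer, and the father of error-correcting codes. He said that early on at Los Alamos he met Feynman, Oppenheimer, and others. He admitted he was envious at the time: they were all physicists, so why were these people the giants? Later, at Bell Labs, he kept observing Shannon and others like him: why did some people make it while others only almost made it?

"Luck" explains only half. Which particular problem you crack certainly involves luck. Hamming himself acknowledged that he and Shannon were at Bell Labs in the same period, one working on coding theory and the other on information theory, and that luck did play a part. But people like Einstein and Shannon produced good work again and again. Once can be called a fluke; hitting repeatedly comes down to preparation, courage, and choice. Opportunity drifts past many people, but only a few have already reserved an interface for it in their minds.

He said an important problem is not defined by how big the consequences sound. Time travel, teleportation, antigravity: the consequences would of course be huge, but without a plausible way in they are only material for fantasy. A problem really worth taking on needs both weight and an entry point: if solved, it changes something, and you can find a line of attack now. Many smart people lose here. They stay busy every day, with elegantly designed problems and professional methods, yet they know in their hearts that even when finished, this work is unlikely to lead to anything bigger.

Every Friday from noon on, Hamming set aside "Great Thoughts Time" to discuss only big questions, such as how computers would change science. It sounds like slacking off, but it is really keeping 10% of your time as radar time. If you spend all five days of the week on the small matters in front of you, it is easy to mistake efficiency for direction. You become diligent, reliable, and good at delivering, and ten years later you find you have just been getting more skilled at small problems.

Hamming made another observation: people who work with the door closed are fully shielded from noise, feel comfortable in the short run, and seem to produce more; in the long run, they miss the signals that are unsystematic, cannot be written into a report, yet tell you "the problem has changed". People who work with the door open are interrupted often but are more likely to know where the world is heading. First-rate work needs depth, and it also needs exposure to the flow of real problems.

He said the danger after fame looks a lot like what today's founders, researchers, and content creators face. Once you have made something big, you no longer want to plant small seeds and only want to go straight for big trees. Hamming said that after information theory Shannon may have been trapped by the idea that "the next one must be just as great". Early greatness can instead freeze a person, because you no longer want to work on the small, ugly, unformed starting points that others look down on. Yet big things usually grow from exactly such small starts.

There is another point many people don't like to hear: doing the work is not enough; you have to be able to communicate it. Hamming said scientists dislike the word "sell" and feel good work should naturally be noticed by the world. In reality, everyone is busy with their own affairs. If your writing is unclear, your talks are unclear, and you don't dare speak up in meetings, people will turn the page. So communication is not packaging; it is part of the research. Someone who wants to do first-rate work needs at least three kinds of communication: writing clearly, speaking clearly in formal settings, and speaking clearly in chaotic settings too. Many "hindsight experts" write a report three weeks later proving they saw it right all along, but the moment has passed. Talent deployed too late becomes mere narration.

Pull "greatness" back from big words like talent, environment, and luck to concrete daily actions. Do you have a fixed time to think about big problems? Do you hold 10 to 20 problems that are truly important and possibly attackable? When an opportunity appears, can you see at once which of your old problems it touches? When you do a project, do you turn it into a method for a class of problems rather than handing in a single answer? Are you using your shortcomings as an excuse? Will you work with the larger system, drawing on secretaries, colleagues, bosses, audiences, and organizational processes to fight major campaigns rather than spending your life in small skirmishes?

Frankly, you can say you lack luck, resources, youth, or your boss's support. But have you prepared? Have you chosen the right problem? Have you set aside time for big questions? Do you have the courage to commit and to present your results so that others are willing to stop and listen? In the end, much so-called unrecognized talent may just be a long-term failure to manage oneself.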

123

708

130.5K

arch rock reposted

LMSYS Org@lmsysorg·3d

🚀 We just published a deep technical blog on how SGLang and Miles delivered Day-0 support for DeepSeek-V4. 199 tok/s on B200 (Pro 1.6T), 266 tok/s on H200 (Flash 284B) at 4K context, and throughput stays strong at 900K context (180 and 240 tok/s respectively).

This is the full story behind V4 Pro (1.6T) and Flash (284B): how we built systems for hybrid sparse attention, manifold-constrained hyper-connections (mHC), and FP4 expert weights, plus a full RL training stack that runs at 1.6T scale.

What's covered:
1. Inference (caching and attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
2. Inference (kernels and deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
3. RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
4. Multi-hardware: NVIDIA Hopper, Blackwell, Grace Blackwell, AMD, NPU.
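
To quantify "throughput stays strong" in the post above, here is a minimal sketch that computes how much of the 4K-context throughput survives at 900K context. It uses only the four tok/s figures quoted in the post; the dictionary layout and helper name are illustrative assumptions.

```python
def retention(short_context_tps: float, long_context_tps: float) -> float:
    """Fraction of short-context throughput retained at long context."""
    return long_context_tps / short_context_tps


# (tok/s at 4K context, tok/s at 900K context), as quoted in the post.
reported = {
    "V4 Pro 1.6T on B200": (199, 180),
    "V4 Flash 284B on H200": (266, 240),
}

results = {name: retention(short, long) for name, (short, long) in reported.items()}

for name, kept in results.items():
    print(f"{name}: {kept:.1%} of 4K-context throughput at 900K context")
```

Both configurations keep roughly 90% of their short-context throughput across a 225x increase in context length.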

269

56.9K

arch rock reposted

tender@tenderizzation·4d

dont forget

Aaron@Norapom04

POV: RL kernel engineers who are tasked with implementing the backward pass for DeepSeek V4's attention mechanism

206

30.6K

arch rock reposted

halex@Halex623·4d

Thinking about how to implement this KV Cache is gonna give me nightmares tonight

383

16.5K

arch rock reposted

向阳乔木@vista8·4d

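The paper PDF can be downloaded from the article. The walkthrough is AI-generated and may contain errors; I recommend reading the original paper when you have time: blog.qiaomu.ai/deepseek-v4-te…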

6.6K

arch rock reposted

Katja Sirazitdinova@katjasrz·4d

JAX is great for model code, but fast LLM inference often needs access to optimized GPU kernels. I contributed 2 FlashInfer tutorials showing how to call FlashInfer kernels from JAX via jax-tvm-ffi, including a Gemma 3 example. Download the notebooks and try them 🚀

116

6.1K

arch rock reposted

wh@nrehiew_·5d

How I read papers now. This is an explainer by Claude about the new Compressed Sparse Attention that V4 uses to compress the KV cache.

wh@nrehiew_

Now reading:

700

55K

arch rock reposted

elie@eliebakouch·4d

deepseek V4 new attention is very elegant, i find it similar to NSA (Native Sparse Attention) in principle, but NSA does it in parallel, CSA (Compressed Sparse Attention) is more sequential

elie@eliebakouch

this is so amazing, CSA is a new attention arch close to deepseek NSA imo, but sequential instead of in parallel. NSA had this compression of KV, here it's the same, and then they do DSA, and sliding window to keep good local context (also in NSA)

352

54.5K

arch rock@silverhawk_ny·5d

@hqinjarsy It only shows there are too many people

Han Qin (姓秦，名汉，字大知)@hqinjarsy·5d

10% doesn't sound like much, but it turns out to be more than 8,000 people.

金融汪@yuyy614893671

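Meta plans to lay off 10% of its workforce, about 8,000 employees. According to a memo sent to employees on Thursday, the layoffs will begin on May 20, and the company will cancel plans to fill 6,000 open positions. Meta's latest layoffs follow several smaller rounds, which the company said were necessary to improve efficiency while focusing on generative AI. Source: CNBC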

1.9K

arch rock@silverhawk_ny·5d

@RihardJarc Why don't they just make one cut that is big enough and be done with it?

Rihard Jarc@RihardJarc·5d

@silverhawk_ny I don't think they are even done for the year.

Rihard Jarc@RihardJarc·5d

$META is cutting 10% of its workforce (8,000 roles) and eliminating 6,000 open positions. This will happen across the board in the tech sector and more broadly in knowledge work; Zuck is, as always, just ahead of the pack. Companies are switching labour costs for AI compute.

250

20K
