Yunfan Zhang
@z4y5f3

PhD Student in Computer Science/NLP @Columbia | @DukeU '20

New York City · Joined June 2016
151 Following · 68 Followers

50 posts
Yunfan Zhang@z4y5f3·
@teortaxesTex @_xjdr This is vibe research. They trained the 3B model for only 0.1B tokens and the validation loss is > 5.0. You need many more pretraining tokens to draw any conclusions.
0 replies · 0 reposts · 4 likes · 355 views
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
> Engram is just regularization
> Gains come from context-aware gating (dynamic feature adjustment) & an extra residual pathway (better gradient flow), NOT memory content

insane if real. @_xjdr can you reproduce?
Zhihu Frontier@ZhihuFrontier

DeepSeek Engram: External Memory Claim Debunked — Disguised Regularization
Insights from Zhihu contributor 栀染 👇

🔍 Our Research
We tested DeepSeek Engram (reasoning-knowledge separation architecture) — its "billion-parameter external N-gram memory table" was hyped as a game-changer for knowledge lookup. Tech insight: a myth.

🧪 Controlled Experiments
• Real: original Engram (trained memory table)
• Randomize: memory = Gaussian white noise (frozen, no updates)
• Uniform: all tokens hash to 1 shared vector (no meaningful lookup)
• Dense baseline: pure Transformer, no Engram branch

😱 Shocking Result & Tech Insight
Real > Uniform ≈ Randomize >>>>> Dense baseline
Even "garbage memory" (noise/shared vector) CRUSHES the baseline. Engram is just regularization in memory's clothing — the memory table is a dummy payload, not a real knowledge store.

📊 Scale Validation (3B Model)
• Results hold at 3B scale: no strong retrieval behavior even on factual prompts.
• The advantage persists during training, with pathway dominance (not memory content) becoming more obvious.

🚀 Core Tech Takeaway
• Gains come from context-aware gating (dynamic feature adjustment) & an extra residual pathway (better gradient flow), NOT memory content.
• No complex CPU offload/PCIe optimizations needed — 1 vector/random noise = same performance; GPU memory drops from hundreds of GB to bytes.

📚 Reproduction & logs: transparency.chunjiang.dev
🔗 Full response (CN): zhuanlan.zhihu.com/p/202641983237…

#LLM #DeepSeek #AI #Tech

12 replies · 7 reposts · 155 likes · 32.3K views
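To make the ablation concrete, here is a minimal PyTorch sketch of a gated-residual memory branch with the three memory variants described above. Module, argument, and mode names are my own assumptions for illustration; this is not DeepSeek's actual Engram code.

```python
import torch
import torch.nn as nn

class MemoryBranchSketch(nn.Module):
    """Toy memory branch for the ablation (all names/shapes illustrative).

    mode="real":      trainable memory table (the full Engram claim)
    mode="randomize": frozen Gaussian-noise table (no learned content)
    mode="uniform":   one shared vector that every token hashes to
    """
    def __init__(self, d_model: int, table_size: int, mode: str = "real"):
        super().__init__()
        if mode == "uniform":
            table_size = 1  # every token maps to the same single vector
        # Only the "real" variant trains the table; the ablations freeze it.
        self.table = nn.Parameter(torch.randn(table_size, d_model),
                                  requires_grad=(mode == "real"))
        self.gate = nn.Linear(d_model, d_model)  # context-aware gating

    def forward(self, h: torch.Tensor, token_hashes: torch.Tensor) -> torch.Tensor:
        mem = self.table[token_hashes % self.table.shape[0]]  # the "lookup"
        # Gating + an extra residual pathway: the two components the thread
        # credits with the gains, independent of what `mem` contains.
        return h + torch.sigmoid(self.gate(h)) * mem
```

If the thread's result is right, swapping `mode` between the three variants should barely move validation loss, while removing the branch entirely (the dense baseline) should clearly hurt.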
Yunfan Zhang reposted
Sarah Haar@Sarah_Haar_·
Stop worshiping celebrities & rich people. Worship empathy. Worship effort. Worship peace.
163 replies · 3.6K reposts · 18.2K likes · 190.5K views
Yunfan Zhang@z4y5f3·
@frontier_foid My top two guesses are: (1) A cascading, coarse-to-fine attention where an indexer first looks at compressed global information, and then picks finer details / individual tokens? (2) Something interesting in terms of how to organize multiple attention heads, maybe?
0 replies · 0 reposts · 2 likes · 277 views
qt cache🪷@frontier_foid·
just a little treat for my 100 sweet followers 💖

- the whale is cooking... that sparse attention is literally going to be the most complex architecture we’ve seen in a frontier model yet. actually insane.
- also mla is cancelled. they dropped it. too much complexity
7 replies · 3 reposts · 86 likes · 28.8K views
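To make guess (1) concrete, here is a toy single-query sketch of cascading coarse-to-fine attention: a coarse "indexer" scores mean-pooled key blocks, then exact attention runs only over the tokens inside the top-scoring blocks. All function and parameter names are hypothetical; this illustrates the guess, not DeepSeek's actual design.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_attention(q, k, v, block=16, top_blocks=4):
    """q: (d,) single query; k, v: (T, d) with T divisible by `block`."""
    T, d = k.shape
    k_coarse = k.view(T // block, block, d).mean(dim=1)  # compressed global view
    scores = k_coarse @ q                                # coarse indexer scores
    keep = scores.topk(min(top_blocks, T // block)).indices
    # Expand the chosen blocks back into individual token indices.
    tok = (keep[:, None] * block + torch.arange(block)).flatten()
    w = F.softmax(k[tok] @ q / d**0.5, dim=0)            # exact attention on the subset
    return w @ v[tok]

out = coarse_to_fine_attention(torch.randn(64), torch.randn(256, 64), torch.randn(256, 64))
```

The indexer only touches T/block compressed entries, and the fine step only top_blocks·block tokens, which is the usual payoff of a coarse-to-fine cascade.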
Yunfan Zhang@z4y5f3·
@frontier_foid @YouJiacheng After consulting DSv3 and GPT-5, @YouJiacheng is actually right about bi-directional all-gather. You just split a tensor in half: one half travels clockwise along the ring while the other travels counter-clockwise. Now both links are active 100% of the time.
1 reply · 0 reposts · 1 like · 84 views
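A pure-Python simulation of that trick (naming mine, not from the thread): the low half of each rank's shard rotates clockwise, the high half counter-clockwise, so both directions of every ring link carry traffic on each of the n-1 steps.

```python
def bidirectional_ring_all_gather(shards):
    """Simulate all-gather on a ring of n ranks with halved shards."""
    n = len(shards)
    lo = [{r: shards[r][: len(shards[r]) // 2]} for r in range(n)]
    hi = [{r: shards[r][len(shards[r]) // 2 :]} for r in range(n)]
    for t in range(n - 1):
        for r in range(n):
            src = (r - t) % n                    # lo-half forwarded clockwise
            lo[(r + 1) % n][src] = lo[r][src]
            src = (r + t) % n                    # hi-half forwarded counter-clockwise
            hi[(r - 1) % n][src] = hi[r][src]
    # Every rank reassembles both halves of every shard.
    return [[lo[r][s] + hi[r][s] for s in range(n)] for r in range(n)]

shards = [[f"{r}a", f"{r}b"] for r in range(4)]  # each rank owns a 2-element shard
out = bidirectional_ring_all_gather(shards)
assert all(out[r][s] == [f"{s}a", f"{s}b"] for r in range(4) for s in range(4))
```

Each direction moves half the bytes over the same n-1 steps, so the collective's wall-clock time roughly halves versus a unidirectional ring.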
You Jiacheng@YouJiacheng·
okay, it turns out TPU Ironwood (4×8×8=256 chips pod) uses pure FSDP for DeepSeek-V3 training. DGX GB300 (256 chips, probably 4× NVL72?) uses (DP, PP, EP) = (4, 2, 32) with 8 pipeline stages and 32 micro batches. Tokens/sec/chip comparison is 2446 vs. 3848 in BF16.
[tweet media]
Chubby♨️@kimmonismus

AI chip “top speed” metrics (FLOPS) look great, but real training is limited by memory, data movement, and synchronization bottlenecks. Steve Nouri on LinkedIn claims vendor results show Blackwell Ultra delivering up to 1.9× higher training performance per chip vs the Ironwood TPU on heavy workloads, largely due to Nvidia’s integrated hardware + interconnect + software stack. NVIDIA still has a moat and they are clearly using it.

8 replies · 12 reposts · 200 likes · 34.2K views
Yunfan Zhang@z4y5f3·
@frontier_foid @YouJiacheng I am just thinking FSDP gets rid of the pipeline bubble in PP and the expert load-balancing problem in EP, so it is a cleaner implementation. I have not trained anything close to 600B MoEs (would love to, but no GPUs) so I could be wrong.
0 replies · 0 reposts · 2 likes · 71 views
Yunfan Zhang@z4y5f3·
@YouJiacheng Yes, 1.2 TB/s is the sum of 3 axes for TPUv7. Still reading the literature on bidirectional bandwidth with FSDP + ring topology, but thanks for the clarification!
0 replies · 0 reposts · 0 likes · 25 views
You Jiacheng@YouJiacheng·
@z4y5f3 *probably the 1.2TB/s is the sum of 3 axes, I don't know, but it's still higher than 0.9TB/s.
1 reply · 0 reposts · 4 likes · 487 views
Yunfan Zhang@z4y5f3·
@deanwball I don’t agree that AGI is impossible, but the article raises a good question beyond that: if our remaining science/eng problems are set by physical limits, what’s the benefit of racing toward AGI, aside from mass unemployment and concentrating power in a few Bay Area companies?
0 replies · 0 reposts · 0 likes · 65 views
Dean W. Ball@deanwball·
@z4y5f3 I know who Tim is! I think he’s wrong, as experts in their fields often are (as people often are). The entire point of my rebuttal is “he’s making narrow technical claims and extrapolating them wildly and with far too much confidence.”
1 reply · 0 reposts · 0 likes · 100 views
Dean W. Ball@deanwball·
finally read this and found myself unpersuaded:

1. I find it wild to assert that "GPUs aren't improving"--I get the point he is making, but find it pedantic. Measured in *precisely* the way he chooses to measure it, fine, maybe? But there are lots of other ways to measure this. People have continued to buy GPUs, right? Are they all dumb?

2. I find it even more wild to assert that the "transformer is close to physically optimal," ignoring the huge number of improvements that have been made since Vaswani et al. It is doubly wild to then assert that "Chinese models show you can be much more efficient" than the Western models. A big part of the way Chinese models have demonstrated efficiency gains is through architectural improvements. You can't have it both ways.

3. The author asserts that "efficient large-scale deployments are largely a solved problem," but again, look at points 1 and 2. Are the US companies investing in inference optimization stupid? And what about the inference optimizations made by the Chinese companies whose efficiency Dettmers applauds?

4. The author asserts a speculation only partially supported in evolutionary biology (when tech people invoke evolutionary biology, be warned; tautology often lurks) with a supreme confidence not merited by the evidence for his claim--namely, that "human brains are optimal" (lots of things that the author is familiar with are "optimal," it seems).

5. Dettmers starts out on the wrong footing, attributing a broad set of views to "Oxford-style" thinking, rationalism, EA, etc. Fine for twitter and conversation, but for a post predicated on "I, The Author, Am The Smart One Here, Who Understands The World As It Is," you should not rely in your thesis statement upon mood affiliation, throwing food at the uncool kids, and similar cheap shots.

The fundamentally correct observation is "scaling doesn't get you there," which was (maybe, kind of?) an edgy take 15 months ago (though one many observers of machine learning history, which is filled with changes in paradigm, did in fact make), but the take no longer feels very new. Is anyone, except maybe Anthropic, really saying "scaling alone gets you there"? These types of arguments, imo, are pretty tired.

It's clear the neural networks will keep getting better. "AGI" may just be a straw man at this point, as this post in some ways demonstrates. What does "we won't build the strawman" actually imply about the future? It seems like it implies "The Oxford-Style People, Whom I Insulted Earlier, Are Not As Smart As Me." Not an especially interesting point.

Instead of talking about what we will not build, perhaps it is more interesting to discuss, affirmatively, what we *will* build? But this requires thinking about things which are not "optimal." What is this post actually telling you, concretely, about the future, other than "who deserves more status and who deserves less of it"?

And that's all well and good. Perhaps "the scaling bros" have a little too much intellectual clout. Debatable, but certainly not crazy. But then: imagine if Dettmers, who apparently believes GPUs made by companies like nvidia basically stopped getting better in 2018, had been investing your retirement savings for the last decade.
Tim Dettmers@Tim_Dettmers

My new blog post discusses the physical reality of computation and why this means we will not see AGI or any meaningful superintelligence: timdettmers.com/2025/12/10/why…

16 replies · 25 reposts · 339 likes · 111.5K views
Yunfan Zhang@z4y5f3·
@deanwball Finally, Tim is one of the best researchers on ML efficiency. He is also one of the ML researchers who understand hardware best. It does take a lot of knowledge and expertise to appreciate what he said.
1 reply · 0 reposts · 0 likes · 85 views
Yunfan Zhang@z4y5f3·
@deanwball I do think you can brute-force AGI in the next few years by throwing trillions at compute. But it’s also possible that even a self-improving AGI couldn’t push science or engineering meaningfully. We’re already near the physical limits of many things, at least asymptotically.
1 reply · 0 reposts · 0 likes · 82 views
Yunfan Zhang@z4y5f3·
@ID_AA_Carmack NVIDIA de-rates BF16 with FP32 accumulate (used by PyTorch) to half rate on their consumer GPUs. If DGX Spark is doing half of the quoted BF16 TFLOPS, then this falls in line with their consumer GPUs. NVIDIA consumer GPUs do not reach their max TDP in BF16 with FP32 accumulate.
0 replies · 0 reposts · 9 likes · 1.9K views
John Carmack@ID_AA_Carmack·
DGX Spark appears to be maxing out at only 100 watts power draw, less than half of the rated 240 watts, and it only seems to be delivering about half the quoted performance (assuming 1 PF sparse FP4 = 125 TF dense BF16). It gets quite hot even at this level, and I saw a report of spontaneous rebooting on a long run, so was it de-rated before launch?
111 replies · 93 reposts · 1.2K likes · 259.6K views
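Carmack's BF16 figure is just repeated halvings of the headline number: 1 PF sparse FP4 → 500 TF dense FP4 (remove 2:1 sparsity) → 250 TF FP8 → 125 TF dense BF16. To probe the de-rating Yunfan describes, a rough matmul throughput check looks like the sketch below (assumes a CUDA device; matrix size and iteration count are arbitrary choices of mine):

```python
import time
import torch

def bf16_tflops(n=8192, iters=50):
    """Measure dense BF16 matmul throughput. PyTorch accumulates in FP32,
    which is exactly the path NVIDIA de-rates on consumer GPUs."""
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(5):                    # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    # 2*n^3 FLOPs per n x n matmul
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12

print(f"{bf16_tflops():.1f} TFLOPS dense BF16 (FP32 accumulate)")
```

Comparing the measured number against the datasheet figure, while watching power draw, is enough to see whether a given part is de-rated on this path.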
Yunfan Zhang@z4y5f3·
@abeirami Sorry for reviving the old thread but this write-up is excellent! I am wondering whether PPO with multiple rollouts could be viewed as a form of implicit reward calibration?
1 reply · 0 reposts · 1 like · 89 views
Ahmad Beirami@abeirami·
The main ingredient that led to GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt. Let me elaborate on what I mean by that and a cheaper way of doing it offline.
[tweet media]
11 replies · 53 reposts · 658 likes · 117.6K views
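A concrete reading of that calibration point: GRPO scores each rollout against its own group's mean and standard deviation, so the advantage is calibrated per prompt rather than by a learned value model. A minimal sketch (function name and epsilon are my own):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group's own mean/std, computed over rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one prompt with binary verifier rewards:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]
```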
Yunfan Zhang@z4y5f3·
@atulit_gaur Here is a more advanced version of this interview problem: why do DeepSeek V3 MLPs have shapes of [2048, 7168]? Hints are in the V3 paper.
0 replies · 1 repost · 3 likes · 2.7K views
atulit@atulit_gaur·
Fun question to ask in an ML interview: “Why do embedding dimensions come in neat sizes like 768 or 1024, but never 739?” If they can't answer it, it's fine, but if they do, you've stumbled upon a real gem.
140 replies · 85 reposts · 4.6K likes · 932.2K views
Yunfan Zhang@z4y5f3·
@JingyuanLiu123 Are they referring to TP=8/16 and EP=256? The same token still routes to different devices at different layers so there's comm overhead, but there should be no bubbles.
0 replies · 0 reposts · 0 likes · 488 views
JingyuanLiu@JingyuanLiu123·
I was literally shocked by the huge LLM infra diff in US vs China and GPU vs TPU... I was chatting with senior folks about how global-batch aux loss is hard under the ppvp constraint, as basically you have to do all-f-all-b to get the fi correct, and that's challenging for peak memory. So DeepSeek's aux-free bias is an awesome design. Then he just told me: oh, you do not need PP on TPU for a K2 or DSV3 scale model and still achieve great MFU. I was shocked and could not figure out how to do the parallelism. Could TPU experts tell me about that?
9 replies · 9 reposts · 346 likes · 62K views
Yunfan Zhang@z4y5f3·
@Stephan_Talk Matches my experience. DeepSeek V3.1 + Claude Code understands complex codebases (in this case, verl) much better than I do. The only drawbacks seem to be the 128K context window and that certain Claude Code tools are not supported by the API.
1 reply · 0 reposts · 8 likes · 1.8K views
Stephan@Stephan_Talk·
After a day of writing code with deepseek v3.1 + Claude Code, my feeling is that it's better than Cursor + GPT-5: the few-words-gets-it-done type, with good tool-call accuracy and good code quality.

One thing worth pointing out: if you pair it with Claude Code, it's best to use DeepSeek's official API, because it natively supports the Anthropic protocol. Going through claude-code-router + openrouter degrades results, because claude-code-router performs two protocol conversions:

1. CC input (Anthropic protocol) -> ccr converts the Anthropic request to the OpenAI-compatible protocol -> openrouter (OpenAI protocol) -> deepseek (OpenAI protocol)
2. deepseek (OpenAI protocol) -> openrouter -> ccr converts the OpenAI response back to the Anthropic protocol -> CC output

Connecting directly to the official API needs none of these conversions, leaving less room for error:

1. CC input (Anthropic protocol) -> deepseek (Anthropic protocol)
2. deepseek (Anthropic protocol) -> CC output (Anthropic protocol)
Stephan@Stephan_Talk

deepseek v3.1 is out, and it explicitly supports the Anthropic protocol: a few environment variables are all it takes to hook Claude Code up to deepseek-3.1. As shown in the image, this update focuses on the model's coding and tool-calling abilities; its SWE-Bench score is well above both V3 and R1. Official docs for connecting to CC: api-docs.deepseek.com/guides/anthrop… I also had Gemini Deep Research survey kimi-k2/GLM4.5/Qwen3 support for the CC protocol, see: docs.google.com/document/u/1/d…

9 replies · 32 reposts · 234 likes · 48.9K views
Yunfan Zhang@z4y5f3·
Congratulations to my PhD advisor Prof. Kathleen McKeown on the ACL Lifetime Achievement Award! She is an inspiring, knowledgeable, and deeply kind researcher and mentor. I feel incredibly fortunate to be her student and can’t imagine doing my PhD anywhere else. Well deserved!
ACL 2026@aclmeeting

🕊️ Lifetime Achievement Award at #ACL2025NLP A standing ovation for Prof. Kathy McKeown, recipient of the ACL 2025 Lifetime Achievement Award! 🌟

0 replies · 0 reposts · 4 likes · 650 views
Yunfan Zhang@z4y5f3·
@ahmaurya @Yuchenj_UW Ideas are better when open. Just look at how Kimi was able to adopt OSS research done by others (MLA, DeepSeekMoE, RLVR, GRPO, Muon, etc) and improve their own models. Post-training OSS models unlocks a great amount of economic value too (e.g. Perplexity runs on Llama, DeepSeek)
1 reply · 0 reposts · 2 likes · 92 views
Shawn Woodstock@ahmaurya·
@Yuchenj_UW Mark's primary goal with OSS was to have the ecosystem make the model better for Meta. He stated this outright in a podcast. That might've worked, except that OSS in AI doesn't allow open participation; that's still gated by owning enough compute. So he's now scaling compute.
1 reply · 0 reposts · 1 like · 382 views
Yuchen Jin@Yuchenj_UW·
Looks like Meta is turning into another OpenAI. As a Chinese, I never thought we’d end up relying on China to keep open source AI alive.
138 replies · 110 reposts · 3K likes · 251.7K views
Yunfan Zhang@z4y5f3·
@nrehiew_ RL is 2x (GRPO) to 3x (PPO) more computationally intensive than SFT on a per-token basis. RL also overfits less, so you can train for more epochs. xAI may also try to use a large rollout size to train on the hardest problems. I would imagine 1-2M RL samples, potentially much less.
0 replies · 0 reposts · 5 likes · 584 views
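A back-of-envelope version of that 2x-3x figure, in units of one policy forward pass. The accounting below is my own rough assumption (backward ≈ 2× forward, one generated rollout token per trained token, reward/reference models policy-sized), not something from the tweet:

```python
FWD, BWD = 1, 2                       # backward pass costs ~2x a forward pass
sft = FWD + BWD                       # supervised step on the policy
grpo = sft + FWD + FWD                # + rollout generation + reference forward
ppo = sft + FWD + FWD + (FWD + BWD)   # + rollout + reward/ref pass + value fwd+bwd
print(grpo / sft, ppo / sft)          # ~1.7x and ~2.7x: the 2-3x ballpark
```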