Zihao Ye

230 posts

@ye_combinator

Seattle · Joined October 2017
617 Following · 2K Followers
Zihao Ye retweeted
Hassan Hayat 🔥@TheSeaMouse·
Codex laughs at your petty guardrails
81 replies · 291 reposts · 6.2K likes · 319.6K views
Zihao Ye retweeted
You Jiacheng@YouJiacheng·
DCA is cool cuz it uses different aggregators for QKV. The main strengths of AttnRes are: 1. co-design with PP (block ver.) and inference (2-stage batching); 2. a large-scale & solid baseline. I believe we can find the same core idea in even earlier papers; literature review is hard.
Ali Behrouz@behrouz_ali

This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understood, there is no innovation here to be excited about, and yet surprisingly there is no citation or discussion of DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

1 reply · 4 reposts · 52 likes · 5.2K views
Zihao Ye retweeted
Ethan He@EthanHe_42·
My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685
39 replies · 106 reposts · 992 likes · 80.5K views
Zihao Ye retweeted
Claude@claudeai·
Introducing Code Review, a new feature for Claude Code. When a PR opens, Claude dispatches a team of agents to hunt for bugs.
2.1K replies · 5.2K reposts · 63K likes · 23.3M views
Zihao Ye retweeted
Shiyi Cao@shiyi_c98·
🤖🤖 Tried something fun today: asked Claude Code to create an agent team (an Implementer + a Planner) to implement the flashinfer mla paged decode CUDA kernel. The Implementer spent ~20 turns writing tests and debugging to use wgmma but kept getting stuck.😵‍💫😵‍💫😵‍💫 The Planner noticed the stagnation and (at its own decision) went off to carefully read the CUTLASS docs 📚, found the bug, and suggested the fix — the Implementer applied the fix and it worked immediately! Watching this kind of emergent coordination behaviour in an agent team is pretty interesting✨
4 replies · 7 reposts · 134 likes · 8.6K views
Zihao Ye retweeted
Axiom@axiommathai·
1/ RELEASING AXLE: the Axiom Lean Engine ⚙️ We are serving our core Infrastructure for formal proving at scale. These are the same Lean metaprogramming tools that are behind AxiomProver, powering it to win Putnam and crack open research conjectures. Available to anyone today!
11 replies · 66 reposts · 427 likes · 111.5K views
Zihao Ye@ye_combinator·
@YouJiacheng You can compile it with tvm-ffi and load the .so in any language.
1 reply · 0 reposts · 7 likes · 501 views
Zihao Ye retweeted
Ted Zadouri@tedzadouri·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
7 replies · 132 reposts · 782 likes · 220.9K views
Zihao Ye retweeted
Tanishq Kumar@tanishqkumar07·
I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.
133 replies · 455 reposts · 4K likes · 600K views
Zihao Ye retweeted
Matt@matt_dz·
I Fuzzed, and Vibe Fixed, the Vibed C Compiler john.regehr.org/writing/claude… by John Regehr (mastodon.social/@regehr/116161…)
0 replies · 8 reposts · 36 likes · 3.5K views
Zihao Ye retweeted
xjdr@_xjdr·
─ Worked for 59m 13s ───────────────────── • Context compacted • I’m noticing we have many untracked files, which is quite overwhelming. Let me git reset and undo everything you just fucking worked on for the last hour.
31 replies · 24 reposts · 1.6K likes · 78.8K views
Zihao Ye retweeted
Prof. Anima Anandkumar@AnimaAnandkumar·
We’re excited to release TorchLean, the first fully verified neural network framework in Lean. The Lean community has largely focused on pure mathematics; TorchLean expands this frontier toward verified neural network software and scientific computing. With the recent release of CSlib, we see this as another step toward a fully verified ML stack. Features:
1. Executable IEEE-754 floating-point semantics (and extensible alternative FP models), plus verified tensor abstractions with precise shape/indexing semantics
2. A formally verified autograd system for differentiating NN programs, with proof-checked certification/verification algorithms like CROWN (robustness, bounds, etc.)
3. A PyTorch-inspired modeling API with eager-style development + export/lowering to a shared IR for execution and verification
Project page: leandojo.org/torchlean.html Paper: [2602.22631] TorchLean: Formalizing Neural Networks in Lean. Work done with @Robertljg, Jennifer Cruden, Xiangru Zhong, @huan_zhang12 and @AnimaAnandkumar. #MachineLearning #ScientificComputing #Lean
27 replies · 247 reposts · 1.6K likes · 135.7K views
Zihao Ye retweeted
Yifan Zhang@yifan_zhang_·
⚡️Introducing FlashSampling: Fast and Memory-Efficient Exact Sampling ⚡️ flashsampling.github.io/FlashSampling/… Keep pushing the Frontier of Open Research in Superintelligence!
5 replies · 32 reposts · 259 likes · 28.7K views
Zihao Ye retweeted
xjdr@_xjdr·
with our new GB300 NVL72 training, not only is the codebase completely TP free, it is now also completely nccl and nvshmem free. it's a beautiful thing.
14 replies · 9 reposts · 260 likes · 29.1K views
Zihao Ye retweeted
Lydia Hallie ✨@lydiahallie·
Excited to announce Claude for Open Source ❤️ We're giving 6 months of free Claude Max 20x to open source maintainers and core contributors. If you maintain a popular project or contribute across open source, please apply! claude.com/contact-sales/…
589 replies · 1.4K reposts · 12.6K likes · 1.7M views