
🌟 Highlights:
- Blackwell Support: Early Blackwell support via FFA_FA4 backend, leveraging HSTU function & R2P optimizations for next-gen hardware.
- Native Group Collective: DeepEP-inspired fused kernels (GroupCast/GroupReduce) to break RDMA bottlenecks and achieve zero-redundancy communication.
- Mask-Agnostic SOTA: Consistently high throughput even for irregular, complex masks, with no OOM or performance drops in "hard mode."
- System-Level Synergy: Dispatch Solver + Adaptive Overlap together ensure near-linear scalability on H100 & B200 clusters.
