Shuo Yang

80 posts

@Andy_ShuoYang

2nd-year PhD at Berkeley; Efficient ML Systems

Berkeley · Joined February 2023
149 Following · 866 Followers
Pinned Tweet
Shuo Yang
Shuo Yang@Andy_ShuoYang·
Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…
47
236
1.6K
866K
Shuo Yang retweeted
Qiuyang Mang
Qiuyang Mang@MangQiuyang·
Roadmap to FrontierCS 2.0 is live. If continual learning and AI auto-research are going to matter, benchmarks need to test more than one-shot answers. FrontierCS 2.0 moves open-ended evaluation toward feedback-driven environments, repo-level tasks, and controlled evaluator interaction. Example: Try your own agent on the Erdős unit distance conjecture disproven by @OpenAI in FrontierCS 2.0:
Qiuyang Mang@MangQiuyang

x.com/i/article/2065…

4
12
47
7.1K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
@AiDevCraft Totally agree. It needs complex algorithm-system co-design
2
0
0
22
AiDevCraft
AiDevCraft@AiDevCraft·
The KV-cache framing is right but undersells the consequence: once the world model needs hierarchical or external memory, it stops being a single net and becomes a system with a storage tier. Retrieval policy, eviction, write-through caching — all of these turn from infra problems into ML objectives.
1
0
1
12
Shuo Yang retweeted
Yichuan Wang
Yichuan Wang@YichuanM·
Welcome FlashLib to the LEANN (github.com/StarTrail-org/…) party 🚀 Huge shoutout to legend @Andy_ShuoYang: FlashLib (github.com/FlashML-org/fl…) now brings a strong GPU backend to LEANN, supporting a wide range of applications in the LEANN ecosystem. The PR is merged. Check out what happened here: github.com/StarTrail-org/…
Shuo Yang@Andy_ShuoYang

FlashLib update: we now support ANN search with IVF-Flat — up to 6.5× faster than cuVS on real-world vector workloads (SIFT-1M) while matching recall. LEANN now supports FlashLib as a backend: 26× faster build, 29× faster single-query, and 298× faster batch search. Huge thanks to @YichuanM for the help! We’re also opening Discord / Slack channels — join us to suggest new operators you want to see, and hardware backends you want FlashLib to support next! Slack: join.slack.com/t/flashml/shar… Discord: discord.gg/ce5Xa5pf

2
5
24
4K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
FlashLib update: we now support ANN search with IVF-Flat — up to 6.5× faster than cuVS on real-world vector workloads (SIFT-1M) while matching recall. LEANN now supports FlashLib as a backend: 26× faster build, 29× faster single-query, and 298× faster batch search. Huge thanks to @YichuanM for the help! We’re also opening Discord / Slack channels — join us to suggest new operators you want to see, and hardware backends you want FlashLib to support next! Slack: join.slack.com/t/flashml/shar… Discord: discord.gg/ce5Xa5pf
6
16
101
377.6K
Shuo Yang retweeted
Chenfeng_X
Chenfeng_X@Chenfeng_X·
LEANN (github.com/StarTrail-org/…) + FlashLib (flashml-org.github.io) = 26x faster LEANN. Looking forward to more X + FlashLib!
1
5
22
2.1K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
@EdenTan20 Really appreciate it! I haven’t customized the agent loop much yet. I’m still mostly using Cursor for development. But I definitely want to explore more customized agent loops in the future😁
0
0
1
38
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@Andy_ShuoYang Thanks for the great work! I wonder if you used any custom agent loop for writing the CuTe DSL kernels? It seems like a lot of work😃
1
0
1
61
levi
levi@levidiamode·
@Andy_ShuoYang Super interesting, curious to see how this extends to Blackwell GPUs!
1
0
5
989
Shuo Yang
Shuo Yang@Andy_ShuoYang·
That’s a great question — we’ve tested across a range of embedding shapes and also built dedicated stress tests and microbenchmarks for KNN beyond the 198 workload cells. KNN is actually a very good example of how to design and optimize a complex operator: from a roofline perspective, the workload shifts between memory-bound (few queries) and compute-bound (many queries), which means a single static kernel is fundamentally not enough. In practice, we rely on a large set of kernel variants — for large-query regimes we use designs inspired by FlashAttention, while for small-query regimes we use FlashDecoding-style approaches. We also make decisions like register vs shared-memory sorting based on K, and design tile size heuristics based on hardware cache sizes (with additional optimizations on Hopper). tbh KNN is probably the operator I’ve spent the most time on so far.
1
0
2
637
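The roofline argument in the reply above can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch, not FlashLib's actual heuristic: the function names, the idealized one-read-per-vector traffic model, and the `ASSUMED_BALANCE` value are all assumptions for illustration, and the model ignores top-K selection and output writes.

```python
def knn_arithmetic_intensity(num_queries, num_points, dim, bytes_per_elem=4):
    """FLOPs per byte for a brute-force distance pass with ideal data reuse."""
    # One multiply and one add per (query, point, dimension) triple.
    flops = 2.0 * num_queries * num_points * dim
    # Assumption: every query and database vector is read from memory once.
    bytes_moved = bytes_per_elem * dim * (num_queries + num_points)
    return flops / bytes_moved


def classify_regime(num_queries, num_points, dim, machine_balance):
    """Label a workload by comparing its intensity to the machine balance."""
    intensity = knn_arithmetic_intensity(num_queries, num_points, dim)
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


# Hypothetical machine balance in FLOPs per byte, not a measured hardware figure.
ASSUMED_BALANCE = 100.0

for q in (1, 16, 4096):
    print(q, classify_regime(q, 1_000_000, 128, ASSUMED_BALANCE))
```

Under this model the intensity simplifies to roughly Q·N / (2·(Q + N)) for 4-byte elements, so it grows with the number of queries until it crosses the machine balance. That crossover is one way to see why a single static kernel would not serve both regimes.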
Lindsay Rex
Lindsay Rex@WaveTheoryUK·
How did you verify data distributions and edge-case pathologies across the 198 workload cells? After months optimizing KNN myself, I've seen how easy it is to overfit speed claims on synthetic/curated inputs; NVIDIA's general-purpose algos (cuML etc.) are tuned for robustness across messy real-world distributions, high-D noise, varying densities, outliers, etc. Specialized kernels can shine on clean cases but degrade or lose correctness elsewhere. Curious about your test suite: real embeddings (e.g., from CLIP/LLM), skewed clusters, adversarial shapes, or just grid search over N/D/K? Any public repo details on distribution coverage?
1
0
1
862
Shuo Yang
Shuo Yang@Andy_ShuoYang·
Good question — most of it comes from reformulation on our side. For TruncatedSVD we use a dual-path approach that reduces the problem to much smaller eigendecompositions, instead of operating on the full matrix. On top of that, we select the most efficient eigh variant based on workload and tolerance, which gives additional gains.
2
0
6
1.4K
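One textbook way to reduce a truncated SVD to a much smaller eigendecomposition is to work on the Gram matrix of whichever side of the input is shorter. The sketch below assumes that reading of "dual-path"; it is not confirmed to be FlashLib's method, and `truncated_svd_gram` and `_top_eigh` are hypothetical names. It also assumes the top-k singular values are nonzero, and forming the Gram matrix squares the condition number, so accuracy suffers on ill-conditioned inputs.

```python
import numpy as np


def _top_eigh(gram, k):
    """Largest k eigenpairs of a symmetric PSD matrix, in descending order."""
    evals, evecs = np.linalg.eigh(gram)  # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    return np.clip(evals[order], 0.0, None), evecs[:, order]


def truncated_svd_gram(a, k):
    """Top-k singular triplets (u, s, vt) of `a` via the smaller Gram matrix."""
    n, d = a.shape
    if d <= n:
        # Path 1 (tall input): eigendecompose the d x d matrix A^T A.
        evals, v = _top_eigh(a.T @ a, k)
        s = np.sqrt(evals)
        u = (a @ v) / s  # assumes the top-k singular values are nonzero
    else:
        # Path 2 (wide input): eigendecompose the n x n matrix A A^T.
        evals, u = _top_eigh(a @ a.T, k)
        s = np.sqrt(evals)
        v = (a.T @ u) / s
    return u, s, v.T


# Illustrative usage on random data.
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 32))
u, s, vt = truncated_svd_gram(a, k=8)
print(u.shape, s.shape, vt.shape)
```

For a 1000 × 32 input the eigendecomposition runs on a 32 × 32 matrix rather than the full matrix, which is where the cost saving in this kind of reformulation comes from.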
GeekPark
GeekPark@GeekParkHQ·
@Andy_ShuoYang The 208× speedup on TruncatedSVD stands out. Is that more a reflection of FlashLib doing something clever, or cuML having a particularly weak baseline there?
2
0
3
1.5K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
@xtwirer Really appreciate it! We’re actively expanding the operator set and will share a more detailed technical report on the underlying designs. We’re also thinking about setting up Slack/Discord — in the meantime, feel free to open issues for operators you’d like to see!
0
0
3
955
xtwirer.account
xtwirer.account@xtwirer·
@Andy_ShuoYang Incredible work Shuo! Thanks for sharing. What other implementations are on the roadmap? Would it be possible to implement a k-d tree? What papers are the algorithm implementations based on? Are you planning on creating a Discord server for FlashLib?
1
0
4
1.1K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
@eigentopology @CShorten30 Thanks, appreciate it! A lot of the gains come from FlashAttention-style ideas — avoiding large intermediate materialization and reducing atomic contention, which turns out to apply broadly across KMeans, KNN, MultinomialNB, t-SNE, etc.
0
0
9
1.2K
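The idea of avoiding large intermediate materialization can be illustrated on the KMeans assignment step. This is a CPU-side NumPy sketch under stated assumptions, not FlashLib's kernel: `assign_tiled` and its `tile` parameter are hypothetical names, and a real GPU kernel would presumably also tile over points and keep the running minimum in registers. The sketch never builds the full N × K distance matrix; it scans centroids in tiles and keeps only a running best distance and index per point.

```python
import numpy as np


def assign_tiled(points, centroids, tile=256):
    """Nearest-centroid index per point, scanning centroids tile by tile."""
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    p_sq = np.einsum("ij,ij->i", points, points)  # squared norm of each point
    rows = np.arange(n)
    for start in range(0, centroids.shape[0], tile):
        c = centroids[start:start + tile]
        c_sq = np.einsum("ij,ij->i", c, c)
        # Squared distances for this tile only: shape (n, len(c)).
        d = p_sq[:, None] - 2.0 * (points @ c.T) + c_sq[None, :]
        local = np.argmin(d, axis=1)
        local_dist = d[rows, local]
        # Strict comparison keeps the earliest centroid on exact ties.
        improved = local_dist < best_dist
        best_dist[improved] = local_dist[improved]
        best_idx[improved] = start + local[improved]
    return best_idx


# Illustrative usage on random data.
rng = np.random.default_rng(0)
labels = assign_tiled(rng.standard_normal((10_000, 16)),
                      rng.standard_normal((1_000, 16)), tile=128)
print(labels.shape)
```

Peak extra memory here scales with N × tile instead of N × K, which is the same trade FlashAttention makes by streaming over blocks instead of storing the full score matrix.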
Shuo Yang
Shuo Yang@Andy_ShuoYang·
We care a lot about real workloads beyond benchmarks. For example, giving FlashLib to a coding agent leads to ~5× system-level speedup when building a GPU vector DB. Also, Flash-KMeans is already used inside video generation pipelines for attention sparsification and KV-cache compression github.com/svg-project/Sp…
1
0
13
1.6K
Utkarsh Singh
Utkarsh Singh@Utkarsh51557661·
@Andy_ShuoYang impressive numbers, but hype often oversells real-world performance. benchmarks are good, but what about actual use cases?
1
0
3
1.6K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
We also asked a different question: can FlashLib help an AI coding agent build faster systems? We gave Claude Code the same GPU vector-search task under the same 1M-token budget, changing only whether FlashLib was available. With FlashLib, the agent reached 6.2× higher QPS on offline batch search and 5.2× higher QPS on streaming search, while finishing under budget. Without FlashLib, it plateaued and exhausted the budget.
4
5
42
3.5K
Shuo Yang
Shuo Yang@Andy_ShuoYang·
The goal is not a one-off kernel trick. We benchmarked FlashLib across 13 primitives × 198 workload cells against cuML 25.10 on H200. FlashLib is faster or tied on 197 / 198 cells, with 126 cells >5× and 12 cells >50×.
1
2
38
4.2K