Shuo Yang

AiDevCraft@AiDevCraft·4d

The KV-cache framing is right but undersells the consequence: once the world model needs hierarchical or external memory, it stops being a single net and becomes a system with a storage tier. Retrieval policy, eviction, write-through caching — all of these turn from infra problems into ML objectives.

Haocheng Xi@HaochengXiUCB

Shuo Yang@Andy_ShuoYang·5d

Memory is always the most important part of a world model

New blog post: The Forgetting Wall in Video and World Models.

Long-horizon video generation is not just limited by compute. It is limited by how much of its own past the model can afford to remember. I wrote about why long videos drift, why the KV cache becomes the memory bottleneck, and why compression is a key direction for future video/world models. haochengxi.github.io/posts/forgetti…

Shuo Yang retweeted

Yichuan Wang@YichuanM·4 Jun

@Andy_ShuoYang github.com/StarTrail-org/… Thanks for contributing to LEANN! This is a very strong GPU backend for ANN, bringing fast and scalable vector search to a wide range of applications.

Shuo Yang retweeted

Yichuan Wang@YichuanM·3 Jun

Welcome FlashLib to the LEANN (github.com/StarTrail-org/…) party 🚀 Huge shoutout to legend @Andy_ShuoYang: FlashLib (github.com/FlashML-org/fl…) now brings a strong GPU backend to LEANN, supporting a wide range of applications in the LEANN ecosystem. The PR is merged. Check out what happened here: github.com/StarTrail-org/…

Shuo Yang@Andy_ShuoYang·3 Jun

GitHub: github.com/FlashML-org/fl…

Shuo Yang@Andy_ShuoYang·3 Jun

FlashLib update: we now support ANN search with IVF-Flat — up to 6.5× faster than cuVS on real-world vector workloads (SIFT-1M) while matching recall. LEANN now supports FlashLib as a backend: 26× faster build, 29× faster single-query, and 298× faster batch search. Huge thanks to @YichuanM for the help! We’re also opening Discord / Slack channels — join us to suggest new operators you want to see, and hardware backends you want FlashLib to support next! Slack: join.slack.com/t/flashml/shar… Discord: discord.gg/ce5Xa5pf
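
For readers unfamiliar with IVF-Flat, the idea can be sketched in a few lines of NumPy: cluster the vectors with a coarse k-means, keep one inverted list of row ids per centroid, and at query time scan only the lists nearest the query. This is a CPU illustration of the general technique, not FlashLib's or LEANN's API; the function names (`sq_dists`, `build_ivf_flat`, `search_ivf_flat`) and parameters (`n_lists`, `n_probe`) are hypothetical.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a (m, d) and b (n, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff * diff).sum(axis=-1)


def build_ivf_flat(x, n_lists, n_iters=10, seed=0):
    """Coarse k-means quantizer plus one inverted list of row ids per centroid."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = sq_dists(x, centroids).argmin(axis=1)
        for j in range(n_lists):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    assign = sq_dists(x, centroids).argmin(axis=1)
    lists = [np.flatnonzero(assign == j) for j in range(n_lists)]
    return centroids, lists


def search_ivf_flat(x, centroids, lists, query, k, n_probe):
    """Scan only the n_probe lists whose centroids are closest to the query."""
    probe = np.argsort(sq_dists(query[None, :], centroids)[0])[:n_probe]
    candidates = np.concatenate([lists[j] for j in probe])
    # "Flat" means the raw vectors are compared directly, with no compression.
    dists = sq_dists(query[None, :], x[candidates])[0]
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]
```

Raising `n_probe` trades speed for recall; with `n_probe == n_lists` every list is scanned and the result matches exact brute-force search.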

Shuo Yang retweeted

Chenfeng_X@Chenfeng_X·3 Jun

LEANN (github.com/StarTrail-org/…) + FlashLib (flashml-org.github.io) = 26x faster LEANN. Looking forward to more X + FlashLib!

Shuo Yang@Andy_ShuoYang·31 May

@EdenTan20 Really appreciate it! I haven’t customized the agent loop much yet. I’m still mostly using Cursor for development. But I definitely want to explore more customized agent loops in the future😁

Wenxuan Tan@EdenTan20·30 May

@Andy_ShuoYang Thanks for the great work! I wonder if you used any custom agent loop for writing the CuTe DSL kernels? It seems like a lot of work😃

Shuo Yang@Andy_ShuoYang·27 May

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…

Shuo Yang@Andy_ShuoYang·28 May

@laenorca_stille I have used flash-kmeans for video diffusion: github.com/svg-project/Sp… We will try with image diffusion in the near future!

Dank AlvaradoSpawn@laenorca_stille·28 May

@Andy_ShuoYang Have you used these operators with image diffusion?

Shuo Yang@Andy_ShuoYang·27 May

@levidiamode We will test on blackwell soon!

levi@levidiamode·27 May

@Andy_ShuoYang super interesting, curious to see how this extends to blackwell gpus!

Shuo Yang@Andy_ShuoYang·27 May

That’s a great question — we’ve tested across a range of embedding shapes and also built dedicated stress tests and microbenchmarks for KNN beyond the 198 workload cells. KNN is actually a very good example of how to design and optimize a complex operator: from a roofline perspective, the workload shifts between memory-bound (few queries) and compute-bound (many queries), which means a single static kernel is fundamentally not enough. In practice, we rely on a large set of kernel variants — for large-query regimes we use designs inspired by FlashAttention, while for small-query regimes we use FlashDecoding-style approaches. We also make decisions like register vs shared-memory sorting based on K, and design tile size heuristics based on hardware cache sizes (with additional optimizations on Hopper). tbh KNN is probably the operator I’ve spent the most time on so far.
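
The roofline argument in this reply can be made concrete with a back-of-the-envelope estimate. The sketch below assumes brute-force float32 KNN in which each query and database vector is read from memory once, and it uses placeholder hardware numbers rather than measured specs; `Device`, `knn_arithmetic_intensity`, and `pick_regime` are hypothetical names, not FlashLib code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float     # FLOP/s, illustrative placeholder
    mem_bandwidth: float  # bytes/s, illustrative placeholder

    @property
    def balance(self) -> float:
        """FLOPs the device can perform per byte it can stream from memory."""
        return self.peak_flops / self.mem_bandwidth


def knn_arithmetic_intensity(n_queries, n_points, dim, k, bytes_per_elem=4):
    """FLOPs per byte for brute-force KNN, assuming ideal reuse of inputs."""
    flops = 2.0 * n_queries * n_points * dim  # one multiply-add per element pair
    bytes_in = bytes_per_elem * (n_queries * dim + n_points * dim)
    bytes_out = 8 * n_queries * k             # 4-byte index + 4-byte distance
    return flops / (bytes_in + bytes_out)


def pick_regime(n_queries, n_points, dim, k, device):
    """Classify a workload by comparing its intensity with the device balance."""
    intensity = knn_arithmetic_intensity(n_queries, n_points, dim, k)
    return "compute-bound" if intensity >= device.balance else "memory-bound"
```

With one query against a million 128-dimensional vectors the intensity is roughly 0.5 FLOP/byte, far below any plausible device balance, while a few thousand queries push it into the thousands. That gap is why a single static kernel struggles to serve both regimes.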

Lindsay Rex@WaveTheoryUK·27 May

How did you verify data distributions and edge-case pathologies across the 198 workload cells? After months optimizing KNN myself, I've seen how easy it is to overfit speed claims on synthetic/curated inputs. NVIDIA's general-purpose algos (cuML etc.) are tuned for robustness across messy real-world distributions, high-D noise, varying densities, outliers, etc. Specialized kernels can shine on clean cases but degrade or lose correctness elsewhere. Curious about your test suite: real embeddings (e.g., from CLIP/LLM), skewed clusters, adversarial shapes, or just grid search over N/D/K? Any public repo details on distribution coverage?

Shuo Yang@Andy_ShuoYang·27 May

Good question — most of it comes from reformulation on our side. For TruncatedSVD we use a dual-path approach that reduces the problem to much smaller eigendecompositions, instead of operating on the full matrix. On top of that, we select the most efficient eigh variant based on workload and tolerance, which gives additional gains.
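
One common way to reduce a truncated SVD to a small eigendecomposition is to work on whichever Gram matrix is smaller, `X.T @ X` or `X @ X.T`. The NumPy sketch below shows that general idea under the assumption that "dual-path" means something along these lines; it is not FlashLib's implementation, and `truncated_svd_via_gram` is a hypothetical name. It also assumes the top `k` singular values are nonzero.

```python
import numpy as np


def truncated_svd_via_gram(x, k):
    """Top-k SVD of x (n, d) via an eigendecomposition of the smaller Gram matrix."""
    n, d = x.shape
    if d <= n:
        # Tall path: eigenvectors of the (d, d) matrix x.T @ x are the right
        # singular vectors; eigenvalues are the squared singular values.
        evals, evecs = np.linalg.eigh(x.T @ x)
        top = np.argsort(evals)[::-1][:k]
        s = np.sqrt(np.clip(evals[top], 0.0, None))
        vt = evecs[:, top].T
        u = (x @ vt.T) / s            # recover left vectors: u_i = x v_i / s_i
    else:
        # Wide path: eigenvectors of the (n, n) matrix x @ x.T are the left
        # singular vectors.
        evals, evecs = np.linalg.eigh(x @ x.T)
        top = np.argsort(evals)[::-1][:k]
        s = np.sqrt(np.clip(evals[top], 0.0, None))
        u = evecs[:, top]
        vt = (u.T @ x) / s[:, None]   # recover right vectors: v_i = x.T u_i / s_i
    return u, s, vt
```

The trade-off is numerical: forming a Gram matrix squares the condition number, so small singular values lose accuracy compared with a direct SVD.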

GeekPark@GeekParkHQ·27 May

@Andy_ShuoYang The 208× speedup on TruncatedSVD stands out. Is that more a reflection of FlashLib doing something clever, or cuML having a particularly weak baseline there?

Shuo Yang@Andy_ShuoYang·27 May

@xtwirer Really appreciate it! We’re actively expanding the operator set and will share a more detailed technical report on the underlying designs. We’re also thinking about setting up Slack/Discord — in the meantime, feel free to open issues for operators you’d like to see!

xtwirer.account@xtwirer·27 May

@Andy_ShuoYang Incredible work, Shuo! Thanks for sharing. What other implementations are on the roadmap? Would it be possible to implement a k-d tree? What papers are the algorithm implementations based on? Are you planning on creating a Discord server for FlashLib?

Shuo Yang@Andy_ShuoYang·27 May

@eigentopology @CShorten30 Thanks, appreciate it! A lot of the gains come from FlashAttention-style ideas — avoiding large intermediate materialization and reducing atomic contention, which turns out to apply broadly across KMeans, KNN, MultinomialNB, t-SNE, etc.
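
To illustrate what "avoiding large intermediate materialization" can look like for KMeans, here is a minimal NumPy sketch of the assignment step that streams over centroid tiles and keeps a running best distance per point, in the spirit of FlashAttention's running statistics. It is an assumption about the general pattern, not FlashLib's kernel; `assign_tiled` and `tile` are hypothetical names, and the atomic-contention side of the reply has no NumPy analogue and is not shown.

```python
import numpy as np


def assign_tiled(x, centroids, tile=256):
    """Nearest-centroid index per row of x, never forming the full (n, k) matrix."""
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)
    rows = np.arange(n)
    for start in range(0, centroids.shape[0], tile):
        c = centroids[start:start + tile]
        # Squared distances to this tile only: an (n, tile) block at most.
        d = x_sq[:, None] - 2.0 * (x @ c.T) + (c * c).sum(axis=1)[None, :]
        local = d.argmin(axis=1)
        local_dist = d[rows, local]
        better = local_dist < best_dist   # running min across tiles
        best_dist[better] = local_dist[better]
        best_idx[better] = local[better] + start
    return best_idx
```

Peak temporary memory scales with `n * tile` instead of `n * k`; a GPU kernel would typically tile over points as well and keep the running minimum in registers.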

Sergio Charles@eigentopology·27 May

@Andy_ShuoYang @CShorten30 Really cool! What was your process for speeding up the operators?

Shuo Yang@Andy_ShuoYang·27 May

We care a lot about real workloads beyond benchmarks. For example, giving FlashLib to a coding agent leads to ~5× system-level speedup when building a GPU vector DB. Also, Flash-KMeans is already used inside video generation pipelines for attention sparsification and KV-cache compression github.com/svg-project/Sp…

Utkarsh Singh@Utkarsh51557661·27 May

@Andy_ShuoYang impressive numbers, but hype often oversells real-world performance. benchmarks are good, but what about actual use cases?

Shuo Yang@Andy_ShuoYang·27 May

We also asked a different question: can FlashLib help an AI coding agent build faster systems? We gave Claude Code the same GPU vector-search task under the same 1M-token budget, changing only whether FlashLib was available. With FlashLib, the agent reached 6.2× higher QPS on offline batch search and 5.2× higher QPS on streaming search, while finishing under budget. Without FlashLib, it plateaued and exhausted the budget.

Shuo Yang@Andy_ShuoYang·27 May

The goal is not a one-off kernel trick. We benchmarked FlashLib across 13 primitives × 198 workload cells against cuML 25.10 on H200. FlashLib is faster or tied on 197 / 198 cells, with 126 cells >5× and 12 cells >50×.
