Ke Li 🍁

209 posts

@KL_Div

Assistant Professor of Computing Science @SFU. Ph.D. from @Berkeley_EECS and Bachelor's from @UofTCompSci. Formerly @GoogleAI and Member of @the_IAS.

Vancouver, Canada · Joined June 2019
403 Following · 6.3K Followers
Pinned Tweet
Ke Li 🍁 @KL_Div
Diffusion models turn the data into a mixture of isotropic Gaussians, and so struggle to capture the underlying structure when trained on small datasets. In our new #ECCV2024 paper, we introduce RS-IMLE, a generative model that gets around this issue.
Website: serchirag.github.io/rs-imle
Code: github.com/SerChirag/rs-i…
Joint work w/ @researchirag and @PengShichong
If you are at #ECCV2024, come and check out poster 279 on Thursday afternoon from 4:30pm-6:30pm. (1/6) Thread 👇
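For readers new to this line of work, here is a toy sketch of the plain IMLE objective that RS-IMLE builds on (this is vanilla IMLE, not the rejection-sampling variant from the paper; the setup and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def imle_loss(real, fake):
    """Vanilla IMLE: match each real point to its nearest generated
    sample and average the squared distances. Training would pull
    those nearest fakes toward their real points."""
    # pairwise squared distances, shape (n_real, n_fake)
    d2 = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()   # nearest fake per real point

real = rng.normal(size=(8, 2))
fake = rng.normal(size=(64, 2))
loss = imle_loss(real, fake)
```

Because every real point always gets matched to some sample, the objective cannot ignore rare data points, which is why IMLE-style methods are attractive on small datasets.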
Ke Li 🍁 @KL_Div
@MingyuanZhou Agreed - normalization as an idea is definitely not new (cf. Sinkhorn iterations). I was merely pointing out that this difference relative to GMMN could explain why GMMN didn't take off.
Mingyuan Zhou @MingyuanZhou
@KL_Div In our NeurIPS 2021 paper, we define (approximate) forward and backward conditional transport by row- or column-normalizing the distance matrix between true and fake samples, yielding balanced mode-covering and mode-seeking behaviors.
Ke Li 🍁 @KL_Div
To be fair to the authors, I think the normalization of the kernel is key. If the normalization weren't there, the kernel would not depend on other samples. In that case, the drift would be the same regardless of whether (1) all fake samples are far away from a real sample (which is common at the beginning of training), or (2) one fake sample is much closer to a real sample compared to other fake samples (which is common later on in training). One would want the drift to be large in the former case and small in the latter case. But without normalization, there would be no way to make that happen.
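This point can be illustrated numerically with a toy Gaussian kernel in raw sample space (the bandwidth and setup here are my own, not from the paper):

```python
import numpy as np

def drift_on_first_fake(fake, real, bandwidth=1.0, normalize=True):
    """Toy drift on fake sample 0 toward a single real sample, with
    Gaussian-kernel weights computed over all fake samples."""
    d2 = ((fake - real) ** 2).sum(-1)           # each fake's distance to real
    w = np.exp(-d2 / (2 * bandwidth ** 2))      # unnormalized kernel values
    if normalize:
        w = w / w.sum()                         # weights now sum to 1
    return w[0] * (real - fake[0])

real = np.zeros(2)
# Case 1: all fake samples far from the real sample (early training).
all_far = np.full((4, 2), 5.0)
# Case 2: fake 0 still far, but the other fakes are much closer (late training).
one_near = np.vstack([np.full((1, 2), 5.0), np.full((3, 2), 0.5)])

big = np.linalg.norm(drift_on_first_fake(all_far, real))
small = np.linalg.norm(drift_on_first_fake(one_near, real))
tiny = np.linalg.norm(drift_on_first_fake(all_far, real, normalize=False))
```

With normalization, fake 0 gets a large pull in case 1 and a negligible one in case 2, as desired; without normalization, the pull is negligible in both cases because the raw kernel value only depends on fake 0's own distance.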
Ivan Skorokhodov@isskoro

The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact it can be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in the paper in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (to higher-diversity / higher-resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:
1. Sample random noise z ~ N(0, I).
2. Feed it to the generator and get a fake sample x' = G(z).
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push toward the nearest ones the most).
5. To make sure we don't get any sort of mode collapse, repel each fake sample from the other fake samples via the same scheme.
6. Profit.

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging normalization/scaling into the similarity scores, or CFG.

Why didn't GMMN take off, and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets).

You can surely get informative similarities at batch size 4096 on the object-centric, limited-diversity ImageNet with a ResNet-50 feature encoder, but for something like video generation, we train on hundreds of millions of videos or, at high resolutions and larger model sizes, with a batch size of 1 per GPU (and I'm not sure inter-GPU distance computations will be fast). From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery used to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG, which the authors already did). But I guess what I like most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
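The recipe described above can be sketched in a few lines. This is a toy version in raw sample space rather than an encoder's feature space; the temperature, learning rate, and exact repulsion term are illustrative choices, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(d2, tau=1.0):
    """Row-normalized similarity weights from squared distances."""
    w = np.exp(-d2 / tau)
    return w / w.sum(axis=1, keepdims=True)

def drifting_step(fake, real, lr=0.5, tau=1.0):
    """One toy update: attract each fake sample toward the real batch
    with distance-based softmax weights, and repel it from the other
    fake samples with the same kind of weights."""
    d2r = ((fake[:, None] - real[None]) ** 2).sum(-1)
    attract = softmax_rows(d2r, tau) @ real - fake
    d2f = ((fake[:, None] - fake[None]) ** 2).sum(-1)
    np.fill_diagonal(d2f, np.inf)        # don't repel a sample from itself
    repel = fake - softmax_rows(d2f, tau) @ fake
    return fake + lr * (attract + repel)

real = rng.normal(loc=3.0, size=(16, 2))   # real data clustered around (3, 3)
fake = rng.normal(size=(16, 2))            # fakes start near the origin
x = fake.copy()
for _ in range(30):                        # fakes drift onto the real cluster
    x = drifting_step(x, real)
```

Iterating the step moves the fake batch toward the real samples while the repulsion keeps it spread out, which is the mode-covering behavior the post describes.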

Ke Li 🍁 @KL_Div
To clarify, my post was in response to @isskoro's post, which was on the relationship between drifting models and GMMN and why GMMN didn't take off. I agree with you that the idea of normalizing kernels is not new; I was merely pointing out that this difference could explain why GMMN didn't take off.
Peyman Milanfar @docmilanfar
@KL_Div Normalizing the kernel is standard when you’re computing weighted averages. That’s nothing that distinguishes the approach at all.
Ke Li 🍁 @KL_Div
@jon_barron I think they use minibatches, so n is the batch size, but yes, the computational complexity is quadratic. It's indeed possible to use fast nearest neighbor approximations - in fact we did that in IMLE.
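To make the per-minibatch cost concrete (batch size and feature dimension here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 256, 64                          # minibatch size, feature dim
fake = rng.normal(size=(B, d))
real = rng.normal(size=(B, d))

# Normalizing the kernel needs every fake-real pair within the batch,
# so the distance matrix is (B, B): O(B^2 * d) per step, where B is
# the batch size rather than the full dataset size n.
d2 = ((fake[:, None, :] - real[None, :, :]) ** 2).sum(-1)
```

For moderate batch sizes this pairwise matrix is cheap next to the network's forward/backward pass; approximate nearest-neighbour search only becomes relevant as B grows.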
Jon Barron @jon_barron
@KL_Div Does a normalization requirement mean O(n^2) cost where n is training dataset size? Barring fast nearest neighbor approximations etc.
Ke Li 🍁 @KL_Div
Thanks for pointing out the similarity between drifting and Implicit Maximum Likelihood Estimation! I worked out the mathematical connection - the crux is that drifting fields are similar to the gradient of a soft version of the IMLE loss. So drifting is defined in terms of the gradient, whereas IMLE is defined in terms of the objective, but the behaviour should be similar. It's reminiscent of the Newtonian vs. Lagrangian formulations of classical mechanics from physics. One difference is that in drifting the weights on the positive samples and the negative samples are different, whereas they are the same in IMLE. It'd be interesting to see if the negative weights can be replaced with positive weights.
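One way to see the softened-loss view (my own toy formulation, not the exact math from either paper): per real point, replace the hard min over fake samples with a soft min. The resulting gradient weights then look like a drift field's similarity weights:

```python
import numpy as np

def softmin_weights(d2, tau):
    """Gradient weights of the softened per-real-point loss
    L = -tau * log(sum_j exp(-d2_j / tau)): a softmin over the
    fake samples' squared distances to one real point."""
    w = np.exp(-(d2 - d2.min()) / tau)   # subtract the min for stability
    return w / w.sum()

d2 = np.array([1.0, 4.0, 9.0])           # three fakes' distances to one real
soft = softmin_weights(d2, tau=1.0)      # weight spread over several fakes
hard = softmin_weights(d2, tau=1e-3)     # collapses onto the nearest fake,
                                         # recovering the hard IMLE min
```

As the temperature tau tends to zero, the weights concentrate on the nearest fake sample, so the softened gradient reduces to the gradient of the hard nearest-neighbour IMLE loss.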
Hansheng Chen@HanshengCh

Cool new paper by @Goodeat258 and Kaiming's team! arxiv.org/abs/2602.04770 Reminds me of @KL_Div's Implicit Maximum Likelihood Estimation paper

Deepak Pathak @pathak2206
At @SkildAI, we’ve raised $1.4B, bringing our valuation to over $14B. We’re on a generational mission, and I’m grateful to be working alongside an exceptional team. Thanks to our investors for the long-term conviction towards omni-bodied intelligence 🚀 bloomberg.com/news/articles/…
Skild AI@SkildAI

Announcing Series C We’ve raised $1.4B, valuing the company at over $14B With this capital, we will accelerate our mission to build omni-bodied intelligence 🚀 skild.ai/blogs/series-c

Ke Li 🍁 @KL_Div
As shown, the approach significantly outperforms recent baselines. 4/5
Ke Li 🍁 @KL_Div
If you are at #ICCV2025, check out our work on interpolating between two states of a 3D scene with large motion at poster 269 on Tuesday afternoon. It proposes a general-purpose method that can disambiguate points with similar appearance. Website: junrul.github.io/gmc/ 1/5
Ke Li 🍁 @KL_Div
@janusch_patas You might be interested in this: zvict.github.io/papr/ (essentially the tails of Gaussians cause vanishing gradients, and you can get around it by learning an interpolation kernel)
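The vanishing-tail-gradient point is easy to check numerically (the kernel form and σ here are illustrative, not the parameterization from the linked paper):

```python
import numpy as np

def gaussian_grad_mag(d, sigma=1.0):
    """|d/dd exp(-d^2 / (2 sigma^2))|: the gradient magnitude a point
    receives through a Gaussian at distance d from its center.
    It decays like d * exp(-d^2 / (2 sigma^2)), so it vanishes
    rapidly in the tails."""
    return np.abs(d) / sigma ** 2 * np.exp(-d ** 2 / (2 * sigma ** 2))

near = gaussian_grad_mag(1.0)    # O(1) gradient near the center
far = gaussian_grad_mag(10.0)    # essentially zero in the tail
```

A point ten standard deviations away receives effectively no gradient, which is why a learned interpolation kernel with better-behaved tails can help.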
MrNeRF @janusch_patas
Official Launch of the MrNeRF 3DGS Bounty 2: We're offering 🏆 $1600 + $500 bonus for improving initialization & training without densification for 3D Gaussian Splatting! RT & tag friends who might crush this. Details in thread 👇
Ke Li 🍁 @KL_Div
How can LLMs be made to handle longer contexts efficiently? Most prior methods require retraining - can we do without? In IceFormer, we showed how, by repurposing Prioritized DCI, a nearest neighbour search algorithm. Find out more at @Mao_Yuzhen's oral at the ICML LCFM workshop tomorrow at 12:15pm (longcontextfm.github.io/schedule/). Details are available at yuzhenmao.github.io/IceFormer/.
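The core idea of sparsifying attention with a nearest-neighbour search can be sketched as follows. Brute-force top-k stands in for the actual k-NN index here, and this is only the flavour of the approach, not IceFormer's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_attention(q, K, V, k):
    """Toy sparse attention for one query: softmax over only the k
    keys with the largest scores (found here by brute force; a k-NN
    index would replace this step for long contexts)."""
    scores = K @ q
    top = np.argpartition(-scores, k - 1)[:k]   # indices of the top-k scores
    w = np.exp(scores[top] - scores[top].max()) # stable softmax over top-k
    w /= w.sum()
    return w @ V[top]

n, d = 4096, 32                                 # context length, head dim
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
sparse = topk_attention(q, K, V, k=32)          # looks at 32 of 4096 keys
```

Because softmax weights decay exponentially in the score gap, the few highest-scoring keys dominate the output, which is what makes restricting attention to approximate nearest neighbours a reasonable approximation.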
Ke Li 🍁 retweeted
SFU School of Computing Science
What a day at the Vision & Learning Workshop at #ICML2025! From an incredible lineup of speakers to lightning talks on recent advances in machine learning, researchers shared insights into the future of AI research. A huge thank you to everyone who made it a success!
Bolei Zhou @zhoubolei
I've officially become an Associate Professor with tenure at @UCLA @UCLAengineering as we kick off the new academic year on July 1! Deepest gratitude to my mentors, my amazing students, and wonderful collaborators. Incredible journey so far—more exciting research ahead! 🚀
Ke Li 🍁 retweeted
Felix (Yuxiang) Fu @felix_yuxiang
Interested in how to generate realistic human trajectories using diffusion with just one step 🤔? This is now possible with MoFlow, a one-step Flow Matching method paired with Implicit Maximum Likelihood Estimation-based distillation. 🚀 Join us at #CVPR2025 in Nashville!
Ke Li 🍁 @KL_Div
The case with positive ε is a bit more subtle because ε should be dynamic rather than static. For example, if we consider a decreasing sequence of ε, the middle mode of the trimodal distribution would move towards x_i as ε decreases, and eventually the middle mode and the mode corresponding to x_i would merge. In this analysis, for a given ε, we should actually compare two trimodal distributions: the one that you considered, and another where the middle mode is closer to x_i than in the first distribution. As long as the distance from the middle mode to x_i is greater than ε, the latter would attain a smaller loss than the former. So as ε tends to zero, the middle mode tends to x_i. Whether the empirical data distribution is always globally optimal with ε=0 or a decreasing sequence of ε is a good question. Previously we analyzed the loss in the unconditional ε=0 setting from the perspective of approximating likelihood (akin to how variational inference is justified, namely as a lower bound on the log-likelihood): proceedings.mlr.press/v202/aghabozor…… If you are interested in discussing further, feel free to shoot me an email.
Siddharth Ancha @siddancha
You're absolutely right! I was previously ignoring higher order terms, which are exponentially small in m, and pretending that both losses were equal (i.e. zero). So I was wrong about it being a counterexample for ε=0. But for all ε>0, the tri-modal distribution indeed has a lower loss (≈1) than the bimodal one (≈2), because the lowest order terms are different. Then I wonder: for a well-chosen ε (or simply for ε=0), is the empirical data distribution always globally optimal under the IMLE/C-RS-IMLE loss? Could we prove such a theorem, or do some counterexamples exist? 🤔
Krishan Rana @krshnrana
Are Diffusion and Flow Matching the best generative modelling algorithms for behaviour cloning in robotics?
✅ Multimodality
❌ Fast, Single-Step Inference
❌ Sample Efficient
💡 We introduce IMLE Policy, a novel behaviour cloning approach that can satisfy all the above. 🧵👇