Ke Li 🍁

217 posts


@KL_Div

Assistant Professor of Computing Science @SFU. Ph.D. from @Berkeley_EECS and Bachelor's from @UofTCompSci. Formerly @GoogleAI and Member of @the_IAS.

Vancouver, Canada Joined June 2019
415 Following 6.4K Followers
Pinned Tweet
Ke Li 🍁
Ke Li 🍁@KL_Div·
Introducing WIMLE, a model-based RL method that substantially improves sample efficiency and asymptotic performance on hard tasks. Rather than assuming a Gaussian world model, WIMLE trains a world model with IMLE. Joint w/ @mehranag, @Moazeni_Alireza, @yszhang170. See 👇 for links.
1
7
29
3.9K
Xiaolong Wang
Xiaolong Wang@xiaolonw·
Excited to share that Assured Robot Intelligence (ARI) has joined @Meta to help build the future of humanoid intelligence! When we started ARI one year ago, our mission was clear: achieve physical AGI. Through deep customer engagements and real-world deployments, it became clear to us that serving the massive opportunity ahead requires training a truly general-purpose physical agent. We believe this agent will be humanoid — and that scaling will come from learning directly from human experience, not teleoperation alone. Meta’s ecosystem brings together the key components needed to make this vision possible. We will be joining Meta Superintelligence Labs (MSL) to help bring personal superintelligence into the physical world. We are incredibly grateful to the brilliant minds, robotics researchers, engineers, partners, and supporters who have worked with us on this journey. Thank you to our investors and angels, led by @aixventureshq , for believing in our mission. This is just the beginning.
Bloomberg@business

Meta Platforms Inc. has acquired Assured Robot Intelligence, a startup developing artificial intelligence models for robots, as part of a major initiative to build humanoid technology. bloomberg.com/news/articles/…

113
57
696
193.2K
Pulkit Agrawal
Pulkit Agrawal@pulkitology·
Eka means unity -- “one,” in Sanskrit and “first” in Finnish. We’re building intelligence for the physical world in its native language: forces. Until now, robotics faced a tradeoff — generality or speed. The real world requires both. Robotics also faced a data problem. Our Vision–Force–Action (VFA) model — the first of its kind — breaks the generality-speed tradeoff and the data barrier. It's a new foundation uniting performance, generality, and safety for putting capable robots in everyone's hands. Today, I am excited to share our journey of pushing robots beyond human limits. Today, dexterity becomes scalable. Today, I welcome you to the Era of Eka. Co-founded with @haarnoja, and so thrilled and grateful to be working with a dream team at @EkaRobotics. Learn more: ekarobotics.com
65
221
2K
315.5K
Vishnu
Vishnu@tatavishnurao·
@KL_Div I have had this thought, but seeing that it is already at ICLR is truly good. Is there a GitHub repo I can look into for contributions, @KL_Div?
1
0
1
521
Ke Li 🍁
Ke Li 🍁@KL_Div·
LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly sacrificing accuracy? IceCache is a new method for managing KV caches that leverages Dynamic Continuous Indexing (DCI) to efficiently group and retrieve tokens by semantics. Joint work w/ @Mao_Yuzhen, @q1tong and Martin Ester. For details, check out the links below.
5
15
215
20.8K
Ke Li 🍁
Ke Li 🍁@KL_Div·
@MingyuanZhou Agreed - normalization as an idea is definitely not new (cf. Sinkhorn iterations). I was merely pointing out that this difference relative to GMMN could explain why GMMN didn't take off.
0
0
1
184
Mingyuan Zhou
Mingyuan Zhou@MingyuanZhou·
@KL_Div In our NeurIPS 2021 paper, we define (approximate) forward and backward conditional transport by row- or column-normalizing the distance matrix between true and fake samples, yielding balanced mode-covering and mode-seeking behaviors.
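The row versus column normalization described in the reply above can be sketched as follows. This is only an illustration and not the paper's definition: the squared Euclidean distance, the softmax form, the `temperature` parameter, and the function name `normalized_transport_weights` are all assumptions made for this example.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def normalized_transport_weights(real, fake, temperature=1.0):
    """Return the distance matrix and its two normalizations.

    real: (n, d) array of true samples, fake: (m, d) array of generated samples.
    The names and the softmax form are hypothetical, chosen for illustration.
    """
    # Pairwise squared distances, shape (n, m): rows index real, columns index fake.
    d = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)
    # Normalize each row: every real sample spreads unit mass over the fakes,
    # so no real sample is ignored (a mode-covering flavor).
    per_real = softmax(-d / temperature, axis=1)
    # Normalize each column: every fake sample spreads unit mass over the reals,
    # so every fake is tied to some real sample (a mode-seeking flavor).
    per_fake = softmax(-d / temperature, axis=0)
    return d, per_real, per_fake
```

Which of the two directions the paper calls forward and which backward is not stated in the reply, so the sketch leaves them unnamed.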
1
0
2
217
Ke Li 🍁
Ke Li 🍁@KL_Div·
To be fair to the authors, I think the normalization of the kernel is key. If the normalization weren't there, the kernel would not depend on other samples. In that case, the drift would be the same regardless of whether (1) all fake samples are far away from a real sample (which is common at the beginning of training), or (2) one fake sample is much closer to a real sample compared to other fake samples (which is common later on in training). One would want the drift to be large in the former case and small in the latter case. But without normalization, there would be no way to make that happen.
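A small numerical check of the two cases in the post above, as a sketch under stated assumptions: a Gaussian kernel on distance, one real sample, three fake samples, and normalization taken across the fake samples. The function name `kernel_weights` and the `temperature` parameter are hypothetical, and the paper's actual kernel may differ.

```python
import numpy as np

def kernel_weights(dists, temperature=1.0):
    """Weights linking ONE real sample to several fake samples.

    dists: distances from the real sample to each fake sample.
    Returns (unnormalized, normalized) weights; the Gaussian kernel is an assumption.
    """
    raw = np.exp(-dists ** 2 / temperature)   # depends only on each pair
    return raw, raw / raw.sum()               # normalized across the fake samples

# Case (1): all fake samples are far from the real sample.
raw_far, norm_far = kernel_weights(np.array([5.0, 5.0, 5.0]))
# Case (2): one fake sample is much closer than the others.
raw_close, norm_close = kernel_weights(np.array([0.1, 5.0, 5.0]))

# Compare the weight of the second fake sample (distance 5.0 in both cases).
print(raw_far[1], raw_close[1])    # identical without normalization
print(norm_far[1], norm_close[1])  # large in case (1), near zero in case (2)
```

Without normalization the far sample gets the same tiny weight in both cases. With normalization it gets a weight of 1/3 when every fake is far, and almost none once another fake already sits next to the real sample.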
Ivan Skorokhodov@isskoro

The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact, it can actually be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in the paper in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (for higher diversity / higher resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:

1. Sample random noise z ~ N(0, I).
2. Feed it to the generator and get a fake sample x' = G(z).
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push to the nearest ones the most).
5. To make sure that we don't have any sort of mode collapse, repel each fake sample from other fake samples via the same scheme.
6. Profit.

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging in normalization/scaling in the similarity scores or CFG.

Why didn't GMMN take off and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), or the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets). You can sure get informative similarities for 4096 batch size on the object-centric, limited diversity ImageNet with ResNet-50 feature encoder, but for something like video generation, we train on hundreds of millions of videos or, at high resolutions + larger model sizes, with a batch size of 1 per GPU (not sure if it will be fast to do inter-GPU distance computations).

From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG which the authors already did).

But I guess what I like the most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
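The numbered steps in the quoted post can be sketched in a few lines of NumPy. This is a toy illustration of the scheme as the post describes it, not the paper's implementation: `toy_generator`, `drift_update`, the identity feature encoder, the softmax weighting, and the direct update of samples (rather than of generator parameters) are all assumptions made for this example.

```python
import numpy as np

def softmax_rows(logits):
    """Numerically stable softmax over each row."""
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def toy_generator(z):
    """Hypothetical stand-in for G: a fixed scaling of the noise."""
    return 0.5 * z

def drift_update(fake, real, temperature=1.0, step=0.1):
    """Move fake samples toward real ones and away from each other.

    fake: (m, d) with m >= 2, real: (n, d). Features are the raw coordinates.
    """
    # Step 3: similarity of every fake sample to every real sample.
    d2_real = ((fake[:, None, :] - real[None, :, :]) ** 2).sum(-1)   # (m, n)
    w_real = softmax_rows(-d2_real / temperature)
    # Step 5: the same scheme against the OTHER fake samples (self excluded).
    d2_fake = ((fake[:, None, :] - fake[None, :, :]) ** 2).sum(-1)   # (m, m)
    np.fill_diagonal(d2_fake, np.inf)
    w_fake = softmax_rows(-d2_fake / temperature)
    # Step 4: attraction toward the similarity-weighted mean of real samples.
    attract = w_real @ real - fake
    # Step 5: repulsion away from the similarity-weighted mean of other fakes.
    repel = fake - w_fake @ fake
    return fake + step * (attract + repel)

rng = np.random.default_rng(0)
real = rng.normal(size=(64, 2)) + 3.0   # toy "dataset" centered at (3, 3)
z = rng.normal(size=(16, 2))            # Step 1: z ~ N(0, I)
fake = toy_generator(z)                 # Step 2: x' = G(z)
moved = drift_update(fake, real)        # Steps 3-5 applied once
```

The pairwise distance matrices are of size batch by batch, which is the quadratic cost raised later in the thread. In actual training the drift would presumably serve as a target for the generator's outputs; here the samples are moved directly to keep the sketch short.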

6
4
101
26K
Ke Li 🍁
Ke Li 🍁@KL_Div·
To clarify, my post was in response to @isskoro's post, which was on the relationship between drifting models and GMMN and why GMMN didn't take off. I agree with you that the idea of normalizing kernels is not new; I was merely pointing out why this difference could explain why GMMN didn't take off.
0
0
9
815
Peyman Milanfar
Peyman Milanfar@docmilanfar·
@KL_Div Normalizing the kernel is standard when you’re computing weighted averages. That’s nothing that distinguishes the approach at all.
1
1
18
3.3K
Ke Li 🍁
Ke Li 🍁@KL_Div·
@jon_barron I think they use minibatches, so n is the batch size, but yes, the computational complexity is quadratic. It's indeed possible to use fast nearest neighbor approximations - in fact we did that in IMLE.
0
0
2
340
Jon Barron
Jon Barron@jon_barron·
@KL_Div Does a normalization requirement mean O(n^2) cost where n is training dataset size? Barring fast nearest neighbor approximations etc.
1
0
0
1.1K
Ke Li 🍁
Ke Li 🍁@KL_Div·
Thanks for pointing out the similarity between drifting and Implicit Maximum Likelihood Estimation! I worked out the mathematical connection - the crux is that drifting fields are similar to the gradient of a soft version of the IMLE loss. So drifting is defined in terms of the gradient, whereas IMLE is defined in terms of the objective, but the behaviour should be similar. It's reminiscent of the Newtonian vs. Lagrangian formulations of classical mechanics from physics. One difference is that in drifting the weights on the positive samples and the negative samples are different, whereas they are the same in IMLE. It'd be interesting to see if the negative weights can be replaced with positive weights.
Hansheng Chen@HanshengCh

Cool new paper by @Goodeat258 and Kaiming's team! arxiv.org/abs/2602.04770 Reminds me of @KL_Div's Implicit Maximum Likelihood Estimation paper
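One way to make the "gradient of a soft version of the IMLE loss" remark concrete is sketched below. The derivation attached to the post is not reproduced in the text, so the soft-min form used here is an assumption: the hard minimum over fake samples in the IMLE loss is replaced with a log-sum-exp at temperature `tau`. The function names are hypothetical.

```python
import numpy as np

def soft_imle_loss(fake, real, tau=1.0):
    """Soft-min relaxation (assumed form) of sum_i min_j ||fake_j - real_i||^2.

    fake: (m, d), real: (n, d). As tau -> 0 this approaches the hard minimum.
    """
    d2 = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)   # (n, m)
    a = -d2 / tau
    a_max = a.max(axis=1, keepdims=True)
    log_sum_exp = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return float((-tau * log_sum_exp).sum())

def soft_imle_grad(fake, real, tau=1.0):
    """Analytic gradient of soft_imle_loss with respect to the fake samples."""
    diff = fake[None, :, :] - real[:, None, :]                   # (n, m, d)
    a = -(diff ** 2).sum(-1) / tau
    a -= a.max(axis=1, keepdims=True)
    w = np.exp(a)
    w /= w.sum(axis=1, keepdims=True)   # softmax over fake samples, per real sample
    # d loss / d fake_j = sum_i w[i, j] * 2 * (fake_j - real_i)
    return 2.0 * (w[:, :, None] * diff).sum(axis=0)              # (m, d)
```

Following the negative gradient pulls each fake sample toward the real samples, with weights that are normalized across the fake samples for each real sample. This sketch has only the attractive term; it does not model the separately weighted negative samples that the post identifies as the difference in drifting.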

10
76
768
180.9K
Deepak Pathak
Deepak Pathak@pathak2206·
At @SkildAI, we’ve raised $1.4B, bringing our valuation to over $14B. We’re on a generational mission, and I’m grateful to be working alongside an exceptional team. Thanks to our investors for the long-term conviction towards omni-bodied intelligence 🚀 bloomberg.com/news/articles/…
Skild AI@SkildAI

Announcing Series C We’ve raised $1.4B, valuing the company at over $14B With this capital, we will accelerate our mission to build omni-bodied intelligence 🚀 skild.ai/blogs/series-c

46
55
632
159.4K
Ke Li 🍁
Ke Li 🍁@KL_Div·
As shown, the approach significantly outperforms recent baselines. 4/5
1
0
2
1K
Ke Li 🍁
Ke Li 🍁@KL_Div·
If you are at #ICCV2025, check out our work on interpolating between two states of a 3D scene with large motion at poster 269 on Tuesday afternoon. It proposes a general-purpose method that can disambiguate points with similar appearance. Website: junrul.github.io/gmc/ 1/5
1
0
12
2.1K