Ke Li 🍁

209 posts

@KL_Div

Assistant Professor of Computing Science @SFU. Ph.D. from @Berkeley_EECS and Bachelor's from @UofTCompSci. Formerly @GoogleAI and Member of @the_IAS.

Vancouver, Canada · Joined June 2019
403 Following · 6.3K Followers
Pinned Tweet
Ke Li 🍁 @KL_Div
Diffusion models turn the data into a mixture of isotropic Gaussians, and so struggle to capture the underlying structure when trained on small datasets. In our new #ECCV2024 paper, we introduce RS-IMLE, a generative model that gets around this issue.
Website: serchirag.github.io/rs-imle
Code: github.com/SerChirag/rs-i…
Joint work w/ @researchirag and @PengShichong
If you are at #ECCV2024, come and check out poster 279 on Thursday afternoon from 4:30pm-6:30pm. (1/6) Thread 👇
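For readers new to this line of work, here is a toy sketch of the plain IMLE objective that RS-IMLE builds on (this is vanilla IMLE, not the rejection-sampling variant from the paper; the setup and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def imle_loss(real, fake):
    """Vanilla IMLE: match each real point to its nearest generated
    sample and average the squared distances. Training would pull
    those nearest fakes toward their real points."""
    # pairwise squared distances, shape (n_real, n_fake)
    d2 = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()   # nearest fake per real point

real = rng.normal(size=(8, 2))
fake = rng.normal(size=(64, 2))
loss = imle_loss(real, fake)
```

Because every real point always gets matched to some sample, the objective cannot ignore rare data points, which is why IMLE-style methods are attractive on small datasets.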
Ke Li 🍁 @KL_Div
@MingyuanZhou Agreed - normalization as an idea is definitely not new (cf. Sinkhorn iterations). I was merely pointing out that this difference relative to GMMN could explain why GMMN didn't take off.
Mingyuan Zhou @MingyuanZhou
@KL_Div In our NeurIPS 2021 paper, we define (approximate) forward and backward conditional transport by row- or column-normalizing the distance matrix between true and fake samples, yielding balanced mode-covering and mode-seeking behaviors.
Ke Li 🍁 @KL_Div
To be fair to the authors, I think the normalization of the kernel is key. If the normalization weren't there, the kernel would not depend on other samples. In that case, the drift would be the same regardless of whether (1) all fake samples are far away from a real sample (which is common at the beginning of training), or (2) one fake sample is much closer to a real sample compared to other fake samples (which is common later on in training). One would want the drift to be large in the former case and small in the latter case. But without normalization, there would be no way to make that happen.
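This point can be illustrated numerically with a toy Gaussian kernel in raw sample space (the bandwidth and setup here are my own, not from the paper):

```python
import numpy as np

def drift_on_first_fake(fake, real, bandwidth=1.0, normalize=True):
    """Toy drift on fake sample 0 toward a single real sample, with
    Gaussian-kernel weights computed over all fake samples."""
    d2 = ((fake - real) ** 2).sum(-1)           # each fake's distance to real
    w = np.exp(-d2 / (2 * bandwidth ** 2))      # unnormalized kernel values
    if normalize:
        w = w / w.sum()                         # weights now sum to 1
    return w[0] * (real - fake[0])

real = np.zeros(2)
# Case 1: all fake samples far from the real sample (early training).
all_far = np.full((4, 2), 5.0)
# Case 2: fake 0 still far, but the other fakes are much closer (late training).
one_near = np.vstack([np.full((1, 2), 5.0), np.full((3, 2), 0.5)])

big = np.linalg.norm(drift_on_first_fake(all_far, real))
small = np.linalg.norm(drift_on_first_fake(one_near, real))
tiny = np.linalg.norm(drift_on_first_fake(all_far, real, normalize=False))
```

With normalization, fake 0 gets a large pull in case 1 and a negligible one in case 2, as desired; without normalization, the pull is negligible in both cases because the raw kernel value only depends on fake 0's own distance.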
Ivan Skorokhodov@isskoro

The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact it can be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in the paper in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (to higher-diversity / higher-resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:
1. Sample random noise z ~ N(0, I).
2. Feed it to the generator and get a fake sample x' = G(z).
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push toward the nearest ones the most).
5. To make sure we don't get any sort of mode collapse, repel each fake sample from the other fake samples via the same scheme.
6. Profit.

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging normalization/scaling into the similarity scores, or CFG.

Why didn't GMMN take off, and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets).

You can surely get informative similarities at batch size 4096 on the object-centric, limited-diversity ImageNet with a ResNet-50 feature encoder, but for something like video generation, we train on hundreds of millions of videos or, at high resolutions and larger model sizes, with a batch size of 1 per GPU (and I'm not sure inter-GPU distance computations will be fast). From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery used to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG, which the authors already did). But I guess what I like most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
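The recipe described above can be sketched in a few lines. This is a toy version in raw sample space rather than an encoder's feature space; the temperature, learning rate, and exact repulsion term are illustrative choices, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(d2, tau=1.0):
    """Row-normalized similarity weights from squared distances."""
    w = np.exp(-d2 / tau)
    return w / w.sum(axis=1, keepdims=True)

def drifting_step(fake, real, lr=0.5, tau=1.0):
    """One toy update: attract each fake sample toward the real batch
    with distance-based softmax weights, and repel it from the other
    fake samples with the same kind of weights."""
    d2r = ((fake[:, None] - real[None]) ** 2).sum(-1)
    attract = softmax_rows(d2r, tau) @ real - fake
    d2f = ((fake[:, None] - fake[None]) ** 2).sum(-1)
    np.fill_diagonal(d2f, np.inf)        # don't repel a sample from itself
    repel = fake - softmax_rows(d2f, tau) @ fake
    return fake + lr * (attract + repel)

real = rng.normal(loc=3.0, size=(16, 2))   # real data clustered around (3, 3)
fake = rng.normal(size=(16, 2))            # fakes start near the origin
x = fake.copy()
for _ in range(30):                        # fakes drift onto the real cluster
    x = drifting_step(x, real)
```

Iterating the step moves the fake batch toward the real samples while the repulsion keeps it spread out, which is the mode-covering behavior the post describes.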

Ke Li 🍁 @KL_Div
To clarify, my post was in response to @isskoro's post, which was on the relationship between drifting models and GMMN and why GMMN didn't take off. I agree with you that the idea of normalizing kernels is not new; I was merely pointing out that this difference could explain why GMMN didn't take off.
Peyman Milanfar @docmilanfar
@KL_Div Normalizing the kernel is standard when you’re computing weighted averages. That’s nothing that distinguishes the approach at all.
Ke Li 🍁 @KL_Div
@jon_barron I think they use minibatches, so n is the batch size, but yes, the computational complexity is quadratic. It's indeed possible to use fast nearest neighbor approximations - in fact we did that in IMLE.
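To make the per-minibatch cost concrete (batch size and feature dimension here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 256, 64                          # minibatch size, feature dim
fake = rng.normal(size=(B, d))
real = rng.normal(size=(B, d))

# Normalizing the kernel needs every fake-real pair within the batch,
# so the distance matrix is (B, B): O(B^2 * d) per step, where B is
# the batch size rather than the full dataset size n.
d2 = ((fake[:, None, :] - real[None, :, :]) ** 2).sum(-1)
```

For moderate batch sizes this pairwise matrix is cheap next to the network's forward/backward pass; approximate nearest-neighbour search only becomes relevant as B grows.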
Jon Barron @jon_barron
@KL_Div Does a normalization requirement mean O(n^2) cost where n is training dataset size? Barring fast nearest neighbor approximations etc.
Ke Li 🍁 @KL_Div
Thanks for pointing out the similarity between drifting and Implicit Maximum Likelihood Estimation! I worked out the mathematical connection - the crux is that drifting fields are similar to the gradient of a soft version of the IMLE loss. So drifting is defined in terms of the gradient, whereas IMLE is defined in terms of the objective, but the behaviour should be similar. It's reminiscent of the Newtonian vs. Lagrangian formulations of classical mechanics from physics. One difference is that in drifting the weights on the positive samples and the negative samples are different, whereas they are the same in IMLE. It'd be interesting to see if the negative weights can be replaced with positive weights.
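One way to see the softened-loss view (my own toy formulation, not the exact math from either paper): per real point, replace the hard min over fake samples with a soft min. The resulting gradient weights then look like a drift field's similarity weights:

```python
import numpy as np

def softmin_weights(d2, tau):
    """Gradient weights of the softened per-real-point loss
    L = -tau * log(sum_j exp(-d2_j / tau)): a softmin over the
    fake samples' squared distances to one real point."""
    w = np.exp(-(d2 - d2.min()) / tau)   # subtract the min for stability
    return w / w.sum()

d2 = np.array([1.0, 4.0, 9.0])           # three fakes' distances to one real
soft = softmin_weights(d2, tau=1.0)      # weight spread over several fakes
hard = softmin_weights(d2, tau=1e-3)     # collapses onto the nearest fake,
                                         # recovering the hard IMLE min
```

As the temperature tau tends to zero, the weights concentrate on the nearest fake sample, so the softened gradient reduces to the gradient of the hard nearest-neighbour IMLE loss.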
Hansheng Chen@HanshengCh

Cool new paper by @Goodeat258 and Kaiming's team! arxiv.org/abs/2602.04770 Reminds me of @KL_Div's Implicit Maximum Likelihood Estimation paper

Deepak Pathak @pathak2206
At @SkildAI, we’ve raised $1.4B, bringing our valuation to over $14B. We’re on a generational mission, and I’m grateful to be working alongside an exceptional team. Thanks to our investors for the long-term conviction towards omni-bodied intelligence 🚀 bloomberg.com/news/articles/…
Skild AI@SkildAI

Announcing Series C We’ve raised $1.4B, valuing the company at over $14B With this capital, we will accelerate our mission to build omni-bodied intelligence 🚀 skild.ai/blogs/series-c

Ke Li 🍁 @KL_Div
As shown, the approach significantly outperforms recent baselines. 4/5
Ke Li 🍁 @KL_Div
If you are at #ICCV2025, check out our work on interpolating between two states of a 3D scene with large motion at poster 269 on Tuesday afternoon. It proposes a general-purpose method that can disambiguate points with similar appearance. Website: junrul.github.io/gmc/ 1/5
Ke Li 🍁 @KL_Div
@janusch_patas You might be interested in this: zvict.github.io/papr/ (essentially the tails of Gaussians cause vanishing gradients, and you can get around it by learning an interpolation kernel)
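The vanishing-tail-gradient point is easy to check numerically (the kernel form and σ here are illustrative, not the parameterization from the linked paper):

```python
import numpy as np

def gaussian_grad_mag(d, sigma=1.0):
    """|d/dd exp(-d^2 / (2 sigma^2))|: the gradient magnitude a point
    receives through a Gaussian at distance d from its center.
    It decays like d * exp(-d^2 / (2 sigma^2)), so it vanishes
    rapidly in the tails."""
    return np.abs(d) / sigma ** 2 * np.exp(-d ** 2 / (2 * sigma ** 2))

near = gaussian_grad_mag(1.0)    # O(1) gradient near the center
far = gaussian_grad_mag(10.0)    # essentially zero in the tail
```

A point ten standard deviations away receives effectively no gradient, which is why a learned interpolation kernel with better-behaved tails can help.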
MrNeRF @janusch_patas
Official Launch of the MrNeRF 3DGS Bounty 2: We're offering 🏆 $1600 + $500 bonus for improving initialization & training without densification for 3D Gaussian Splatting! RT & tag friends who might crush this. Details in thread 👇
Ke Li 🍁 @KL_Div
How can LLMs be made to handle longer contexts efficiently? Most prior methods require retraining - can we do without? In IceFormer, we showed how, by repurposing Prioritized DCI, a nearest neighbour search algorithm. Find out more at @Mao_Yuzhen's oral at the ICML LCFM workshop tomorrow at 12:15pm (longcontextfm.github.io/schedule/). Details are available at yuzhenmao.github.io/IceFormer/.
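The core idea of sparsifying attention with a nearest-neighbour search can be sketched as follows. Brute-force top-k stands in for the actual k-NN index here, and this is only the flavour of the approach, not IceFormer's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_attention(q, K, V, k):
    """Toy sparse attention for one query: softmax over only the k
    keys with the largest scores (found here by brute force; a k-NN
    index would replace this step for long contexts)."""
    scores = K @ q
    top = np.argpartition(-scores, k - 1)[:k]   # indices of the top-k scores
    w = np.exp(scores[top] - scores[top].max()) # stable softmax over top-k
    w /= w.sum()
    return w @ V[top]

n, d = 4096, 32                                 # context length, head dim
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
sparse = topk_attention(q, K, V, k=32)          # looks at 32 of 4096 keys
```

Because softmax weights decay exponentially in the score gap, the few highest-scoring keys dominate the output, which is what makes restricting attention to approximate nearest neighbours a reasonable approximation.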
Ke Li 🍁 retweeted
SFU School of Computing Science
What a day at the Vision & Learning Workshop at #ICML2025! From an incredible lineup of speakers to lightning talks on recent advances in machine learning, researchers shared insights into the future of AI research. A huge thank you to everyone who made it a success!
Bolei Zhou @zhoubolei
I've officially become an Associate Professor with tenure at @UCLA @UCLAengineering as we kick off the new academic year on July 1! Deepest gratitude to my mentors, my amazing students, and wonderful collaborators. Incredible journey so far—more exciting research ahead! 🚀
Ke Li 🍁 retweeted
Felix (Yuxiang) Fu @felix_yuxiang
Interested in how to generate realistic human trajectories using diffusion with just one step 🤔? This is now possible with MoFlow, a one-step Flow Matching method paired with Implicit Maximum Likelihood Estimation-based distillation. 🚀 Join us at #CVPR2025 in Nashville!
Ke Li 🍁 @KL_Div
The case with positive ε is a bit more subtle because ε should be dynamic rather than static. For example, if we consider a decreasing sequence of ε, the middle mode of the trimodal distribution would move towards x_i as ε decreases, and eventually the middle mode and the mode corresponding to x_i would merge. In this analysis, for a given ε, we should actually compare two trimodal distributions: the one that you considered, and another where the middle mode is closer to x_i than in the first distribution. As long as the distance from the middle mode to x_i is greater than ε, the latter would attain a smaller loss than the former. So as ε tends to zero, the middle mode tends to x_i. Whether the empirical data distribution is always globally optimal with ε=0 or a decreasing sequence of ε is a good question. Previously we analyzed the loss in the unconditional ε=0 setting from the perspective of approximating likelihood (akin to how variational inference is justified, namely as a lower bound on the log-likelihood): proceedings.mlr.press/v202/aghabozor…… If you are interested in discussing further, feel free to shoot me an email.
Siddharth Ancha @siddancha
You're absolutely right! I was previously ignoring higher order terms, which are exponentially small in m, and pretending that both losses were equal (i.e. zero). So I was wrong about it being a counterexample for ε=0. But for all ε>0, the tri-modal distribution indeed has a lower loss (≈1) than the bimodal one (≈2), because the lowest order terms are different. Then I wonder: for a well-chosen ε (or simply for ε=0), is the empirical data distribution always globally optimal under the IMLE/C-RS-IMLE loss? Could we prove such a theorem, or do some counterexamples exist? 🤔
Krishan Rana @krshnrana
Are Diffusion and Flow Matching the best generative modelling algorithms for behaviour cloning in robotics?
✅ Multimodality
❌ Fast, Single-Step Inference
❌ Sample Efficient
💡 We introduce IMLE Policy, a novel behaviour cloning approach that can satisfy all the above. 🧵👇