Kevin Li
@curiouskid423
248 posts
multimodal research @adobe | co-lead of firefly image 5 | ex vision research @berkeley_ai | 🇹🇼 | opinions are my own
San Jose, CA · Joined January 2017
646 Following · 222 Followers

Pinned Tweet
Kevin Li @curiouskid423
(1/n) 🚀 Your VLM can be a great multimodal encoder for image editing and generation if you use the middle layers wisely (yes, plural 😉). We are thrilled to present UniFusion - to the best of our knowledge, the first architecture that uses only a VLM as the input-condition encoder, with no auxiliary signals from a VAE or CLIP, and still achieves competitive image-editing quality. Here's what you get with a VLM as your unified encoder:
🎯 zero-shot multi-reference image generation when trained only on single-ref pairs
🎯 cross-task capability transfer -- the editing task helps text-to-image generation qualitatively and quantitatively
🎯 a competitive joint text-to-image and editing model that beats Flux.1 [dev] and Bagel, respectively, with a smaller model and less data
👇 More details in the thread.
2 replies · 14 reposts · 23 likes · 12.2K views
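A rough sketch of what "use the middle layers (plural)" could look like: grab hidden states from a few middle VLM layers, fuse them with learned weights, and project into the generator's conditioning space. The layer choice, dimensions, and softmax fusion below are illustrative assumptions, not UniFusion's published recipe.

```python
import torch
import torch.nn as nn

class MidLayerFusion(nn.Module):
    """Fuse hidden states from several middle VLM layers into one conditioning sequence."""
    def __init__(self, n_layers: int, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))  # learned per-layer weights
        self.proj = nn.Linear(vlm_dim, cond_dim)  # map into the generator's conditioning space

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, seq, vlm_dim) tensor per selected middle layer,
        # e.g. collected from a VLM forward pass with output_hidden_states=True
        stacked = torch.stack(hidden_states)                    # (n_layers, B, S, D)
        w = self.layer_logits.softmax(dim=0).view(-1, 1, 1, 1)  # convex combination over layers
        return self.proj((w * stacked).sum(dim=0))              # (B, S, cond_dim)

# Toy usage: pretend these are layers 12-15 of a 28-layer VLM.
mids = [torch.randn(2, 77, 1024) for _ in range(4)]
cond = MidLayerFusion(n_layers=4, vlm_dim=1024, cond_dim=768)(mids)
print(cond.shape)  # torch.Size([2, 77, 768])
```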
Kevin Li retweeted
Reid Wiseman @astro_reid
There are no words.
7.9K replies · 86.4K reposts · 649.7K likes · 38.4M views
Brian Chao @BrianCChao
This project started with a simple question: why are we still running full attention calculations for background pixels we aren’t even looking at? In many applications, such as interactive gaming and robotics simulation, only select regions require high-resolution generation. Our new work, Foveated Diffusion, brings the biological efficiency of the human visual system to Diffusion Transformers by directly reducing the token count through a perceptually motivated design, adding a new axis to the scaling laws of generative AI. See the full breakdown in @GordonWetzstein's post below:
Gordon Wetzstein @GordonWetzstein

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

6 replies · 13 reposts · 129 likes · 23.2K views
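A toy version of the foveation idea: keep full-resolution patch tokens near a fixation point and average-pool everything else, so attention sees far fewer tokens. The radius, pooling factor, and the fovea/periphery overlap are my simplifications, not the paper's design.

```python
import torch
import torch.nn.functional as F

def foveate_tokens(patches, grid, fix_y, fix_x, radius=4, pool=4):
    # patches: (grid*grid, dim) DiT patch tokens. Keep full-res tokens within
    # `radius` of the fixation; represent the rest by pool x pool averages.
    dim = patches.shape[-1]
    toks = patches.view(grid, grid, dim)
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    near = (ys - fix_y).abs().le(radius) & (xs - fix_x).abs().le(radius)
    fovea = toks[near]                                               # (n_near, dim), full resolution
    coarse = F.avg_pool2d(toks.permute(2, 0, 1).unsqueeze(0), pool)  # (1, dim, grid/pool, grid/pool)
    periphery = coarse.squeeze(0).flatten(1).t()                     # (grid/pool)^2 pooled tokens
    return torch.cat([fovea, periphery])                             # far fewer than grid*grid tokens

# A 64x64 grid (4096 tokens) becomes 81 fine + 256 coarse = 337 tokens.
out = foveate_tokens(torch.randn(64 * 64, 768), 64, fix_y=32, fix_x=32)
print(out.shape)  # torch.Size([337, 768])
```

Since attention cost grows quadratically in token count, a reduction like 4096 → 337 is roughly a 150x saving on the attention term.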
Kevin Li @curiouskid423
@chihyaoma That’s super impressive! Congrats!
0 replies · 0 reposts · 1 like · 115 views
Kevin Chih-Yao Ma @chihyaoma
A new journey but with several old friends and some amazing people. When I joined, image generation was an exceptional but tiny team. In H2 2025, we scaled quickly to ~10–15 people, with many joining in Q4. Just 3 months later, we built MAI-Image-2: top 3, just behind Google and OpenAI. It was our first serious attempt at nearly every part of the stack: new codebase, new data, new evals, new infra, first pretraining, first few post-training runs, etc. What a fun journey to build something from the ground up again!
Microsoft AI @MicrosoftAI

Meet MAI‑Image‑2. Built with creatives, for real creative work. Ranked #5 on @arena’s text‑to‑image leaderboard. Available now: msft.it/6014QUCBe

10 replies · 5 reposts · 68 likes · 9K views
Kevin Li @curiouskid423
@giffmana a bigger model makes human anatomy easier, and better rewriting helps a ton - it keeps you from falling out of the vocabulary distribution even if your dataset & captions lack diversity, which could otherwise lead to artifacts... and ofc adding more image data helps :)
0 replies · 0 reposts · 0 likes · 884 views
Lucas Beyer (bl16) @giffmana
I have a question about last year's image-generation progress, wonder what y'all think. How did we go from all models consistently getting fingers wrong, to all models consistently getting them right? This "flip" seems to have happened basically across all companies/models at the ~same time. Even "random" non-frontier papers seem to get it right? Or they just cherry-pick the figures?
87 replies · 15 reposts · 483 likes · 110.5K views
Kevin Li @curiouskid423
@threebarebears watched yesterday and it was amazing ✨ i love the plot. the rendering of animal fur and lighting… is just impressive!!
0 replies · 0 reposts · 1 like · 1.1K views
Daniel Chong @threebarebears
THANK YOU for loving #Hoppers this weekend! 💞 To celebrate, here's a 2D animation test we did back in 2020 for inspiration (by Lorenzo Fresta) 🦫🦫
296 replies · 6.5K reposts · 53K likes · 1M views
Kevin Li retweeted
Samarth Sinha @_sam_sinha_
Don’t hate on diffusion! But there is more to the world than 2023-level high-aesthetic moody images. That will always be key for many users, but our goal is to think bigger and build the next great frontier, not reinvent the same old (btw our aesthetics are also v good)
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

I hope this is an extinction event for diffusion slop. Diffusion is the greatest nerdsnipe in recent history, so many bright minds led astray by pretty mafs. Well, with new complex attention designs they'll hopefully come back into the fold.

0 replies · 2 reposts · 22 likes · 2.9K views
Peter Tong @TongPetersb
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
36 replies · 220 reposts · 1.1K likes · 212K views
Kevin Li @curiouskid423
@TongPetersb Great work!! It's a bit surprising that both VAEs have so much lower DPG-Bench and GenEval scores... did you enforce the same number of iterations across all encoder ablations, or does the VAE continue to fall short even with more training?
1 reply · 0 reposts · 0 likes · 123 views
Peter Tong @TongPetersb
The first challenge: how do we represent vision? RAE, VAE, or raw pixels? We carefully ablated each option and found that RAE with models like SigLIP 2 excels at both visual understanding and generation, while having minimal impact on text ability. Simplicity wins! [2/9]
3 replies · 4 reposts · 53 likes · 15.6K views
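A minimal sketch of the RAE (representation autoencoder) setup under stated assumptions: the "encoder" below is a stand-in Conv2d patch embedder rather than SigLIP 2 itself, it stays frozen, and only a small linear decoder learns to map patch features back to pixels. Generation would then train a diffusion model in that frozen feature space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, G, P, D = 2, 16, 16, 768                         # batch, 16x16 token grid, 16px patches, feature dim
encoder = nn.Conv2d(3, D, kernel_size=P, stride=P)  # stand-in for a frozen pretrained ViT patch embedder
for p in encoder.parameters():
    p.requires_grad_(False)                         # the representation stays fixed
decoder = nn.Linear(D, P * P * 3)                   # the only trained part: features -> patch pixels

x = torch.randn(B, 3, G * P, G * P)                 # (B, 3, 256, 256) images
with torch.no_grad():
    z = encoder(x).flatten(2).transpose(1, 2)       # (B, 256 tokens, D): the frozen "RAE latents"
patches = decoder(z).view(B, G, G, P, P, 3)         # decode each token back to a 16x16x3 patch
recon = patches.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, G * P, G * P)
loss = F.mse_loss(recon, x)                         # reconstruction objective trains the decoder only
print(loss.item())
```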
Kevin Li retweeted
Stefano Ermon @StefanoErmon
Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.
322 replies · 577 reposts · 4.2K likes · 1M views
Kevin Li retweeted
The Dor Brothers @thedorbrothers
We just made a $200,000,000 AI movie in just one day. Yes, this is 100% AI.
8.5K replies · 8.8K reposts · 59.7K likes · 20.1M views
Kevin Li retweeted
jacob @jsnnsa
I've written 250k+ lines of game engine code. Here's why Genie 3 isn't what people think it is:

World models are something genuinely new. A third category of media we don't have a name for yet.

Near-term they're too slow and expensive for consumers. But for training robots? Incredible. Simulating a million kitchen scenarios is exactly what embodied AI needs.

Medium-term is where it gets interesting. Add sound generation, longer context, more control and you have something Netflix should be terrified of. Imagine exploring Westeros between seasons. Wandering the Stranger Things universe. That's a real product, and it's coming. But that's interactive storytelling.

Gamers play because it's fun to get better at something. Progression systems. Mechanical mastery. Nostalgia, where things work exactly how they always worked. They sink months into a single title. Years. And here's the thing: they mostly don't care about graphics or narrative. Every single one of these motivations sits at the exact weak spot of world models.

Games require determinism. Multiplayer needs every client to agree on physics, every frame. Speedrunners need frame-perfect consistency across thousands of attempts. Competitive play needs rules that don't drift. You can't have ranked when reality is probabilistic. World models are competing with passive media.

Long-term, they'll probably eat the renderer. Generating pixels instead of rasterizing triangles. But game logic, systems, authored constraints? That's a different problem entirely. And one perfectly suited to codegen agents.
129 replies · 93 reposts · 1.1K likes · 183.1K views
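To make the determinism point concrete: lockstep multiplayer relies on every client advancing an identical fixed-timestep simulation from identical inputs, verified with a per-frame state hash. A toy illustration (the integer "physics" and hashing scheme are invented for the example):

```python
import hashlib

def step(state, inputs):
    # Deterministic integer physics: same state + same inputs -> same next state.
    x, v = state
    v += sum(inputs)
    x += v
    return (x, v)

def state_hash(state):
    return hashlib.sha256(repr(state).encode()).hexdigest()[:12]

client_a = client_b = (0, 0)
for frame in range(3):
    inputs = [1, -2]                      # the same input stream reaches both clients
    client_a = step(client_a, inputs)
    client_b = step(client_b, inputs)
    assert state_hash(client_a) == state_hash(client_b)  # per-frame desync check passes
print("clients agree:", client_a)
```

A sampled world model gives each client a different next frame, so this per-frame agreement check, and everything built on it (ranked play, replays, speedruns), breaks down.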
Kevin Li retweeted
Lucas Beyer (bl16) @giffmana
PSA: never, ever write "we use the same learning rate across all methods for fair comparison" I read this as "do not trust any of our conclusions" and then i move on. If learning rate tuning is not mentioned, it takes me a little more time to notice that, but i also move on.
34 replies · 35 reposts · 796 likes · 223.9K views
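The PSA as a protocol: sweep learning rates per method and compare each method at its own best value. A sketch, where `train_and_eval` is a hypothetical stand-in for a full training run returning a validation score (higher is better):

```python
import math

def tune_lr(train_and_eval, method, lrs=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):
    scores = {lr: train_and_eval(method, lr) for lr in lrs}  # one run per LR on the sweep grid
    best_lr = max(scores, key=scores.get)
    return best_lr, scores[best_lr]

def fair_comparison(train_and_eval, methods):
    # Each method gets its own best LR; only then are the scores comparable.
    return {m: tune_lr(train_and_eval, m) for m in methods}

# Toy stand-in: methods have different LR sweet spots, as they usually do.
toy = lambda m, lr: -abs(math.log10(lr) - {"adamw": -3, "sgd": -2}[m])
print(fair_comparison(toy, ["adamw", "sgd"]))
```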
Kevin Li @curiouskid423
Our field has been trying to build unified models while staying stuck with “separate encoders” for so long: a VAE for generation, SigLIP for understanding, an LLM/T5 for text encoding... Semantic representations are more than capable of scaling to T2I generation and even editing, as we've shown in UniFusion from a different angle. Great to see @TongPetersb and @sainingxie's lab continuously pushing in this direction! Really enjoyed reading these works 🚀
Peter Tong @TongPetersb

With RAE, visual understanding and generation operate in the same shared representation space. We show that generative training doesn't hurt understanding, and crucially, this shared space enables the LLM to perform Test-Time Scaling directly in the latent space.

1 reply · 0 reposts · 3 likes · 87 views
Kevin Li retweeted
Junyang Lin @JustinLin610
this year i would like to slow things down a little bit and make it better. invest more on research that might take u to nothing
34 replies · 15 reposts · 385 likes · 28.8K views
Kevin Li @curiouskid423
@jalansh_m @tomieinlove yea bro i dont really look at prices anymore when shopping in TW or ordering food in restaurants 😂
0 replies · 0 reposts · 1 like · 23 views
tomie @tomieinlove
Going to Taiwan after being accustomed to Bay Area prices feels like entering a post-scarcity society. Five dollar Ubers. Dinner for less than what a cup of black coffee would cost in SF. Vintage designer jackets the price of overnight parking in the Mission. It’s absurd.
161 replies · 286 reposts · 8.9K likes · 470.4K views