Kevin Li (@curiouskid423) - Twitter Profile

Pinned Tweet

Kevin Li@curiouskid423·16 Oct

(1/n) 🚀 Your VLM can be a great multimodal encoder for image editing and generation if you use the middle layers wisely (yes, plural 😉). We are thrilled to present UniFusion: to the best of our knowledge, the first architecture that uses only a VLM as the input-condition encoder, without auxiliary signals from a VAE or CLIP, to do image editing with competitive quality. Here's what you get with a VLM as your unified encoder:
🎯 zero-shot multi-reference image generation when trained only on single-ref pairs
🎯 cross-task capability transfer: the editing task helps text-to-image generation qualitatively and quantitatively
🎯 a competitive joint text-to-image and editing model that beats Flux.1 [dev] and Bagel, respectively, with a smaller model and less data
👇 More details in the thread.

Kevin Li retweeted

Reid Wiseman@astro_reid·5 Apr

There are no words.

Kevin Li@curiouskid423·25 Mar

@BrianCChao Congrats Brian!! 🐐

Brian Chao@BrianCChao·25 Mar

This project started with a simple question: why are we still running full attention calculations for background pixels we aren’t even looking at? In many applications, such as interactive gaming and robotics simulation, only selective regions require high-resolution generation. Our new work, Foveated Diffusion, brings the biological efficiency of the human visual system to Diffusion Transformers by directly reducing the token count through a perceptually-motivated design, adding a new axis to the scaling laws of generative AI. See the full breakdown in @GordonWetzstein's post below:

Gordon Wetzstein@GordonWetzstein

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
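
As a rough illustration of the quadratic scaling mentioned in the post above, the sketch below counts the query-key pairs that full self-attention scores at two token counts. It is not the Foveated Diffusion method itself; the helper names (`attention_pairs`, `tokens_for_image`), the image sizes, and the patch size are assumptions chosen only for the arithmetic.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def tokens_for_image(height: int, width: int, patch: int) -> int:
    """Token count when an image (or latent) is split into patch x patch tiles."""
    return (height // patch) * (width // patch)


# Hypothetical sizes, used only to make the scaling concrete.
full = tokens_for_image(1024, 1024, patch=16)    # 64 * 64 tiles
half_res = tokens_for_image(512, 512, patch=16)  # 32 * 32 tiles

# Quartering the token count cuts the attention pairs by 4 * 4 = 16.
ratio = attention_pairs(full) / attention_pairs(half_res)
print(full, half_res, ratio)  # 4096 1024 16.0
```

Under these assumptions, halving the resolution per side cuts the token count by 4x and the attention pairs by 16x, which is why reducing tokens in regions that do not need full resolution can pay off disproportionately.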

Kevin Li@curiouskid423·19 Mar

@chihyaoma That’s super impressive! Congrats!

Kevin Chih-Yao Ma@chihyaoma·19 Mar

A new journey but with several old friends and some amazing people. When I joined, image generation was an exceptional team but tiny. In H2 2025, we scaled quickly to ~10–15 people, with many joining in Q4. Just 3 months later, we built MAI-Image-2 — top 3 just behind Google and OpenAI. It was our first serious attempt at nearly every part of the stack: new codebase, new data, new evals, new infra, first pretraining, first few post-training runs, etc. What a fun journey to build something from the ground up again!

Microsoft AI@MicrosoftAI

Meet MAI‑Image‑2. Built with creatives, for real creative work. Ranked #5 on @arena’s text‑to‑image leaderboard. Available now: msft.it/6014QUCBe

Kevin Li@curiouskid423·18 Mar

Excited to share that UniFusion has been accepted to the ICLR Multimodal Intelligence workshop 🎉 Cross-modality semantic embedding space is the future. See you all in Brazil 🌴 Camera-ready: openreview.net/pdf?id=V9l4tDF…

Kevin Li@curiouskid423·18 Mar

@giffmana a bigger model makes human anatomy easier, and better rewriting helps a ton - it helps you avoid falling out of the vocabulary distribution even if your dataset & captions lack diversity, which could lead to artifacts... and ofc adding more image data helps :)

Lucas Beyer (bl16)@giffmana·17 Mar

I have a question about last year's image-generation progress, wonder what y'all think. How did we go from all models consistently getting fingers wrong, to all models consistently getting them right? This "flip" seems to have happened basically across all companies/models at the ~same time. Even "random" non-frontier papers seem to get it right? Or they just cherry-pick the figures?

Kevin Li@curiouskid423·9 Mar

@threebarebears watched yesterday and it was amazing ✨ i love the plot. the rendering of animal fur and lighting… is just impressive!!

Daniel Chong@threebarebears·9 Mar

THANK YOU for loving #Hoppers this weekend! 💞 To celebrate, here's a 2D animation test we did back in 2020 for inspiration (by Lorenzo Fresta) 🦫🦫

Kevin Li retweeted

Samarth Sinha@_sam_sinha_·8 Mar

Don’t hate on diffusion! But there is more to the world than 2023 level high aesthetic moody images. That will always be key for many users, but our goal is to think bigger and build the next great frontier, not reinvent the same old (btw our aesthetics are also v good)

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

I hope this is an extinction event for diffusion slop
Diffusion is the greatest nerdsnipe in recent history
so many bright minds led astray by pretty mafs
well, with new complex attention designs they'll hopefully come back into the fold

Peter Tong@TongPetersb·4 Mar

Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

Kevin Li@curiouskid423·5 Mar

@TongPetersb Great work!! It's a bit surprising that both VAEs have so much lower DPG Bench and GenEval scores... did you enforce the same number of iterations across all encoder ablations, or does the VAE continue to fall short even with more training?

Peter Tong@TongPetersb·4 Mar

The first challenge: how do we represent vision? RAE, VAE, or raw pixels? We carefully ablated each option and found that RAE with models like SigLIP 2 excels at both visual understanding and generation, while having minimal impact on text ability. Simplicity wins! [2/9]

Kevin Li retweeted

Stefano Ermon@StefanoErmon·24 Feb

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.

Kevin Li retweeted

The Dor Brothers@thedorbrothers·16 Feb

We just made a $200,000,000 AI movie in just one day. Yes, this is 100% AI.

Kevin Li retweeted

jacob@jsnnsa·30 Jan

I've written 250k+ lines of game engine code. Here's why Genie 3 isn't what people think it is:
World models are something genuinely new. A third category of media we don't have a name for yet.
Near-term they're too slow and expensive for consumers. But for training robots? Incredible. Simulating a million kitchen scenarios is exactly what embodied AI needs.
Medium-term is where it gets interesting. Add sound generation, longer context, more control and you have something Netflix should be terrified of. Imagine exploring Westeros between seasons. Wandering the Stranger Things universe. That's a real product, and it's coming.
But that's interactive storytelling. Gamers play because it's fun to get better at something. Progression systems. Mechanical mastery. Nostalgia, where things work exactly how they always worked. They sink months into a single title. Years. And here's the thing: they mostly don't care about graphics or narrative.
Every single one of these motivations sits at the exact weak spot of world models. Games require determinism. Multiplayer needs every client to agree on physics, every frame. Speedrunners need frame-perfect consistency across thousands of attempts. Competitive play needs rules that don't drift. You can't have ranked when reality is probabilistic.
World models are competing with passive media. Long-term, they'll probably eat the renderer. Generating pixels instead of rasterizing triangles. But game logic, systems, authored constraints? That's a different problem entirely. And one perfectly suited to codegen agents.

Kevin Li retweeted

Lucas Beyer (bl16)@giffmana·30 Jan

PSA: never, ever write "we use the same learning rate across all methods for fair comparison" I read this as "do not trust any of our conclusions" and then i move on. If learning rate tuning is not mentioned, it takes me a little more time to notice that, but i also move on.

Kevin Li@curiouskid423·24 Jan

Check out our work UniFusion, where we show great image editing is possible with *only* VLM semantic features! (with plenty of zero-shot capabilities!): x.com/curiouskid423/…

Kevin Li@curiouskid423·24 Jan

Our field has been trying to build unified models while being stuck with “separate encoders” for so long: VAE for generation, SigLIP for understanding, LLM/T5 for text encoding... Semantic representations are more than capable of scaling to T2I generation and even editing, as we've shown in UniFusion from a different angle. Great to see @TongPetersb and @sainingxie's lab continuously pushing in this direction! Really enjoyed reading this work 🚀

Peter Tong@TongPetersb

With RAE, visual understanding and generation operate in the same shared representation space. We show that generative training doesn't hurt understanding, and crucially, this shared space enables the LLM to perform Test-Time Scaling directly in the latent space.

Kevin Li retweeted

Junyang Lin@JustinLin610·17 Jan

this year i would like to slow things down a little bit and make it better. invest more on research that might take u to nothing

Kevin Li@curiouskid423·4 Jan

@jalansh_m @tomieinlove yea bro i dont really look at prices anymore when shopping in TW or ordering food in restaurants 😂

Jalansh Munshi@jalansh_m·2 Jan

@tomieinlove @curiouskid423 is this what you also recently felt? 🙃😆

tomie@tomieinlove·2 Jan

Going to Taiwan after being accustomed to Bay Area prices feels like entering a post-scarcity society. Five dollar Ubers. Dinner for less than what a cup of black coffee would cost in SF. Vintage designer jackets the price of overnight parking in the Mission. It’s absurd.

Kevin Li@curiouskid423·5 Dec

Check out this great work from Elvis!

Elvis Hsieh@elvis_hsieh77

How do we get a robot to use a screwdriver 🪛 and fasten a nut 🔩 ? Introducing DexScrew, our sim-to-real framework that enables a dexterous hand 🖐️ to autonomously perform complex and contact-rich tasks even when we cannot accurately simulate them. We open-source the full-stack implementation: dexscrew.github.io
