Brian Chao

87 posts

@BrianCChao

Ph.D.-ing @Stanford · @NSF Graduate Fellow · I work on spatial computing

Joined August 2021
331 Following · 466 Followers
Pinned Tweet
Brian Chao@BrianCChao·
This project started with a simple question: why are we still running full attention calculations for background pixels we aren’t even looking at? In many applications, such as interactive gaming and robotics simulation, only select regions require high-resolution generation. Our new work, Foveated Diffusion, brings the biological efficiency of the human visual system to Diffusion Transformers by directly reducing the token count through a perceptually-motivated design, adding a new axis to the scaling laws of generative AI. See the full breakdown in @GordonWetzstein's post below:
Gordon Wetzstein@GordonWetzstein

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

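The token-reduction idea in the thread can be sketched with a back-of-the-envelope model (illustrative numbers only, not the paper's actual patch layout): keep full-resolution patch tokens inside a foveal disc and pool each peripheral block into a single token.

```python
import numpy as np

def foveated_token_count(grid=32, fovea_radius=8, pool=2):
    """Tokens remaining when patches inside a foveal disc stay at full
    resolution and peripheral patches are pooled pool x pool. A toy
    model of the token-reduction idea, not the paper's actual design."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    cy = cx = (grid - 1) / 2
    fovea = (ys - cy) ** 2 + (xs - cx) ** 2 <= fovea_radius ** 2
    n_fovea = int(fovea.sum())                 # full-res tokens kept
    n_periph = grid * grid - n_fovea
    # Each pool x pool block of peripheral patches collapses to one token.
    return n_fovea + int(np.ceil(n_periph / pool ** 2))

full_tokens = 32 * 32                  # 1024 tokens at full resolution
mixed_tokens = foveated_token_count()  # 412 with these toy defaults
```

Since DiT self-attention scales quadratically in token count, going from 1024 to roughly 400 tokens corresponds to about a 6x attention-cost reduction in this toy setting.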
Brian Chao@BrianCChao·
yeah, tbh I was not really impressed by the deep research results. I felt like deep research always either finds really obscure papers or too fundamental/general ones. that's the reason I made my own skill so Claude can do literature search based on how *I* think. I haven't used it recently, though. maybe they got better.
Brian Chao@BrianCChao·
Sharing a very simple Claude skill I created for ML literature survey. My experience with existing skills or ML paper search engines is that they don't really capture how researchers *think* when doing literature search. Literature search is not just looking for keywords, but being creative, drawing parallels from different fields, and thinking two or three steps ahead. I iterated this skill with Claude a couple of times to refine it and I am pretty satisfied with its current hit rate. Topics I surveyed include efficient video tokenization, mixed-resolution diffusion / tokenization, etc., and it gave me pretty accurate results and found papers that went under my radar. Hope this is useful! github.com/bchao1/paper-f…
Brian Chao@BrianCChao·
@jetnew_sg I think a good addition would be finding similar concepts in adjacent fields. For example, a lot of current vision model designs are actually inspired heavily by LLM research, so it'd be useful to find parallels in NLP when searching for CV literature.
Jet New@jetnew_sg·
@BrianCChao Thanks for sharing! What remaining gaps do you think it has that you think could be implemented for future directions?
myles@themylesfiles·
Your eyes only see in high resolution in a tiny 2° patch - everything else is blurred and your brain fills in the gaps. This paper by researchers at @Stanford exploits that for diffusion models: render full detail only at the gaze point, downsample the periphery. 2-4x faster generation and users literally can't tell the difference. I built a @marimo_io notebook to try it yourself: move a gaze point around an image and watch how aggressively you can cut peripheral resolution before you notice. molab.marimo.io/notebooks/nb_2…
alphaXiv@askalphaxiv

"Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation" This paper introduces the logic of human vision to diffusion models, where you generate full detail only when the viewer is looking, and becomes low detail in the periphery. With this setup, you can get up to 2x faster image generation and 4x faster video generation with little perceptual drop!

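The periphery-downsampling experiment described in the notebook can be approximated in a few lines (a toy sketch; `foveate`, its defaults, and the nearest-neighbour resampling are all illustrative, not taken from the notebook):

```python
import numpy as np

def foveate(img, gaze, radius=12, factor=4):
    """Keep full resolution inside a disc of `radius` around `gaze`
    (y, x); elsewhere substitute a copy downsampled by `factor` and
    upsampled back with nearest-neighbour. A toy stand-in for the
    notebook's peripheral-downsampling demo."""
    h, w = img.shape
    low = img[::factor, ::factor]                       # downsample
    low_up = np.repeat(np.repeat(low, factor, 0), factor, 1)[:h, :w]
    ys, xs = np.mgrid[0:h, 0:w]
    mask = (ys - gaze[0]) ** 2 + (xs - gaze[1]) ** 2 <= radius ** 2
    return np.where(mask, img, low_up)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
out = foveate(img, gaze=(32, 32))
```

Shrinking `radius` or raising `factor` mimics cutting peripheral resolution more aggressively, which is exactly the knob the notebook lets you play with.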
Mayank Bhaskar@cataluna84·
Congratulations on the awesome release of Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation! Would you or any of your co-authors like to present your paper in the Cohere Labs Computer Vision community? Cohere Labs Community Page: sites.google.com/cohere.com/coh… Here's the playlist of previous talks, if you are interested: youtube.com/playlist?list=…
Gordon Wetzstein@GordonWetzstein·
High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
Brian Chao@BrianCChao·
@lukedneumann there are other works on perception that use similar concepts (i.e., mixed-resolution patches). keywords include adaptive patch sizes for ViTs, foveated ViTs, and a recent paper called AutoGaze.
Luke Neumann@lukedneumann·
@BrianCChao Not synthetic, no. The purpose would be helping with the digestion of an organic dataset of that quality.
Brian Chao@BrianCChao·
@lukedneumann if you are curating a synthetic dataset, yes of course. in fact one of the compelling use cases for this method is generative simulation
Luke Neumann@lukedneumann·
@BrianCChao Would there be a benefit to applying this process at the encoding stage? So essentially baking this efficiency into an ultra high resolution (8K/60fps/HDR) dataset.
Brian Chao@BrianCChao·
codecs compress already-generated images into lossy versions of them. the generation part is still costly because everything is still generated at high resolution. here, we speed up the generation itself, so you can still pass the generated images through modern codecs.
Luke Neumann@lukedneumann·
@BrianCChao I guess my question is this: Modern codecs (H.265/.266) already do this pretty well. Is the breakthrough here more about the translation of this "focal point" downstream?
Brian Chao@BrianCChao·
@deepfates @sameQCU in fact, our method is theoretically compatible with any attention speedup mechanism, since we are directly reducing the number of tokens. you can plug in whatever sparse attention mechanism you'd like to attain further speedup.
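The composability claim follows from a simple cost model (toy numbers; `attention_cost` is an illustration, not a profiler measurement): reducing tokens shrinks the quadratic term, and any sparse-attention scheme multiplies on top of it.

```python
def attention_cost(n_tokens, sparsity=1.0):
    """Relative self-attention cost, modelled as sparsity * n^2.
    A toy cost model for intuition only."""
    return sparsity * n_tokens ** 2

full = attention_cost(1024)                 # dense, full token count
foveated = attention_cost(512)              # halve tokens: 4x cheaper
both = attention_cost(512, sparsity=0.25)   # + sparse attention: 16x
```

The two savings are orthogonal factors in this model, which is why token reduction and sparse attention can be stacked.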
🎭@deepfates·
@sameQCU okay good I won't delete but I will update my mental model
Brian Chao@BrianCChao·
thank you!! interactive video was definitely a use case we were targeting — there are also so many more applications, like generative simulation in robotics! imagine a distilled autoregressive foveated video diffusion model where only the robotic arm and manipulated object are generated at high resolution 🦾
Anton Obukhov@AntonObukhov1·
Generative foveating, so cool. This could be the solution to interactive video generation in VR - low-quality content outside the fovea region won't even be noticeable. Too bad Meta is already pulling the plug on VR.
Gordon Wetzstein@GordonWetzstein

Brian Chao@BrianCChao·
i think this would be an interesting idea. some closely related works include “matryoshka models” where the text/images are generated at different levels of detail (arxiv.org/abs/2405.17430). we do have to think about how humans perceive text. gaze is a natural spatial signal for visual data, but for text it’s hard to know which part of the passage the reader will focus on in advance.
Brian Chao@BrianCChao·
@curiouskid423 thank you!! great discussions on VAEs on the rooftop btw. Why aren’t there variable-length VAEs yet??
Brian Chao@BrianCChao·
I think IPE features or other PE flavors that encode scale information will help training immensely. Another fundamental question, however, is: once we have the mixed-resolution latents, how can we decode them back to pixel space? This involves some upsampling of the low-res parts (either in pixel space or latent space) in the final stage, since the most widely used VAEs only work with regular grids. I don't have a solution for that yet, but some ideas include adapting the VAE to handle variable-length sequences, or applying Foveated Diffusion directly to pixel-space diffusion so that no decoding is required.
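One way to picture the decoding problem: the mixed-resolution latents have to be put back onto a regular grid before a standard grid-based VAE can decode them. A minimal sketch (all names, shapes, and the nearest-neighbour upsampling are hypothetical stand-ins, not the paper's design):

```python
import numpy as np

def latents_to_grid(high, low, fovea_mask, factor=2):
    """Reassemble a regular latent grid from mixed-resolution tokens.

    high: (H, W, C) full-res latents, valid where fovea_mask is True
    low:  (H//factor, W//factor, C) peripheral latents
    Nearest-neighbour upsamples the low-res part so that a standard
    grid-based VAE could decode the result."""
    low_up = np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)
    return np.where(fovea_mask[..., None], high, low_up)

high = np.arange(16, dtype=float).reshape(4, 4, 1)
low = np.full((2, 2, 1), -1.0)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True            # top-left quadrant is the "fovea"
grid = latents_to_grid(high, low, mask)
```

A variable-length VAE would make this regridding step unnecessary, which is the alternative direction mentioned above.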
Brian Chao@BrianCChao·
Re handling mixed-resolution RoPE: without the RoPE subsampling for low-resolution tokens, the generated images are trashy (I'll refer to this paper: arxiv.org/abs/2511.19778 for more details). With subsampling we are finetuning from a starting point of much higher quality (which we refer to as the "naive mixed-resolution" baseline in the paper). Re communications between frequencies: exactly as you said, the asymmetry between low-res and high-res RoPE already implicitly encodes the scale information. This was surprising to us when we first saw the results, as we initially thought we'd need some flag (like the attenuation you mentioned) to indicate which tokens are low-res. I do think that explicitly adding a flag like you suggested would make the DiT learn way faster, though.
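The RoPE-subsampling idea can be illustrated with a minimal 1-D RoPE, under the assumption that a 2x-downsampled token reuses the position of the full-res grid point it covers (`rope_angles` is a generic sketch, not the paper's implementation): the low-res token's rotary angles then coincide with those of the matching full-res position, so scales stay aligned across resolutions.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary angles for 1-D RoPE: one row per position, one column
    per frequency (standard RoPE frequency schedule)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

full_pos = np.arange(8)   # positions of full-res tokens
low_pos = full_pos[::2]   # 2x-downsampled tokens reuse every other
                          # full-res position (RoPE subsampling)
full_angles = rope_angles(full_pos)
low_angles = rope_angles(low_pos)
```

Without this subsampling, low-res tokens would get consecutive positions (0, 1, 2, ...) and their angles would no longer line up with their high-res neighbours, which matches the quality gap described above.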
Brian Chao@BrianCChao·
It was so amazing working with @YarivLior!! Our eyes process wide field-of-view imagery with mind-boggling efficiency, largely due to the fact that peripheral information is never perceived at the highest resolution and is only used for global context. Why can't machines do the same? We also had a lot of fun making the interactive demos on our project website. Try it out yourself: bchao1.github.io/foveated-diffu…
Lior Yariv@YarivLior

Why pay full compute for pixels you're not even looking at? In our new work, Foveated Diffusion, we introduce a new concept for efficient image and video generation, motivated by how the human visual system works. (See full thread below)

Lior Yariv@YarivLior·
Why pay full compute for pixels you're not even looking at? In our new work, Foveated Diffusion, we introduce a new concept for efficient image and video generation, motivated by how the human visual system works. (See full thread below)
Gordon Wetzstein@GordonWetzstein

Brian Chao@BrianCChao·
Thanks Jon! This is a great question. We did something related to what you alluded to, where we subsample key tokens for low-resolution query tokens in attention calculation and fine-tuned the model so that the DiT can learn to handle multi-resolution RoPE. This solves the content scale mismatch between resolutions, but tiny border artifacts are still observable. We only used soft-blending in the final compositing stage, but I’d say that’s more of a post-processing fix because we never trained with soft masks. Filtering RoPE is very interesting, and recent works have also shown that manipulating high-frequency RoPE components actually don’t change the content a lot, so that’s definitely worth investigating. I think yet another direction is to rethink the design of VAEs so that they can decode variable-length, mixed-resolution sequences directly instead of having to blend mixed-resolution crops in pixel space like we did in this paper.
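The soft-blending compositing mentioned here can be sketched as an alpha blend with a feathered radial mask (a hypothetical post-processing stand-in, not the paper's exact compositing):

```python
import numpy as np

def soft_blend(fovea, periphery, center, radius, feather=4):
    """Alpha-blend a full-res foveal image over the periphery using a
    feathered radial mask: alpha is 1 inside the disc, 0 far outside,
    and ramps linearly over `feather` pixels in between."""
    h, w = fovea.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
    alpha = np.clip((radius + feather - dist) / feather, 0.0, 1.0)
    return alpha * fovea + (1 - alpha) * periphery

fovea = np.ones((32, 32))       # stand-in for the full-res crop
periphery = np.zeros((32, 32))  # stand-in for the upsampled periphery
blended = soft_blend(fovea, periphery, center=(16, 16), radius=5)
```

Widening `feather` trades a softer transition for a larger band of mixed-resolution content, which is the same tension as the Gaussian-mask suggestion in the reply below this one.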
Jon Barron@jon_barron·
Looks cool! Do you think there's value in having the mask being Gaussian instead of a hard disc? It seems like this would let you analytically prefilter the RoPEs (I think) which would downweight high frequencies instead of either dropping or keeping them, which might make the discontinuity between the foveated and non-foveated regions less visible.