Brian Chao

87 posts

@BrianCChao

Ph.D.-ing @Stanford · @NSF Graduate Fellow · I work on spatial computing

Joined August 2021
331 Following · 466 Followers
Pinned Tweet
Brian Chao@BrianCChao·
This project started with a simple question: why are we still running full attention calculations for background pixels we aren’t even looking at? In many applications, such as interactive gaming and robotics simulation, only select regions require high-resolution generation. Our new work, Foveated Diffusion, brings the biological efficiency of the human visual system to Diffusion Transformers by directly reducing the token count through a perceptually-motivated design, adding a new axis to the scaling laws of generative AI. See the full breakdown in @GordonWetzstein's post below:
Gordon Wetzstein@GordonWetzstein

High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵

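The token-reduction idea in the thread can be sketched with a back-of-the-envelope model (illustrative numbers only, not the paper's actual patch layout): keep full-resolution patch tokens inside a foveal disc and pool each peripheral block into a single token.

```python
import numpy as np

def foveated_token_count(grid=32, fovea_radius=8, pool=2):
    """Tokens remaining when patches inside a foveal disc stay at full
    resolution and peripheral patches are pooled pool x pool. A toy
    model of the token-reduction idea, not the paper's actual design."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    cy = cx = (grid - 1) / 2
    fovea = (ys - cy) ** 2 + (xs - cx) ** 2 <= fovea_radius ** 2
    n_fovea = int(fovea.sum())                 # full-res tokens kept
    n_periph = grid * grid - n_fovea
    # Each pool x pool block of peripheral patches collapses to one token.
    return n_fovea + int(np.ceil(n_periph / pool ** 2))

full_tokens = 32 * 32                  # 1024 tokens at full resolution
mixed_tokens = foveated_token_count()  # 412 with these toy defaults
```

Since DiT self-attention scales quadratically in token count, going from 1024 to roughly 400 tokens corresponds to about a 6x attention-cost reduction in this toy setting.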
Brian Chao@BrianCChao·
yeah, tbh I was not really impressed by the deep research results. I felt like deep research always either finds really obscure papers or too fundamental/general ones. that's the reason I made my own skill so Claude can do literature search based on how *I* think. I haven't used it recently, though. maybe they got better.
Brian Chao@BrianCChao·
Sharing a very simple Claude skill I created for ML literature survey. My experience with existing skills or ML paper search engines is that they don't really capture how researchers *think* when doing literature search. Literature search is not just looking for keywords, but being creative, drawing parallels from different fields, and thinking two or three steps ahead. I iterated this skill with Claude a couple of times to refine it and I am pretty satisfied with its current hit rate. Topics I surveyed include efficient video tokenization, mixed-resolution diffusion / tokenization, etc., and it gave me pretty accurate results and found papers that went under my radar. Hope this is useful! github.com/bchao1/paper-f…
Brian Chao@BrianCChao·
@jetnew_sg I think a good addition would be finding similar concepts in adjacent fields. For example, a lot of current vision model designs are actually inspired heavily by LLM research, so it'd be useful to find parallels in NLP when searching for CV literature.
Jet New@jetnew_sg·
@BrianCChao Thanks for sharing! What remaining gaps do you think it has that you think could be implemented for future directions?
myles@themylesfiles·
Your eyes only see in high resolution in a tiny 2° patch - everything else is blurred and your brain fills in the gaps. This paper by researchers at @Stanford exploits that for diffusion models: render full detail only at the gaze point, downsample the periphery. 2-4x faster generation and users literally can't tell the difference. I built a @marimo_io notebook to try it yourself: move a gaze point around an image and watch how aggressively you can cut peripheral resolution before you notice. molab.marimo.io/notebooks/nb_2…
alphaXiv@askalphaxiv

"Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation" This paper introduces the logic of human vision to diffusion models, where you generate full detail only when the viewer is looking, and becomes low detail in the periphery. With this setup, you can get up to 2x faster image generation and 4x faster video generation with little perceptual drop!

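The periphery-downsampling experiment described in the notebook can be approximated in a few lines (a toy sketch; `foveate`, its defaults, and the nearest-neighbour resampling are all illustrative, not taken from the notebook):

```python
import numpy as np

def foveate(img, gaze, radius=12, factor=4):
    """Keep full resolution inside a disc of `radius` around `gaze`
    (y, x); elsewhere substitute a copy downsampled by `factor` and
    upsampled back with nearest-neighbour. A toy stand-in for the
    notebook's peripheral-downsampling demo."""
    h, w = img.shape
    low = img[::factor, ::factor]                       # downsample
    low_up = np.repeat(np.repeat(low, factor, 0), factor, 1)[:h, :w]
    ys, xs = np.mgrid[0:h, 0:w]
    mask = (ys - gaze[0]) ** 2 + (xs - gaze[1]) ** 2 <= radius ** 2
    return np.where(mask, img, low_up)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
out = foveate(img, gaze=(32, 32))
```

Shrinking `radius` or raising `factor` mimics cutting peripheral resolution more aggressively, which is exactly the knob the notebook lets you play with.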
Mayank Bhaskar@cataluna84·
Congratulations on the awesome release of Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation! Would you or any of your co-authors like to present your paper in the Cohere Labs Computer Vision community? Cohere Labs Community Page: sites.google.com/cohere.com/coh… Here's the playlist of previous talks, if you are interested: youtube.com/playlist?list=…
Gordon Wetzstein@GordonWetzstein·
High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
Brian Chao@BrianCChao·
@lukedneumann there are other works on perception that use similar concepts (i.e., mixed-resolution patches). keywords include adaptive patch sizes for ViTs, foveated ViTs, and a recent paper called AutoGaze.
Luke Neumann@lukedneumann·
@BrianCChao Not synthetic, no. The purpose would be helping with the digestion of an organic dataset of that quality.
Brian Chao@BrianCChao·
@lukedneumann if you are curating a synthetic dataset, yes of course. in fact one of the compelling use cases for this method is generative simulation
Luke Neumann@lukedneumann·
@BrianCChao Would there be a benefit to applying this process at the encoding stage? So essentially baking this efficiency into an ultra high resolution (8K/60fps/HDR) dataset.
Brian Chao@BrianCChao·
codecs compress already-generated images into lossy versions of them. the generation part is still costly because everything is still generated at high resolution. here, we speed up the generation itself, so you can still pass the generated images through modern codecs.
Luke Neumann@lukedneumann·
@BrianCChao I guess my question is this: Modern codecs (H.265/.266) already do this pretty well. Is the breakthrough here more about the translation of this "focal point" downstream?
Brian Chao@BrianCChao·
@deepfates @sameQCU in fact, our method is theoretically compatible with any attention speedup mechanism, since we are directly reducing the number of tokens. you can plug in whatever sparse attention mechanism you'd like to attain further speedup.
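The composability claim follows from a simple cost model (toy numbers; `attention_cost` is an illustration, not a profiler measurement): reducing tokens shrinks the quadratic term, and any sparse-attention scheme multiplies on top of it.

```python
def attention_cost(n_tokens, sparsity=1.0):
    """Relative self-attention cost, modelled as sparsity * n^2.
    A toy cost model for intuition only."""
    return sparsity * n_tokens ** 2

full = attention_cost(1024)                 # dense, full token count
foveated = attention_cost(512)              # halve tokens: 4x cheaper
both = attention_cost(512, sparsity=0.25)   # + sparse attention: 16x
```

The two savings are orthogonal factors in this model, which is why token reduction and sparse attention can be stacked.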
🎭@deepfates·
@sameQCU okay good I won't delete but I will update my mental model
Brian Chao@BrianCChao·
thank you!! interactive video was definitely a use case we were targeting — there are also so many more applications, like generative simulation in robotics! imagine a distilled autoregressive foveated video diffusion model where only the robotic arm and manipulated object are generated at high resolution 🦾
Anton Obukhov@AntonObukhov1·
Generative foveating, so cool. This could be the solution to interactive video generation in VR - low-quality content outside the fovea region won't even be noticeable. Too bad Meta is already pulling the plug on VR.
Gordon Wetzstein@GordonWetzstein

Brian Chao@BrianCChao·
i think this would be an interesting idea. some closely related works include “matryoshka models” where the text/images are generated at different levels of detail (arxiv.org/abs/2405.17430). we do have to think about how humans perceive text. gaze is a natural spatial signal for visual data, but for text it’s hard to know which part of the passage the reader will focus on in advance.
Brian Chao@BrianCChao·
@curiouskid423 thank you!! great discussions on VAEs on the rooftop btw. Why aren’t there variable-length VAEs yet??
Brian Chao@BrianCChao·
I think IPE features or other PE flavors that encode scale information will help training immensely. Another fundamental question, however, is: once we have the mixed-resolution latents, how can we decode them back to pixel space? This involves some upsampling of the low-res parts (either in pixel space or latent space) in the final stage, since the most widely used VAEs only work with regular grids. I don't have a solution for that yet, but some ideas include adapting the VAE to handle variable-length sequences, or applying Foveated Diffusion directly to pixel-space diffusion so that no decoding is required.
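One way to picture the decoding problem: the mixed-resolution latents have to be put back onto a regular grid before a standard grid-based VAE can decode them. A minimal sketch (all names, shapes, and the nearest-neighbour upsampling are hypothetical stand-ins, not the paper's design):

```python
import numpy as np

def latents_to_grid(high, low, fovea_mask, factor=2):
    """Reassemble a regular latent grid from mixed-resolution tokens.

    high: (H, W, C) full-res latents, valid where fovea_mask is True
    low:  (H//factor, W//factor, C) peripheral latents
    Nearest-neighbour upsamples the low-res part so that a standard
    grid-based VAE could decode the result."""
    low_up = np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)
    return np.where(fovea_mask[..., None], high, low_up)

high = np.arange(16, dtype=float).reshape(4, 4, 1)
low = np.full((2, 2, 1), -1.0)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True            # top-left quadrant is the "fovea"
grid = latents_to_grid(high, low, mask)
```

A variable-length VAE would make this regridding step unnecessary, which is the alternative direction mentioned above.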
Brian Chao@BrianCChao·
Re handling mixed-resolution RoPE: without the RoPE subsampling for low-resolution tokens, the generated images are trashy (I'll refer to this paper: arxiv.org/abs/2511.19778 for more details). With subsampling we are finetuning from a starting point of much higher quality (which we refer to as the "naive mixed-resolution" baseline in the paper). Re communications between frequencies: exactly as you said, the asymmetry between low-res and high-res RoPE already implicitly encodes the scale information. This was surprising to us when we first saw the results, as we initially thought we'd need some flag (like the attenuation you mentioned) to indicate which tokens are low-res. I do think that explicitly adding a flag like you suggested would make the DiT learn way faster, though.
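The RoPE-subsampling idea can be illustrated with a minimal 1-D RoPE, under the assumption that a 2x-downsampled token reuses the position of the full-res grid point it covers (`rope_angles` is a generic sketch, not the paper's implementation): the low-res token's rotary angles then coincide with those of the matching full-res position, so scales stay aligned across resolutions.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary angles for 1-D RoPE: one row per position, one column
    per frequency (standard RoPE frequency schedule)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

full_pos = np.arange(8)   # positions of full-res tokens
low_pos = full_pos[::2]   # 2x-downsampled tokens reuse every other
                          # full-res position (RoPE subsampling)
full_angles = rope_angles(full_pos)
low_angles = rope_angles(low_pos)
```

Without this subsampling, low-res tokens would get consecutive positions (0, 1, 2, ...) and their angles would no longer line up with their high-res neighbours, which matches the quality gap described above.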
Brian Chao@BrianCChao·
It was so amazing working with @YarivLior!! Our eyes process wide field-of-view imagery with mind-boggling efficiency, largely due to the fact that peripheral information is never perceived at the highest resolution and is only used for global context. Why can't machines do the same? We also had a lot of fun making the interactive demos on our project website. Try it out yourself: bchao1.github.io/foveated-diffu…
Lior Yariv@YarivLior

Why pay full compute for pixels you're not even looking at? In our new work, Foveated Diffusion, we introduce a new concept for efficient image and video generation, motivated by how the human visual system works. (See full thread below)

Lior Yariv@YarivLior·
Why pay full compute for pixels you're not even looking at? In our new work, Foveated Diffusion, we introduce a new concept for efficient image and video generation, motivated by how the human visual system works. (See full thread below)
Gordon Wetzstein@GordonWetzstein

Brian Chao@BrianCChao·
Thanks Jon! This is a great question. We did something related to what you alluded to, where we subsample key tokens for low-resolution query tokens in attention calculation and fine-tuned the model so that the DiT can learn to handle multi-resolution RoPE. This solves the content scale mismatch between resolutions, but tiny border artifacts are still observable. We only used soft-blending in the final compositing stage, but I’d say that’s more of a post-processing fix because we never trained with soft masks. Filtering RoPE is very interesting, and recent works have also shown that manipulating high-frequency RoPE components actually don’t change the content a lot, so that’s definitely worth investigating. I think yet another direction is to rethink the design of VAEs so that they can decode variable-length, mixed-resolution sequences directly instead of having to blend mixed-resolution crops in pixel space like we did in this paper.
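The soft-blending compositing mentioned here can be sketched as an alpha blend with a feathered radial mask (a hypothetical post-processing stand-in, not the paper's exact compositing):

```python
import numpy as np

def soft_blend(fovea, periphery, center, radius, feather=4):
    """Alpha-blend a full-res foveal image over the periphery using a
    feathered radial mask: alpha is 1 inside the disc, 0 far outside,
    and ramps linearly over `feather` pixels in between."""
    h, w = fovea.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
    alpha = np.clip((radius + feather - dist) / feather, 0.0, 1.0)
    return alpha * fovea + (1 - alpha) * periphery

fovea = np.ones((32, 32))       # stand-in for the full-res crop
periphery = np.zeros((32, 32))  # stand-in for the upsampled periphery
blended = soft_blend(fovea, periphery, center=(16, 16), radius=5)
```

Widening `feather` trades a softer transition for a larger band of mixed-resolution content, which is the same tension as the Gaussian-mask suggestion in the reply below this one.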
Jon Barron@jon_barron·
Looks cool! Do you think there's value in having the mask being Gaussian instead of a hard disc? It seems like this would let you analytically prefilter the RoPEs (I think) which would downweight high frequencies instead of either dropping or keeping them, which might make the discontinuity between the foveated and non-foveated regions less visible.