Xingyu Fu

171 posts

@XingyuFu2

Postdoctoral Fellow @PrincetonPLI | PhD @Penn @cogcomp | Focused on Vision+Language | Previously: @MSFTResearch, @AmazonScience | B.S. @UofIllinois | ⛳️😺

Princeton, NJ · Joined September 2020
832 Following · 1.6K Followers
Xingyu Fu reposted
Xi Ye @xiye_nlp:
We propose a new decoding algorithm, DySCO🪩 (Dynamic Attention Scaling), directly improving long-context reasoning without training. At each decoding step, we dynamically identify and upweight attention to important context for the next token. 📈20% gains on multiple tasks.
[GIF]
3 replies · 23 reposts · 82 likes · 6.5K views
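A minimal sketch of the decoding idea as described in the tweet: at each step, identify context positions that look important for the next token and upweight attention to them, with no training. The importance heuristic and all names below are assumptions for illustration, not DySCO's actual algorithm.

```python
# Hedged sketch of training-free dynamic attention scaling at decode time.
# The "importance" heuristic (positions already receiving high attention
# mass) is an assumption, not the paper's method.
import torch
import torch.nn.functional as F

def dysco_attention(q, k, v, scale_factor=2.0, top_frac=0.1):
    """q: (1, d) current query; k, v: (seq, d) context. Returns (1, d)."""
    d = q.shape[-1]
    scores = (q @ k.T) / d**0.5                    # (1, seq) attention logits
    probs = F.softmax(scores, dim=-1)
    # Pick the top fraction of context positions as "important".
    k_top = max(1, int(top_frac * k.shape[0]))
    top_idx = probs.topk(k_top, dim=-1).indices
    # Dynamically upweight those positions' logits, then renormalize.
    boosted = scores.clone()
    boosted[0, top_idx[0]] += torch.log(torch.tensor(scale_factor))
    return F.softmax(boosted, dim=-1) @ v
```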
Xingyu Fu reposted
Peter Tong @TongPetersb:
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
[image]
34 replies · 222 reposts · 1.1K likes · 207.3K views
Xingyu Fu reposted
Alex Weers @a_weers:
Let’s continue with the next paper
[image]
7 replies · 36 reposts · 435 likes · 21.9K views
Xingyu Fu reposted
Bangzheng Li @BangzhengL:
🚨 Tokens are the surface. Attention is the mechanism. What if RL learned the latter?

💡 Introducing Reinforced Attention Learning.

🧠 Key idea: current RL for LLMs shapes "what is generated". We instead optimize "where the model focuses", to influence how it actually reasons.

📰 Paper: arxiv.org/pdf/2602.04884 (with @Google and @GoogleDeepMind)

Standard PPO/GRPO perform importance sampling over token distributions, using advantages to up- or down-weight token probabilities.

⏳ The flip: instead of "what token to generate", we optimize "where to attend". At each generation step, we measure the divergence between attention distributions of the current and old policy, across all previous tokens.
- High-reward samples → keep attention close
- Low-reward samples → push attention away
The policy learns where to allocate computation, not just which surface-level token to generate.

⚙️ Plug-and-play: our attention-level objective pairs with any policy gradient method, including PPO and GRPO.

📊 Results: RAL > vanilla GRPO. We're seeing consistent accuracy gains across a wide range of image and video QA benchmarks when experimenting on multimodal LLMs.

✨ One more thing: attention can be learned from a teacher model via on-policy attention distillation, going beyond knowledge transfer to inherit latent attention behaviors. This on-policy attention distillation takes standard on-policy distillation to the next level.

Thanks to all my colleagues: @jianmo_ni @Chen_Qu1 @ianmiao @liuyang_irnlp @XingyuFu2 @infolaber 🙏
[2 images]
7 replies · 57 reposts · 274 likes · 23.6K views
Xingyu Fu @XingyuFu2:
The idea is simple: tokens show what the model outputs; attention reveals how it reasons. We argue RL should optimize the latter. The experimental results turn out to be surprisingly effective: up to +6.2% over standard GRPO and up to +4.2% over standard on-policy distillation! Kudos to our lead @BangzhengL! Check out our new paper: arxiv.org/abs/2602.04884!
Bangzheng Li @BangzhengL: [quoting the thread above]
0 replies · 8 reposts · 64 likes · 6.6K views
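Reading the thread, the core objective appears to be an advantage-weighted divergence between the current and old policy's attention maps. Below is a minimal sketch under my own assumptions (KL as the divergence, one scalar advantage per sampled trajectory); the paper's exact formulation may differ.

```python
# Hedged sketch of an attention-level RL objective: keep attention close to
# the old policy on high-reward samples, push it away on low-reward ones.
# Shapes and the KL choice are assumptions, not the paper's definition.
import torch

def attention_ral_loss(attn_new, attn_old, advantage, eps=1e-8):
    """attn_new/attn_old: (steps, context) per-step attention distributions
    over all previous tokens; advantage: scalar for this sample."""
    # KL(old || new) at each generation step, summed over the context axis.
    kl = (attn_old * ((attn_old + eps).log() - (attn_new + eps).log())).sum(-1)
    # advantage > 0: minimizing the loss shrinks the divergence (stay close).
    # advantage < 0: the sign flip pushes attention away from the old policy.
    return (advantage * kl).mean()
```

This pairs with a standard policy-gradient loss (PPO/GRPO) as an additional term, which matches the thread's "plug-and-play" claim.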
Xingyu Fu reposted
Wenhao Chai @wenhaocha1:
2026 will be the year unified models move from the lab into real, practical use. We believe real-world applications will demand multimodal outputs (just look at how obsessed people are with nano-banana-style experiences). I want to highlight two key design aspects of this project:
1. Detailed rubrics are essential for rigorous evaluation. In some cases, we even need per-sample rubrics (see the sketch after this post). This may also provide useful guidance for RL training on open-ended tasks.
2. Bringing reasoning into multimodal generation is absolutely critical. We've seen a lot of work focusing on reasoning for multimodal perception, but what about generation? First, user instructions are never perfectly precise; reasoning helps uncover the user's true intent. Second, reasoning can reduce the difficulty of generation through step-by-step decomposition and reflection, which becomes very similar to patterns we already see in purely language-based reasoning.
This project is led by Bo Li and also David, with creative guidance from Zhuang and Xingyu.
Zhuang Liu @liuzhuang1234: [quoting the UEval announcement reposted below]
2 replies · 1 repost · 33 likes · 5.3K views
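To make the per-sample-rubric point concrete, here is a toy scorer, entirely my own construction (UEval's real rubrics, criteria, and judge are not shown in the thread): each sample carries its own weighted checklist, and a judge's binary verdicts are aggregated into a 0-100 score.

```python
# Toy per-sample rubric scoring. Criteria and weights below are hypothetical.
def rubric_score(judgments, rubric):
    """judgments: {criterion: bool, e.g. from a judge model};
    rubric: {criterion: weight}, specific to this one sample."""
    total = sum(rubric.values())
    earned = sum(w for c, w in rubric.items() if judgments.get(c, False))
    return 100.0 * earned / total

sample_rubric = {                               # hypothetical criteria
    "image matches requested scene": 3,
    "text references the image": 2,
    "instruction constraints satisfied": 5,
}
print(rubric_score({"image matches requested scene": True,
                    "instruction constraints satisfied": True},
                   sample_rubric))               # -> 80.0
```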
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
How good are unified models at generating images AND text together? We built UEval to find out. Results: GPT-5-Thinking scores only 66.4/100. Best open-source model (Emu3.5): 49.1. Introducing UEval: A Benchmark for Unified Multimodal Generation.
[image]
2 replies · 11 reposts · 58 likes · 9.9K views
Xingyu Fu reposted
Yushi Hu @huyushi98:
Reward models make or break post-training for multimodal omni models (e.g., nano banana), yet there’s surprisingly little research on that‼️ We’re releasing MMRB2: new reward benchmark focusing on omni models, spanning T2I, editing, interleaved, and thinking with images 🧵1/n
[image]
7 replies · 42 reposts · 156 likes · 33.7K views
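For context, reward benchmarks of this kind are typically scored by pairwise preference accuracy: does the reward model rank the preferred output above the rejected one? A generic sketch follows; the data format and `reward_model` callable are placeholders, not MMRB2's actual interface.

```python
# Generic pairwise preference accuracy, the usual reward-benchmark metric.
def preference_accuracy(pairs, reward_model):
    """pairs: list of (prompt, chosen, rejected) examples;
    reward_model: callable (prompt, response) -> float score."""
    correct = 0
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, chosen) > reward_model(prompt, rejected):
            correct += 1
    return correct / len(pairs)
```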
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
Stronger Normalization-Free Transformers – new paper. We introduce Derf (Dynamic erf), a simple point-wise layer that lets norm-free Transformers not only work, but actually outperform their normalized counterparts.
[image]
19 replies · 176 reposts · 1.1K likes · 164.7K views
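The tweet doesn't spell out Derf's form, but by analogy with the same lab's earlier Dynamic tanh (DyT) layer, a point-wise erf layer could plausibly look like the sketch below: an elementwise erf with learnable scale, gain, and shift, placed where LayerNorm would sit. Treat the parameterization as a guess, not the paper's definition.

```python
# Hedged guess at a "Dynamic erf" point-wise layer (DyT-style, erf swapped
# in). The actual Derf parameterization is in the paper, not this tweet.
import torch
import torch.nn as nn

class Derf(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable output gain
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable output shift

    def forward(self, x):
        # Strictly point-wise: no statistics computed across tokens or
        # channels, unlike LayerNorm, hence "normalization-free".
        return self.gamma * torch.erf(self.alpha * x) + self.beta
```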
Anusha K @aiwithanu:
@XingyuFu2 is there any way to apply without a PhD?
1 reply · 0 reposts · 0 likes · 80 views
Xingyu Fu reposted
Yu Feng @AnnieFeng6:
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that.

VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT step into first-order logic and finds the exact premise it's built on. This gives us two superpowers:
🤖 Automated proof: solvers can automatically verify if the logic is valid.
🧑‍🔬 Human-readable audits: natural-language premises let you pinpoint ungrounded leaps or fallacies.

Best of all, all of these can be used as signals to learn more verifiable models! To our knowledge, VeriCoT is the first neuro-symbolic validator of CoT traces in non-math/code domains.

📄 Paper: arxiv.org/pdf/2511.04662
[image]
2 replies · 12 reposts · 27 likes · 6.6K views
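A toy illustration of the solver-verification step described above, using Z3: a CoT step is entailed iff the premises together with the step's negation are unsatisfiable. The formalization here is hand-written and propositional for brevity; VeriCoT automates the extraction and works in first-order logic.

```python
# Toy solver check in the spirit of VeriCoT: premises ⊢ step
# iff (premises ∧ ¬step) is unsatisfiable. Example content is invented.
from z3 import Bools, Implies, And, Not, Solver, unsat

rains, wet, slippery = Bools("rains wet slippery")
premises = [Implies(rains, wet),      # "if it rains, the ground gets wet"
            Implies(wet, slippery),   # "wet ground is slippery"
            rains]                    # "it is raining"
step = slippery                       # CoT step: "so the ground is slippery"

s = Solver()
s.add(And(*premises), Not(step))
print("step is entailed" if s.check() == unsat else "ungrounded leap")
```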
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
Excited to share our lab's first open-source release: LLM-Distillation-JAX. It supports practical knowledge distillation configurations (distillation strength, temperature, top-k/top-p), is built on MaxText, and is designed for reproducible JAX/Flax training on both TPUs and GPUs.
[image]
4 replies · 29 reposts · 224 likes · 20.6K views
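The tweet doesn't link API docs, so here is a generic sketch of the loss such a distillation setup typically exposes, using the two knobs mentioned (strength and temperature); function and argument names are illustrative, not the repo's actual interface.

```python
# Generic JAX knowledge-distillation loss: temperature-scaled soft targets
# mixed with hard-label cross-entropy by a strength knob. Names are
# illustrative assumptions, not LLM-Distillation-JAX's API.
import jax.numpy as jnp
from jax.nn import log_softmax, softmax

def distill_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, strength=0.5):
    t = temperature
    teacher_p = softmax(teacher_logits / t, axis=-1)
    student_lp = log_softmax(student_logits / t, axis=-1)
    # Soft-target cross-entropy (equals KL up to a constant in the student);
    # t*t restores gradient scale, as in standard Hinton-style distillation.
    kd = -(teacher_p * student_lp).sum(-1).mean() * t * t
    # Hard-label cross-entropy on the true next tokens.
    ce = -jnp.take_along_axis(log_softmax(student_logits, axis=-1),
                              labels[..., None], axis=-1).mean()
    return strength * kd + (1.0 - strength) * ce
```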
Xingyu Fu reposted
Gabriel Sarch @GabrielSarch:
Life update: I recently defended my PhD at CMU and started as a postdoctoral fellow at Princeton! Grateful to my advisors and all who supported me, and excited for this next chapter :)
[3 images]
49 replies · 42 reposts · 1.3K likes · 61.5K views
Xingyu Fu reposted
Yushi Hu @huyushi98:
We are hiring a PhD research intern (summer 2026) at Meta FAIR to work on frontier multimodal generation models! Apply here: metacareers.com/jobs/245366641… Feel free to DM me if you have any questions!
4 replies · 40 reposts · 313 likes · 26.1K views
Xingyu Fu reposted
Danqi Chen @danqi_chen:
I am going to present two papers at #COLM2025 tomorrow from 4:30-6:30pm, as none of our leading authors can attend due to visa issues. Haven't done poster presentations for years 🤣🤣 .... so I will do my best! #76: LongProc #80: Goedel-Prover v1
[2 images]
Chi Jin @chijinML:

Our Goedel-Prover V1 will be presented at COLM 2025 in Montreal this Wednesday afternoon! I won’t be there in person, but my amazing and renowned colleague @danqi_chen will be around to help with the poster — feel free to stop by!

4 replies · 25 reposts · 349 likes · 48.7K views
Xingyu Fu reposted
Wenhao Chai @wenhaocha1:
Introducing VideoNSA.

We started working on compression for video-LMMs back in 2023: MovieChat focuses on inter-frame compression, while AuroraCap focuses on intra-frame compression. After the emergence of NSA, we realized that the manually set heuristics we relied on should be replaced with learnable and dynamic operations. Another benefit is that sparse attention is not really about discarding information; it's more about selective activation, which avoids the "one slip and you lose everything" problem of compression-only approaches.

The development of video-LMMs has been a spiral of efficiency and performance improvements. For academia, trying to scale data as the path toward performance is clearly a dead end. What we do is demonstrate that something works under a given budget and recipe. Unlike LM architecture research, where you can directly compare loss at 1B vs. 100B scale, current video-LMM research almost always has to start from a pretrained LLM/VLM in order to produce meaningful metrics on benchmarks (somewhat similar to coding?). But introducing language (or pretrained vision) priors makes the experiments uncontrollable. One idea is that perhaps we need to abandon knowledge-heavy backbones and benchmarks and return to GPT-2-level models and toy (but controlled) benchmarks. A good example of this is Kevin's paper VLog.

We are not presenting something "new," nor are we fabricating so-called novelty. Instead, in the paper we try to present the results and observations as completely as possible.

Finally, long-context models will soon (or perhaps already do) face conflicts between context length and the pretrained weights themselves (let's formulate it this way for now). This manifests in phenomena such as VLMs failing to recognize the number of fingers, coding models not using the latest APIs, and chat (even reasoning) models failing at selective copying. Beyond efficiency, this could become another multi-year battle.
Enxin Song @EnxinSong:

Token compression causes irreversible information loss in video understanding. 🤔 What can we do with sparse attention? We introduce VideoNSA, a hardware-aware and learnable hybrid sparse attention mechanism that scales to 128K context length.

0 replies · 12 reposts · 99 likes · 11K views
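To illustrate the "selective activation" framing in the thread: each query attends only to its top-k scoring keys, so unselected context is merely deactivated for this step rather than irreversibly compressed away. This sketch is generic top-k sparse attention, not VideoNSA's learnable, hardware-aware design.

```python
# Generic top-k sparse attention: selective activation, not compression.
# VideoNSA's actual mechanism is learnable and hardware-aware; this is only
# an illustration of the underlying idea.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=64):
    """q: (nq, d) queries; k, v: (nk, d) keys/values."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5        # (nq, nk) attention logits
    keep = min(topk, k.shape[0])
    idx = scores.topk(keep, dim=-1).indices        # per-query selected keys
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)                    # 0 where kept, -inf elsewhere
    # Unselected keys get zero weight this step, but remain in the KV cache
    # and can be re-selected later: nothing is permanently discarded.
    return F.softmax(scores + mask, dim=-1) @ v
```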