Xingyu Fu

171 posts

@XingyuFu2

Postdoctoral Fellow @PrincetonPLI | PhD @Penn @cogcomp | Focused on Vision+Language | Previously: @MSFTResearch, @AmazonScience | B.S. @UofIllinois | ⛳️😺

Princeton, NJ · Joined September 2020
832 Following · 1.6K Followers
Xingyu Fu reposted
Xi Ye @xiye_nlp:
We propose a new decoding algorithm, DySCO🪩 (Dynamic Attention Scaling), directly improving long-context reasoning without training. At each decoding step, we dynamically identify and upweight attention to important context for the next token. 📈20% gains on multiple tasks.
[GIF]
3 replies · 23 reposts · 82 likes · 6.5K views
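A minimal sketch of the decoding idea as described in the tweet: at each step, identify context positions that look important for the next token and upweight attention to them, with no training. The importance heuristic and all names below are assumptions for illustration, not DySCO's actual algorithm.

```python
# Hedged sketch of training-free dynamic attention scaling at decode time.
# The "importance" heuristic (positions already receiving high attention
# mass) is an assumption, not the paper's method.
import torch
import torch.nn.functional as F

def dysco_attention(q, k, v, scale_factor=2.0, top_frac=0.1):
    """q: (1, d) current query; k, v: (seq, d) context. Returns (1, d)."""
    d = q.shape[-1]
    scores = (q @ k.T) / d**0.5                    # (1, seq) attention logits
    probs = F.softmax(scores, dim=-1)
    # Pick the top fraction of context positions as "important".
    k_top = max(1, int(top_frac * k.shape[0]))
    top_idx = probs.topk(k_top, dim=-1).indices
    # Dynamically upweight those positions' logits, then renormalize.
    boosted = scores.clone()
    boosted[0, top_idx[0]] += torch.log(torch.tensor(scale_factor))
    return F.softmax(boosted, dim=-1) @ v
```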
Xingyu Fu reposted
Peter Tong @TongPetersb:
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
[image]
34 replies · 222 reposts · 1.1K likes · 207.3K views
Xingyu Fu reposted
Alex Weers @a_weers:
Let’s continue with the next paper
[image]
7 replies · 36 reposts · 435 likes · 21.9K views
Xingyu Fu reposted
Bangzheng Li @BangzhengL:
🚨 Tokens are the surface. Attention is the mechanism. What if RL learned the latter?

💡 Introducing Reinforced Attention Learning.

🧠 Key idea: current RL for LLMs shapes "what is generated". We instead optimize "where the model focuses", to influence how it actually reasons.

📰 Paper: arxiv.org/pdf/2602.04884 (with @Google and @GoogleDeepMind)

Standard PPO/GRPO perform importance sampling over token distributions, using advantages to up- or down-weight token probabilities.

⏳ The flip: instead of "what token to generate", we optimize "where to attend". At each generation step, we measure the divergence between attention distributions of the current and old policy, across all previous tokens.
- High-reward samples → keep attention close
- Low-reward samples → push attention away
The policy learns where to allocate computation, not just which surface-level token to generate.

⚙️ Plug-and-play: our attention-level objective pairs with any policy gradient method, including PPO and GRPO.

📊 Results: RAL > vanilla GRPO. We're seeing consistent accuracy gains across a wide range of image and video QA benchmarks when experimenting on multimodal LLMs.

✨ One more thing: attention can be learned from a teacher model via on-policy attention distillation, going beyond knowledge transfer to inherit latent attention behaviors. This on-policy attention distillation takes standard on-policy distillation to the next level.

Thanks to all my colleagues: @jianmo_ni @Chen_Qu1 @ianmiao @liuyang_irnlp @XingyuFu2 @infolaber 🙏
[2 images]
7 replies · 57 reposts · 274 likes · 23.6K views
Xingyu Fu @XingyuFu2:
The idea is simple: tokens show what the model outputs; attention reveals how it reasons. We argue RL should optimize the latter. The experimental results turn out to be surprisingly effective: up to +6.2% over standard GRPO and up to +4.2% over standard on-policy distillation! Kudos to our lead @BangzhengL! Check out our new paper: arxiv.org/abs/2602.04884!
Bangzheng Li @BangzhengL: [quoting the thread above]
0 replies · 8 reposts · 64 likes · 6.6K views
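Reading the thread, the core objective appears to be an advantage-weighted divergence between the current and old policy's attention maps. Below is a minimal sketch under my own assumptions (KL as the divergence, one scalar advantage per sampled trajectory); the paper's exact formulation may differ.

```python
# Hedged sketch of an attention-level RL objective: keep attention close to
# the old policy on high-reward samples, push it away on low-reward ones.
# Shapes and the KL choice are assumptions, not the paper's definition.
import torch

def attention_ral_loss(attn_new, attn_old, advantage, eps=1e-8):
    """attn_new/attn_old: (steps, context) per-step attention distributions
    over all previous tokens; advantage: scalar for this sample."""
    # KL(old || new) at each generation step, summed over the context axis.
    kl = (attn_old * ((attn_old + eps).log() - (attn_new + eps).log())).sum(-1)
    # advantage > 0: minimizing the loss shrinks the divergence (stay close).
    # advantage < 0: the sign flip pushes attention away from the old policy.
    return (advantage * kl).mean()
```

This pairs with a standard policy-gradient loss (PPO/GRPO) as an additional term, which matches the thread's "plug-and-play" claim.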
Xingyu Fu reposted
Wenhao Chai @wenhaocha1:
2026 will be the year unified models move from the lab into real, practical use. We believe real-world applications will demand multimodal outputs (just look at how obsessed people are with nano-banana-style experiences). I want to highlight two key design aspects of this project:
1. Detailed rubrics are essential for rigorous evaluation. In some cases, we even need per-sample rubrics (see the sketch after this post). This may also provide useful guidance for RL training on open-ended tasks.
2. Bringing reasoning into multimodal generation is absolutely critical. We've seen a lot of work focusing on reasoning for multimodal perception, but what about generation? First, user instructions are never perfectly precise; reasoning helps uncover the user's true intent. Second, reasoning can reduce the difficulty of generation through step-by-step decomposition and reflection, which becomes very similar to patterns we already see in purely language-based reasoning.
This project is led by Bo Li and also David, with creative guidance from Zhuang and Xingyu.
Zhuang Liu @liuzhuang1234: [quoting the UEval announcement reposted below]
2 replies · 1 repost · 33 likes · 5.3K views
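To make the per-sample-rubric point concrete, here is a toy scorer, entirely my own construction (UEval's real rubrics, criteria, and judge are not shown in the thread): each sample carries its own weighted checklist, and a judge's binary verdicts are aggregated into a 0-100 score.

```python
# Toy per-sample rubric scoring. Criteria and weights below are hypothetical.
def rubric_score(judgments, rubric):
    """judgments: {criterion: bool, e.g. from a judge model};
    rubric: {criterion: weight}, specific to this one sample."""
    total = sum(rubric.values())
    earned = sum(w for c, w in rubric.items() if judgments.get(c, False))
    return 100.0 * earned / total

sample_rubric = {                               # hypothetical criteria
    "image matches requested scene": 3,
    "text references the image": 2,
    "instruction constraints satisfied": 5,
}
print(rubric_score({"image matches requested scene": True,
                    "instruction constraints satisfied": True},
                   sample_rubric))               # -> 80.0
```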
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
How good are unified models at generating images AND text together? We built UEval to find out. Results: GPT-5-Thinking scores only 66.4/100. Best open-source model (Emu3.5): 49.1. Introducing UEval: A Benchmark for Unified Multimodal Generation.
[image]
2 replies · 11 reposts · 58 likes · 9.9K views
Xingyu Fu reposted
Yushi Hu @huyushi98:
Reward models make or break post-training for multimodal omni models (e.g., nano banana), yet there’s surprisingly little research on that‼️ We’re releasing MMRB2: new reward benchmark focusing on omni models, spanning T2I, editing, interleaved, and thinking with images 🧵1/n
[image]
7 replies · 42 reposts · 156 likes · 33.7K views
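For context, reward benchmarks of this kind are typically scored by pairwise preference accuracy: does the reward model rank the preferred output above the rejected one? A generic sketch follows; the data format and `reward_model` callable are placeholders, not MMRB2's actual interface.

```python
# Generic pairwise preference accuracy, the usual reward-benchmark metric.
def preference_accuracy(pairs, reward_model):
    """pairs: list of (prompt, chosen, rejected) examples;
    reward_model: callable (prompt, response) -> float score."""
    correct = 0
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, chosen) > reward_model(prompt, rejected):
            correct += 1
    return correct / len(pairs)
```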
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
Stronger Normalization-Free Transformers – new paper. We introduce Derf (Dynamic erf), a simple point-wise layer that lets norm-free Transformers not only work, but actually outperform their normalized counterparts.
[image]
19 replies · 176 reposts · 1.1K likes · 164.7K views
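The tweet doesn't spell out Derf's form, but by analogy with the same lab's earlier Dynamic tanh (DyT) layer, a point-wise erf layer could plausibly look like the sketch below: an elementwise erf with learnable scale, gain, and shift, placed where LayerNorm would sit. Treat the parameterization as a guess, not the paper's definition.

```python
# Hedged guess at a "Dynamic erf" point-wise layer (DyT-style, erf swapped
# in). The actual Derf parameterization is in the paper, not this tweet.
import torch
import torch.nn as nn

class Derf(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable output gain
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable output shift

    def forward(self, x):
        # Strictly point-wise: no statistics computed across tokens or
        # channels, unlike LayerNorm, hence "normalization-free".
        return self.gamma * torch.erf(self.alpha * x) + self.beta
```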
Anusha K @aiwithanu:
@XingyuFu2 is there any way to apply without a PhD?
1 reply · 0 reposts · 0 likes · 80 views
Xingyu Fu reposted
Yu Feng @AnnieFeng6:
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that.

VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT step into first-order logic and finds the exact premise it's built on. This gives us two superpowers:
🤖 Automated proof: solvers can automatically verify if the logic is valid.
🧑‍🔬 Human-readable audits: natural-language premises let you pinpoint ungrounded leaps or fallacies.

Best of all, all of these can be used as signals to learn more verifiable models! To our knowledge, VeriCoT is the first neuro-symbolic validator of CoT traces in non-math/code domains.

📄 Paper: arxiv.org/pdf/2511.04662
[image]
2 replies · 12 reposts · 27 likes · 6.6K views
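A toy illustration of the solver-verification step described above, using Z3: a CoT step is entailed iff the premises together with the step's negation are unsatisfiable. The formalization here is hand-written and propositional for brevity; VeriCoT automates the extraction and works in first-order logic.

```python
# Toy solver check in the spirit of VeriCoT: premises ⊢ step
# iff (premises ∧ ¬step) is unsatisfiable. Example content is invented.
from z3 import Bools, Implies, And, Not, Solver, unsat

rains, wet, slippery = Bools("rains wet slippery")
premises = [Implies(rains, wet),      # "if it rains, the ground gets wet"
            Implies(wet, slippery),   # "wet ground is slippery"
            rains]                    # "it is raining"
step = slippery                       # CoT step: "so the ground is slippery"

s = Solver()
s.add(And(*premises), Not(step))
print("step is entailed" if s.check() == unsat else "ungrounded leap")
```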
Xingyu Fu reposted
Zhuang Liu @liuzhuang1234:
Excited to share our lab's first open-source release: LLM-Distillation-JAX. It supports practical knowledge distillation configurations (distillation strength, temperature, top-k/top-p), is built on MaxText, and is designed for reproducible JAX/Flax training on both TPUs and GPUs.
[image]
4 replies · 29 reposts · 224 likes · 20.6K views
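The tweet doesn't link API docs, so here is a generic sketch of the loss such a distillation setup typically exposes, using the two knobs mentioned (strength and temperature); function and argument names are illustrative, not the repo's actual interface.

```python
# Generic JAX knowledge-distillation loss: temperature-scaled soft targets
# mixed with hard-label cross-entropy by a strength knob. Names are
# illustrative assumptions, not LLM-Distillation-JAX's API.
import jax.numpy as jnp
from jax.nn import log_softmax, softmax

def distill_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, strength=0.5):
    t = temperature
    teacher_p = softmax(teacher_logits / t, axis=-1)
    student_lp = log_softmax(student_logits / t, axis=-1)
    # Soft-target cross-entropy (equals KL up to a constant in the student);
    # t*t restores gradient scale, as in standard Hinton-style distillation.
    kd = -(teacher_p * student_lp).sum(-1).mean() * t * t
    # Hard-label cross-entropy on the true next tokens.
    ce = -jnp.take_along_axis(log_softmax(student_logits, axis=-1),
                              labels[..., None], axis=-1).mean()
    return strength * kd + (1.0 - strength) * ce
```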
Xingyu Fu reposted
Gabriel Sarch @GabrielSarch:
Life update: I recently defended my PhD at CMU and started as a postdoctoral fellow at Princeton! Grateful to my advisors and all who supported me, and excited for this next chapter :)
[3 images]
49 replies · 42 reposts · 1.3K likes · 61.5K views
Xingyu Fu reposted
Yushi Hu @huyushi98:
We are hiring a PhD research intern (summer 2026) at Meta FAIR to work on frontier multimodal generation models! Apply here: metacareers.com/jobs/245366641… Feel free to DM me if you have any questions!
4 replies · 40 reposts · 313 likes · 26.1K views
Xingyu Fu reposted
Danqi Chen @danqi_chen:
I am going to present two papers at #COLM2025 tomorrow from 4:30-6:30pm, as none of our leading authors can attend due to visa issues. Haven't done poster presentations for years 🤣🤣 .... so I will do my best! #76: LongProc #80: Goedel-Prover v1
[2 images]
Chi Jin @chijinML:

Our Goedel-Prover V1 will be presented at COLM 2025 in Montreal this Wednesday afternoon! I won’t be there in person, but my amazing and renowned colleague @danqi_chen will be around to help with the poster — feel free to stop by!

4 replies · 25 reposts · 349 likes · 48.7K views
Xingyu Fu reposted
Wenhao Chai @wenhaocha1:
Introducing VideoNSA.

We started working on compression for video-LMMs back in 2023: MovieChat focuses on inter-frame compression, while AuroraCap focuses on intra-frame compression. After the emergence of NSA, we realized that the manually set heuristics we relied on should be replaced with learnable and dynamic operations. Another benefit is that sparse attention is not really about discarding information; it's more about selective activation, which avoids the "one slip and you lose everything" problem of compression-only approaches.

The development of video-LMMs has been a spiral of efficiency and performance improvements. For academia, trying to scale data as the path toward performance is clearly a dead end. What we do is demonstrate that something works under a given budget and recipe. Unlike LM architecture research, where you can directly compare loss at 1B vs. 100B scale, current video-LMM research almost always has to start from a pretrained LLM/VLM in order to produce meaningful metrics on benchmarks (somewhat similar to coding?). But introducing language (or pretrained vision) priors makes the experiments uncontrollable. One idea is that perhaps we need to abandon knowledge-heavy backbones and benchmarks and return to GPT-2-level models and toy (but controlled) benchmarks. A good example of this is Kevin's paper VLog.

We are not presenting something "new," nor are we fabricating so-called novelty. Instead, in the paper we try to present the results and observations as completely as possible.

Finally, long-context models will soon (or perhaps already do) face conflicts between context length and the pretrained weights themselves (let's formulate it this way for now). This manifests in phenomena such as VLMs failing to recognize the number of fingers, coding models not using the latest APIs, and chat (even reasoning) models failing at selective copying. Beyond efficiency, this could become another multi-year battle.
Enxin Song @EnxinSong:

Token compression causes irreversible information loss in video understanding. 🤔 What can we do with sparse attention? We introduce VideoNSA, a hardware-aware and learnable hybrid sparse attention mechanism that scales to 128K context length.

0 replies · 12 reposts · 99 likes · 11K views
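To illustrate the "selective activation" framing in the thread: each query attends only to its top-k scoring keys, so unselected context is merely deactivated for this step rather than irreversibly compressed away. This sketch is generic top-k sparse attention, not VideoNSA's learnable, hardware-aware design.

```python
# Generic top-k sparse attention: selective activation, not compression.
# VideoNSA's actual mechanism is learnable and hardware-aware; this is only
# an illustration of the underlying idea.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=64):
    """q: (nq, d) queries; k, v: (nk, d) keys/values."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5        # (nq, nk) attention logits
    keep = min(topk, k.shape[0])
    idx = scores.topk(keep, dim=-1).indices        # per-query selected keys
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)                    # 0 where kept, -inf elsewhere
    # Unselected keys get zero weight this step, but remain in the KV cache
    # and can be re-selected later: nothing is permanently discarded.
    return F.softmax(scores + mask, dim=-1) @ v
```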