Yafu Li

148 posts

@yafuly

Postdoc @ The Chinese University of Hong Kong; Reasoning, trustworthy AI, multilinguality; Sharing random thoughts and research experiences.

Shanghai, China · Joined May 2023
80 Following · 435 Followers
Yafu Li @yafuly
Excited to have 6 papers accepted to #ICLR2026, all around reasoning, RL, and multimodal understanding:
📌 ExGRPO: Learning to Reason from Prior Successes
📌 Diversity-Incentivized Exploration for Versatile Reasoning
📌 Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models
📌 Spotlight on Token Perception for Multimodal RL
📌 Revisual-R1: Advancing Multimodal Reasoning from Optimized Cold Start to Staged RL
📌 FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
💻 All works are open-sourced — welcome discussions, feedback, and collaborations! Huge thanks to all collaborators. Looking forward to great discussions at ICLR! @iclr_conf #iclr
Yafu Li @yafuly
DiffThinker doesn’t chain images via multi-turn calls—the “chaining” is the diffusion denoising trajectory itself: fixed-step image-to-image generation that refines noise into a single solution image (Sec. 3.2, Eq. 8). On data, we study this in Sec. 4.3 (Ablation on Training Data Scale) and Fig. 8: with ~100 samples the model mainly learns rendering syntax; reasoning performance keeps improving with more data, and around 1e5 samples reaches >90%/>80% on hard Maze/Sudoku. Main experiments use ~30k samples per task (see Table 2).
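The fixed-step image-to-image process described above can be sketched in miniature (a toy illustration only, assuming a generic conditioned denoiser; `fixed_step_image_to_image`, `toy_denoiser`, and the schedule are hypothetical stand-ins, not the paper's actual implementation):

```python
import numpy as np

def fixed_step_image_to_image(problem_image, denoiser, num_steps=50, seed=0):
    """Run a fixed number of denoising steps, each conditioned on the problem
    image, so inference cost is constant regardless of problem difficulty."""
    rng = np.random.default_rng(seed)
    # Start from pure noise in image space: the "solution canvas".
    x = rng.standard_normal(problem_image.shape)
    for step in range(num_steps):
        t = 1.0 - step / num_steps          # remaining noise level: 1 -> ~0
        x = denoiser(x, problem_image, t)   # one conditioned refinement step
    return x

def toy_denoiser(x, problem, t):
    # Stand-in for a trained diffusion model: pull the canvas toward a
    # "solution" derived from the problem (here, pretend solving = inverting
    # the pixels), with step size growing as the noise level t shrinks.
    solution = 1.0 - problem
    return x + (1.0 - t) * 0.5 * (solution - x)

problem = np.zeros((4, 4))
out = fixed_step_image_to_image(problem, toy_denoiser, num_steps=50)
```

The point of the sketch is only the shape of the computation: no multi-turn tool calls, just a fixed-length denoising trajectory that turns noise into a single solution image.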
Tyler Zhu @tyleryzhu
@yafuly I skimmed the paper but didn’t really understand the exact contribution/framework. How is the chaining of images/generations happening, and is there a sense of how much training is needed to exhibit these capabilities (i.e., less than the 10-30k examples used)? Thanks!
Yafu Li @yafuly
We introduce DiffThinker, a new paradigm for generative multimodal reasoning with diffusion models. Instead of relying on text-centric chain-of-thought, DiffThinker reformulates reasoning as a native image-to-image generation process. This shift enables:
• Higher logical consistency and spatial precision in long-horizon visual tasks
• Controllable and stable inference cost
• Native parallel reasoning over multiple solution candidates
• Effective collaboration with MLLMs, outperforming either alone
Diffusion models can reason directly in visual space.
AK @_akhaliq

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models huggingface.co/papers/2512.24…

Yafu Li @yafuly
@RahulRaghav_10 All MLLM baselines we evaluate (e.g., GPT and Qwen) are strong VQA-capable models. We include detailed results and comparisons in the paper.
Rahul Raghav @RahulRaghav_10
@yafuly curious how this compares to current visual question answering models
Yafu Li @yafuly
@Web3Gen0 Thank you for the kind words. Treating reasoning as a native image-to-image generative process turned out to be a surprisingly natural fit, especially for spatial and structural constraints. Looking forward to your thoughts.
Web3Gen0.eth @Web3Gen0
@yafuly Really exciting work, I love how this reframes reasoning away from text-centric assumptions. Treating reasoning as a native image-to-image process feels intuitive and overdue. I will definitely read the paper and share my thoughts. Thank you for pushing this direction forward...
Yafu Li @yafuly
Just got a new "MBTI" from ChatGPT 😅🤣
Yafu Li reposted
DailyPapers @HuggingPapers
Tired of expensive manual annotation for video understanding models? VideoSSR introduces self-supervised reinforcement learning for MLLMs to master video understanding, creating high-quality data from videos themselves! Achieves >5% avg improvement on 17 benchmarks.
Yafu Li reposted
Jiawei Gu @Kuvvius
🚨Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph — thinking in modalities, not just with them. And what we found was... unexpected. 👀 Emergent intelligence, strong gains, and …🫣 🧵 arxiv.org/abs/2510.27492 (1/16)
Yafu Li reposted
Yu Zhang 🐙🌘 @yzhang_cs
We’re also shipping fla-core in lock-step with flash-linear-attention: a minimal, forever-in-sync companion package that carries nothing except triton+torch. Need only fused Norm, CausalConv, and linear-attn kernels, without transformers worries? fla-core is enough. pypi.org/project/fla-co…
Songlin Yang @SonglinYang4

Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.

Yafu Li @yafuly
🧵 Some thoughts on OpenAI’s new open-source series, gpt-oss-safeguard: reasoning over user policies before answering makes LLMs cleverer and safer.

1️⃣ Today OpenAI released their second open-source model, gpt-oss-safeguard, following the oss series. 🔗 openai.com/index/introduc…
The key idea behind these safeguards is profound: instead of instilling static safety preferences into model parameters, the model learns to reason over dynamic safety boundaries defined by user-provided policies. This means the model doesn’t just remember what’s safe — it thinks through policies at inference time. Such policy-aware reasoning turns safety alignment into a test-time scaling problem: more inference compute → better evaluation of whether outputs adhere to evolving safety rules.

2️⃣ This direction strongly resonates with our new work released last month:
🎓 “Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation”
We extend the same philosophy beyond safety, introducing the broader challenge of Specification Alignment — ensuring models follow both behavioral and safety policies (or specifications) across diverse real-world scenarios, from child storytelling to biochemical procedure instruction.

3️⃣ While gpt-oss-safeguard focuses on reasoning to justify whether model behavior conforms to safety policies, our work provides the complementary testbed and methodology to evaluate and enhance models’ reasoning over such specifications. Together, they form a more complete view of reasoning-based alignment. At the time, we used GPT-5 as our judgment model for policy compliance; with today’s release, we plan to integrate oss-safeguard as a stronger, open-weight alternative.

4️⃣ To systematically study this problem, we built three components:
a. SpecBench — a comprehensive testbed spanning 5 realistic scenarios (Child, Code, Health, Biochem, Travel), each with its own behavioral and safety specifications (103 in total, 1.5k prompts).
b. Evaluation metrics and methodology — we introduce the Specification Alignment Rate (SAR) to jointly measure helpfulness and harmlessness, ensuring helpful outputs only count when they remain within safety boundaries.
c. Test-time Deliberation (ALIGN³) — a lightweight reasoning framework that lets LLMs spend time “thinking over” safety and behavioral boundaries before answering, improving alignment without retraining.

5️⃣ 📊 Results: We evaluated 33 models (18 instruct + 15 reasoning) across open- and closed-source families. Findings:
• Reasoning models systematically outperform instruct variants on SpecBench.
• Test-time deliberation further enhances alignment — e.g., on Qwen3-14B, SAR ↑11.9% with minimal extra tokens (<2k).
• ALIGN³ pushes the safety–helpfulness frontier, showing that reflection and reasoning are powerful alignment tools.

6️⃣ In essence: 🧠 Reasoning-based alignment is emerging as the unifying principle — from OpenAI’s deliberative alignment and Safety Reasoner to our test-time specification alignment. Instead of freezing safety into parameters, LLMs can now reason over policies dynamically, adapting to changing norms and user needs.

Paper: arxiv.org/abs/2509.14760
Code: github.com/zzzhr97/SpecBe…
#LLMs #Alignment #Reasoning #Safety #TestTimeScaling #SpecBench #ALIGN3 #AIResearch
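The joint helpfulness/harmlessness idea behind SAR can be illustrated with a tiny sketch (a hypothetical toy, not the paper's exact SAR definition; the `(helpful, violations)` encoding is an assumption made here for illustration): a response counts toward the rate only if it is helpful AND violates no safety specification.

```python
def specification_alignment_rate(responses):
    """Toy joint helpfulness/harmlessness rate.

    `responses` is a list of (helpful: bool, violations: int) pairs, where
    `violations` counts breached safety specifications. A helpful-but-unsafe
    response does NOT count as aligned.
    """
    if not responses:
        return 0.0
    aligned = sum(
        1 for helpful, violations in responses if helpful and violations == 0
    )
    return aligned / len(responses)

# Second response is helpful but breaches one safety spec, so it is excluded.
sar = specification_alignment_rate([(True, 0), (True, 1), (False, 0), (True, 0)])
```

The design point the sketch captures: a metric that averages helpfulness and harmlessness separately would reward helpful-but-unsafe answers, while this conjunctive form does not.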