Yukang Chen (@yukangchen_) - Twitter Profili | Zamantika Mersobahis Locabet

Github Repo: github.com/WeianMao/triat… Paper Link: huggingface.co/papers/2604.04… Homepage: weianmao.github.io/tri-attention-…

English

0

3

18

2.8K

Yukang Chen@yukangchen_·1d

We’re thrilled to open-source TriAttention! 🚀 🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally 💻Full code open-source & vLLM-ready for one-click deployment ⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage TriAttention is a novel KV cache compression method built on rigorous trigonometric analysis in the Pre‑RoPE space for efficient LLM long reasoning. Github Repo: github.com/WeianMao/triat… Paper Link: huggingface.co/papers/2604.04… Homepage: weianmao.github.io/tri-attention-…

English

21

115

877

89.3K

Yukang Chen retweetledi

Baifeng@baifeng_shi·24 Mar

Humans can see in high-res, high-FPS in real-time. Why can't VLMs? Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos. 📄 arxiv.org/abs/2603.12254 🌐 autogaze.github.io 🤗 huggingface.co/collections/bf… (1/n)🧵

English

48

199

1.6K

149.6K

Yukang Chen@yukangchen_·9 Şub

Thrilled to share that LongLive (github.com/NVlabs/LongLive) has reached 1K+ GitHub Stars ⭐ and been accepted by ICLR 2026 🎉! This is my first project as a last author, which came with much broader responsibilities: deciding research directions, coordinating human & GPU resources, planning open-source timelines, designing demos, outreach, and promotion 📢. Huge congrats to our first-author @ShuaiYa68505475 , who’s graduating this year and landed offers from top companies including NVIDIA—well-deserved! Grateful for all the support from my advisors and collaborators 🤝❤️. In the new year, we’ll continue to dive deep into Long Video Generation + Efficient AI 🎥⚡, steadily pushing forward research, open-source, and real-world impact 🌱.

English

3

62

3.3K

Yukang Chen@yukangchen_·3 Ara

🚀 Our NeurIPS 2025 paper “Scaling RL to Long Videos” will be presented at Poster #4709! 📅 Dec 5, AM – PM (PT) 📍 Exhibit Hall C,D,E — Poster #4709 We build a full-stack framework that scales RL to long-video reasoning in VLMs. • 📘 LongVideo-Reason: long-video QA pairs with high-quality reasoning • ⚙️ MR-SP system: sequence parallelism + vLLM engine • 🚄 Up to 2.1× RL speedup • 🎥 Supports 8,192 frames per video + hour-long RL on 8 A100 GPUs • 🏆 New model LongVILA-R1-7B on VideoMME (65.1% / 71.1%) 🔗 Paper: arxiv.org/abs/2507.07966 🔗 Code: github.com/NVlabs/Long-RL

English

0

2

14

2.1K

Yukang Chen@yukangchen_·5 Kas

@CliffLattner We use BF16 LoRA for more accurate training—it performs much better than training directly in NVFP4. Our activations are not in NVFP4; only the weights are, and we did not train with NVFP4.

English

0

1

42

Cliff Lattner@CliffLattner·4 Kas

@yukangchen_ Could you not just use NVFP4 for activations and weights, and then gradients in BF16? Why do you need LoRA?

English

1

0

39

Yukang Chen@yukangchen_·14 Eki

We open-sourced QeRL — Quantization-enhanced Reinforcement Learning ! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16-level accuracy 🔥 Supports NVFP4 quantization format Moreover, we show that quantization helps exploration in RL training. Paper: huggingface.co/papers/2510.11… Code: github.com/NVlabs/QeRL #NVIDIA #AIResearch #ReinforcementLearning #Quantization #LLM #EfficientAI

English

12

65

351

69.6K

Yukang Chen@yukangchen_·4 Kas

@paulcx Theoretically support, we did not try yet.

English

0

52

Paul Chen@paulcx·4 Kas

@yukangchen_ Does it support on-policy distillation?

English

1

0

135

Yukang Chen@yukangchen_·4 Kas

@CliffLattner LoRA branch using bf16 enables the backward of quantized RL

English

1

0

91

Cliff Lattner@CliffLattner·4 Kas

@yukangchen_ Hm, this seems like it is really two or three papers all in one. You apply the noise vector after weight quantization, so the quantization format of the weight is pretty irrelevant. Not to mention that LoRA seems pretty orthogonal to all of this.

English

1

0

175

Yukang Chen@yukangchen_·20 Eki

Great work!

HanRong YE @ Nvidia@leoyerrrr

OmniVinci is now #1 paper on Huggingface!!! 🤗 Building omni-modal LLMs is MORE than just mixing tokens 😉 At @NVIDIA, we explored deeper possibilities in building truly omni-modal systems — leading to OmniVinci-9B, which introduces three key innovations: - OmniAlignNet – a unified vision–audio alignment module powered by contrastive learning - Temporal Embedding Grouping & Constrained Rotary Time Embedding – enabling absolute and relative temporal representation across multimodal tokens - To support this, we curated a 24M-sample omni-modal dataset and developed a new large-scale data engine for efficient labeling. 🔍 Key Findings: - Audio understanding significantly enhances video comprehension - Audio signals improve omni-modal reinforcement learning - Modality-specific captioning falls short — true understanding demands omni-modal context 📈 Results: OmniVinci-9B outperforms Qwen2.5-Omni across omni-modal, vision, and audio benchmarks — using only 1/6 of the training tokens. #LLMs #AI #ML

English

1

0

11

2K

Yukang Chen@yukangchen_·20 Eki

@leoyerrrr @nvidia Great work!

English

0

1

197

HanRong YE @ Nvidia@leoyerrrr·20 Eki

OmniVinci is now #1 paper on Huggingface!!! 🤗 Building omni-modal LLMs is MORE than just mixing tokens 😉 At @NVIDIA, we explored deeper possibilities in building truly omni-modal systems — leading to OmniVinci-9B, which introduces three key innovations: - OmniAlignNet – a unified vision–audio alignment module powered by contrastive learning - Temporal Embedding Grouping & Constrained Rotary Time Embedding – enabling absolute and relative temporal representation across multimodal tokens - To support this, we curated a 24M-sample omni-modal dataset and developed a new large-scale data engine for efficient labeling. 🔍 Key Findings: - Audio understanding significantly enhances video comprehension - Audio signals improve omni-modal reinforcement learning - Modality-specific captioning falls short — true understanding demands omni-modal context 📈 Results: OmniVinci-9B outperforms Qwen2.5-Omni across omni-modal, vision, and audio benchmarks — using only 1/6 of the training tokens. #LLMs #AI #ML

English

11

27

151

22.3K

Yukang Chen@yukangchen_·18 Eki

I will give a talk at the ICCV HiGen Workshop at 11:30 AM (HST) on October 19. All are welcome to join.🎤

English

0

13

821

Yukang Chen@yukangchen_·17 Eki

@Amandee59573123 Good idea! But giving the model all past frames can help consistency, but it kills real-time. KV grows linearly with time, and each new frame’s attention cost grows with context (per-step ≈ O(T)). For frame-level AR, FPS tanks and VRAM becomes the bottleneck.

English

0

1

70

Amandeep Kumar@Amandeep__kumar·16 Eki

@yukangchen_ Great work! 👏 Instead of using sink tokens as a global prior, what if we exposed the model to all previous frames during next-frame generation? It might boost temporal consistency- through K-V cache would grow fast, but if we can manage the cache. Curious to hear your thoughts.

English

1

0

121

Yukang Chen@yukangchen_·16 Eki

The Convergence of “Understanding × Generation” in Long Video — Attention Sink ✨🎬🧠 We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (github.com/mit-han-lab/st…) and long-video generation LongLive (github.com/NVlabs/LongLive). Both papers validated the effectiveness of Attention Sink (originating from StreamingLLM - arxiv.org/pdf/2309.17453) through experiments and adopted it as a core component. As a co-author on both works, I’d like to briefly introduce how Attention Sink is used in long-video understanding and generation, and how this differs from its usage in LLMs. 🔗📄 1. Why are long videos hard? 🤔⏳ Long video is an “ultra-long context” scenario. Whether for StreamingVLM’s understanding or LongLive’s generation, we deal with millions of tokens. Full attention makes computation explode—training and inference costs become prohibitive, and real-time/interactive use is essentially impossible. We therefore need an approach that preserves quality while remaining efficient. ⚖️⚡ 2. What is Attention Sink? 🧲🧩 Attention Sink was first proposed in the LLM setting by StreamingLLM: insert a set of “anchor” tokens (sink tokens) early in the attention sequence and increase their salience (e.g., larger key norms or special embeddings) so that tokens at any later position can reliably attend back to these global-memory anchors. Combined with Window Attention, the model’s logits are less likely to collapse when prompts change, yielding more stable behavior; the extra overhead is nearly cost-free, because the number of sink tokens is fixed. 🧮✅ 3. On the “understanding” side: How does StreamingVLM use it? 🧐🎥 Attention Sink + Sliding Window. The sink serves as a global prior for long-video understanding, persistently retaining information that does not quickly become outdated (e.g., players in a sports broadcast), improving stability across shots. 📈 4. On the “generation” side: How does LongLive use it? 🎨⚙️ Attention Sink + Window Attention + KV-recache. The sink acts as a global prior in long-video generation, maintaining stylistic and narrative consistency during generation; KV-recache refreshes the cache at prompt switch points to ensure smooth transitions. 🔁🎞️ 5. Same hammer, different nails 🔨🔩 • In long-video understanding, the sink functions like a retrieval prior, helping the model stay on the main storyline. 🧭 • In long-video generation, the sink acts like a visual metronome, keeping overall style from drifting. 🎼 6. How it differs from Attention Sink in StreamingLLM 🔍📚 In both long-video understanding and generation, the usage barrier is higher than in LLMs. • On one hand, in LLMs it can be used inference-only without training; in StreamingVLM and LongLive, we need fine-tuning to adapt the model to this mechanism. 🛠️ • On the other hand, there are more sink tokens: for example, in LongLive we construct sink tokens in the first 3 frames, leading to more sinks than in StreamingLLM. 📦 One reason is that pure text models are trained on corpora with natural anchors like BOS, paragraph openings, and titles, so early-position signals are already strong in attention statistics. Video data lacks a stable “global-anchor paradigm” (frames are homogeneous streams and scenes vary widely), so injecting sinks at inference time can easily mismatch—hence the need for fine-tuning to “teach” the model how to use them. 🎯 #LongVideoGeneration #LongVideoUnderstanding #RealTimeGeneration #Multimodal

English

2

12

72

11.5K

Yukang Chen@yukangchen_·16 Eki

@jo32 Great! Thanks!

English

0

2

367

jo32@jo32·16 Eki

@yukangchen_ 32B LLM training on 1 H100 GPU means a lot! Congrats! I made a video for anyone who wants to know what it’s about: explain.getmegaportal.com/arxiv/2510.116…

English

2

0

6

458

Yukang Chen@yukangchen_·16 Eki

StreamingVLM - github.com/mit-han-lab/st… LongLive - github.com/NVlabs/LongLive

English

0

7

448

Yukang Chen@yukangchen_·15 Eki