DENG Lab @ SJTU

73 posts

@SJTUDengLab

https://t.co/kWvEXvQgXu

Shanghai · Joined November 2024

74 Following · 120 Followers

DENG Lab @ SJTU@SJTUDengLab·6 Mar

🔗 Gateways 🌐 Web: sjtu-deng-lab.github.io/LightningRL 📄 Paper: sjtu-deng-lab.github.io/LightningRL/pa… (arXiv soon) 💻 Code: github.com/SJTU-DENG-Lab/… 🤗 Models: huggingface.co/collections/SJ…

DENG Lab @ SJTU@SJTUDengLab·6 Mar

🧠 LightningRL’s 3 fixes: 1️⃣ Decoupled Reward Norm: Balances acc & speed rewards. 2️⃣ Token NLL Anchoring: Sparse → dense gradients. 3️⃣ TPF-Aware Sampling: High-variance groups with ≥1 correct.
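One possible reading of fix 1️⃣, as a minimal sketch: within a group of sampled trajectories, the accuracy reward and the speed reward are each standardized separately before being combined, so neither signal dominates just because of its scale. The function names, the `speed_weight` parameter, and the use of TPF as the speed reward are assumptions for illustration, not details confirmed by the thread.

```python
from statistics import mean, pstdev

def zscore(values, eps=1e-8):
    """Standardize one reward signal within a sampling group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def decoupled_advantages(acc_rewards, speed_rewards, speed_weight=1.0):
    """Normalize accuracy and speed separately, then combine (assumed scheme)."""
    acc_adv = zscore(acc_rewards)
    speed_adv = zscore(speed_rewards)
    return [a + speed_weight * s for a, s in zip(acc_adv, speed_adv)]

# Hypothetical group of 4 trajectories: correctness (0/1) and tokens per forward pass.
acc = [1.0, 0.0, 1.0, 0.0]
tpf = [2.0, 8.0, 6.0, 4.0]
adv = decoupled_advantages(acc, tpf)
best = max(range(len(adv)), key=lambda i: adv[i])
print(best, [round(a, 3) for a in adv])
```

In this toy group the trajectory that is both correct and fast (index 2) receives the largest advantage, while the fast but wrong one (index 1) is only mildly rewarded.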

DENG Lab @ SJTU@SJTUDengLab·6 Mar

🔥 How do we make dLLMs both FAST and ACCURATE? Meet LightningRL ⚡️! We use RL to push the speed-quality frontier of dLLMs, teaching the model to find the "fastest & correct" path among sampling trajectories. 🚀 Faster than EAGLE-3 🏆 Outperforms Qwen2.5 Dive in below 🧵👇

DENG Lab @ SJTU@SJTUDengLab·18 Jan

🚀 Introducing Think-Then-Generate (T2G): Transforming Qwen-Image into an open-source NanoBanana!
Qwen-Image excels at text rendering, but in practice we find it struggles with implicit and non-descriptive prompts (see figure). We find that a closed-source LLM is needed to rewrite prompts descriptively for the diffusion transformer (DiT) renderer. However, the disconnect between the LLM and the DiT can lead to imperfections.
🔧 Our solution: T2G (Think First, Generate Second)! T2G overcomes this by empowering the text encoder in Qwen-Image itself (i.e., Qwen2.5-VL) to think first and then generate with the DiT, and by introducing a multimodal GRPO strategy (Dual-GRPO) to strengthen seamless, self-driven reasoning in Qwen2.5-VL.
🎨 Check out some results first:
- Idiom Comics: Input “A multi-panel comic showing ‘playing the lute to a cow’” -> Not just images of cows and instruments, but a dynamic, narrative-driven comic that accurately develops the idiom’s context.
- Math & Physics Teaching Example: Input “A math teacher explaining the equation 2x − 4 = 10 on the blackboard.” -> Not random elements, but a fully structured blackboard with clear steps and a teacher scene, accurately capturing the teaching process.
📖 Paper: arxiv.org/abs/2601.10332
💻 Github: github.com/SJTU-DENG-Lab/…

DENG Lab @ SJTU@SJTUDengLab·18 Jan

🔗 Connection with GLM-Image: The recently released GLM-Image adopts a similar approach: it maximizes the semantic capture of the LLM text encoder, using it as the "brain" to understand semantics, while the DiT focuses on being a faithful visual renderer. Our method reasons directly in textual space to interpret complex user instructions, while GLM-Image focuses on capturing semantics within the visual token space. Here is our gallery comparing results from different models: sjtu-deng-lab.github.io/Think-Then-Gen…. Feel free to browse more comparisons. 😉

DENG Lab @ SJTU@SJTUDengLab·18 Jan

📊 Experimental Results: T2G + Dual-GRPO delivers stunning performance! After joint training with SFT + Dual-GRPO, our method shines across various T2I and image editing tasks:
- WISE Benchmark: Highest score among open-source models at 0.79, nearly on par with GPT-4o, +30% improvement over original Qwen-Image.
- T2I-ReasonBench: Achieved 92.2 overall score, outperforming closed-source models like Gemini-2.0.
- RISEBench & UniREditBench: Significant performance boost in editing tasks with just 5000 samples post-RL training.

DENG Lab @ SJTU reposted

Ethan Chern@ethanchern·30 Dec

"Failure is just iteration. No explosion, no innovation. Keep going."🚀 You vent to @elonmusk—he looks you in the eye and replies instantly, like a video call. Introducing LiveTalk: real-time video gen system on a GPU that sees you, reads emotion, and responds in real time.🧵👇

DENG Lab @ SJTU@SJTUDengLab·23 Dec

LoPA: Scaling Diffusion LLM Single-Sample Throughput to Over 1000 TPS
We introduce LoPA, a training-free algorithm that breaks the parallelism bottleneck in Diffusion LLMs.
📜 Paper: arxiv.org/abs/2512.16229
💻 Code: github.com/zhijie-group/L…
🗞️ Blog: zhijie-group.github.io/blogs/lopa/

DENG Lab @ SJTU@SJTUDengLab·23 Dec

⚡️ 1000+ Tokens/s
Theoretical TPF translates to real wall-clock acceleration. Under multi-device deployment, LoPA-Dist achieves a staggering 1073.9 tokens/s single-sample throughput on Ascend 910C.

DENG Lab @ SJTU@SJTUDengLab·23 Dec

The Scaling Curve
With an optimal k, LoPA substantially scales the inference speed.
🚀 Dream: TPF rockets from ~2.5 (vanilla D2F) to 10.1 on GSM8K.
🚀 DiffuCoder: TPF reaches 8.3 on HumanEval+ with minimal impact on generation quality.

DENG Lab @ SJTU@SJTUDengLab·23 Dec

LoPA-Dist: Engineered for Scale
We built LoPA-Dist with BP to handle the load:
🎞️ NVIDIA GPUs: Implements a two-phase update protocol to ensure KV cache consistency.
🎞️ Ascend 910C: Uses graph compilation and block-wise masking for high-throughput serving.

DENG Lab @ SJTU@SJTUDengLab·23 Dec

LoPA changes this in 3 simple steps:
🚀 Spawn Futures: Generate multiple Lookahead Branches exploring different token orders.
🔍 Parallel Verify: Check branches in a single forward pass.
🏆 Pick the Winner: Keep the branch with the highest confidence to unlock max parallelism.
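The three steps above can be sketched as a toy, assuming a branch is "the most confident position plus one alternative position" and that a branch is scored by the mean confidence of the positions it leaves masked. Both choices, along with `toy_future_conf` (a stand-in for the model), are illustrative assumptions rather than the paper's exact procedure.

```python
from typing import Callable, Dict, FrozenSet, List

def spawn_branches(masked: List[int], conf: Dict[int, float], k: int) -> List[FrozenSet[int]]:
    """Spawn k branches: each commits the top position plus one other candidate."""
    ranked = sorted(masked, key=lambda p: conf[p], reverse=True)
    anchor, extras = ranked[0], ranked[1:k + 1]
    return [frozenset({anchor, e}) for e in extras]

def branch_score(branch: FrozenSet[int], masked: List[int],
                 future_conf: Callable[[FrozenSet[int]], Dict[int, float]]) -> float:
    """Mean confidence over positions still masked after committing the branch."""
    remaining = [p for p in masked if p not in branch]
    if not remaining:
        return 1.0
    conf_after = future_conf(branch)
    return sum(conf_after[p] for p in remaining) / len(remaining)

def pick_winner(branches, masked, future_conf):
    # A real system would score all branches in one batched forward pass;
    # this toy simply calls the stand-in once per branch.
    return max(branches, key=lambda b: branch_score(b, masked, future_conf))

def toy_future_conf(branch: FrozenSet[int]) -> Dict[int, float]:
    """Hypothetical model: a position becomes confident once a neighbor is filled."""
    return {p: 0.9 if (p - 1 in branch or p + 1 in branch) else 0.3 for p in range(6)}

masked = list(range(6))
conf = {0: 0.95, 1: 0.5, 2: 0.6, 3: 0.4, 4: 0.7, 5: 0.2}
branches = spawn_branches(masked, conf, k=3)
winner = pick_winner(branches, masked, toy_future_conf)
print(sorted(winner))
```

The winning branch is the one whose fill order leaves the rest of the sequence easiest to decode, which is what allows more tokens to be accepted in the following step.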

DENG Lab @ SJTU@SJTUDengLab·23 Dec

🚧The Bottleneck: Current dLLMs use "greedy" strategies, filling only high-confidence tokens. This limits parallelism to just 1-3 tokens per step.
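A minimal sketch of what such a "greedy" confidence rule might look like: commit every masked position whose confidence clears a threshold, and fall back to the single best position so that decoding always makes progress. The `threshold` value and the function name are assumptions for illustration.

```python
from typing import List

def greedy_fill(confidences: List[float], threshold: float = 0.9) -> List[int]:
    """Return the indices to commit this step under a confidence threshold."""
    chosen = [i for i, c in enumerate(confidences) if c >= threshold]
    if not chosen:
        # Always commit at least one token so decoding cannot stall.
        chosen = [max(range(len(confidences)), key=lambda i: confidences[i])]
    return chosen

step_a = greedy_fill([0.95, 0.40, 0.92, 0.30])
step_b = greedy_fill([0.20, 0.50])
print(len(step_a), len(step_b))
```

With only a few positions clearing the threshold per step, the tokens-per-forward-pass count stays small, which is the bottleneck the post describes.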

DENG Lab @ SJTU@SJTUDengLab·23 Dec

Meet LoPA: A training-free algorithm that breaks the speed limit of dLLMs.
LoPA scales Diffusion LLM inference to 10.1 TPF and 1000+ TPS!
🚀 10.1 Tokens Per Forward pass (TPF) on GSM8K.
🚀 1073.9 tokens/s throughput on multi-device systems.
🚀 SOTA speed without retraining.

DENG Lab @ SJTU@SJTUDengLab·18 Dec

🫡 The ULTIMATE AR LLM adaptation for native parallel decoders! 🔥 Way higher generation quality than diffusion LLMs, with zero tradeoffs on speed. Craving top-tier quality and blazingly fast inference? Jacobi Forcing is your answer—try it now!

Hao AI Lab@haoailab

Jacobi Forcing: training AR models as diffusion-style parallel decoders with 4x speedup while staying causal and maintaining high generation quality. 🚀🎯 Autoregressive (AR) LLMs and diffusion LLMs each have their own strengths. We analyze each method's pros and cons and ask the question: can we get the best of both worlds by turning an AR model into a causal, native parallel decoder? Our answer is YES. 👉 Read the full story here: hao-ai-lab.github.io/blogs/jacobi-f…

DENG Lab @ SJTU reposted

Hao Zhang@haozhangml·18 Dec

One of the most interesting things I’ve been working on recently: Jacobi Forcing, a recipe that turns any autoregressive (AR) LLM into a native, causal parallel decoder. (I am glad to send it out to wrap up a great 2025 📣📣)
There's a lot of buzz around diffusion LLMs (and yes, we work on those, too 😃). They’re exciting because they can decode many tokens in parallel. But in practice, there are some big cons:
* A quality gap vs. strong AR baselines is still common.
* Systems mismatch: non-causal attention often breaks the “free wins” we’ve spent the last ~2 years optimizing in serving stacks (kernels, batching, etc.).
Speculative decoding (SD) is also great, but most people building real serving systems have probably heard this: the speedup SD achieves in a real system is much lower than what SoTA paper headlines report. See these threads:
* github.com/sgl-project/sg…
* github.com/vllm-project/v…
A big reason is that SD introduces an extra drafting/verification procedure (and sometimes a draft model), which makes scheduling/orchestration much harder to do efficiently at scale.
Jacobi Forcing finds a very nice sweet spot as a "middle path":
* It keeps the causal (left -> right) generation order, so the model stays close to the AR distribution (and, from the systems perspective, to existing compute kernels and scheduling).
* But it behaves diffusion-like in how much it decodes per forward pass: multi-token generation without adding drafting heads / models (tokens per forward pass in our current version can go up to 5).
* Here I emphasize "native": the model itself learns to parallel-decode, which makes integration into existing serving engines much cleaner. This addresses a big pain point of integrating SD into serving systems.
High-level idea: we use the model’s own Jacobi decoding trajectory and progressively distill a sequential decoder into a parallel one, hence the name Jacobi Forcing.
We hypothesize that the causal decoding order is crucial to keeping generation quality as high as the original model's, which turns out to be the case in our empirical results.
The blog goes into the full recipe (noise schedule, training mask, and the inference tricks that make it actually fast in wall-clock time). If you’ve been thinking about the AR vs. SD vs. dLLM design space, I would love to hear your take, especially on whether keeping causal order is a key ingredient for preserving quality.
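For readers unfamiliar with the Jacobi decoding that the recipe builds on, here is a minimal sketch of the plain fixed-point iteration (not the Jacobi Forcing training procedure itself). `toy_next_token` is a hypothetical deterministic stand-in for a model's greedy next-token choice; in a real model, each iteration would be one causal forward pass over the whole draft.

```python
from typing import Callable, List, Tuple

def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for the greedy argmax of an AR model."""
    return (3 * prefix[-1] + len(prefix)) % 7

def ar_decode(prompt: List[int], n: int,
              next_token: Callable[[List[int]], int] = toy_next_token) -> List[int]:
    """Standard sequential decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt: List[int], n: int,
                  next_token: Callable[[List[int]], int] = toy_next_token) -> Tuple[List[int], int]:
    """Refine all n draft tokens in parallel until they stop changing."""
    guess = [0] * n  # arbitrary initial draft
    for step in range(1, n + 1):
        # Every position is updated from the previous draft, keeping causal order.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached
            return guess, step
        guess = new
    return guess, n  # after n updates the first n tokens are guaranteed correct

tokens, steps = jacobi_decode([1, 2], 5)
print(tokens == ar_decode([1, 2], 5), steps)
```

The fixed point matches the sequential greedy output because, after each iteration, at least one more leading token is correct. A speedup only appears when the model gets several tokens right per iteration, which is the behavior the post says Jacobi Forcing trains for.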

Hao AI Lab@haoailab
