Daeun Lee (@danadaeun) - Twitter Profili | Zamantika Mersobahis Locabet

Sabitlenmiş Tweet

Daeun Lee@danadaeun·17 Mar

🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation! 🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence. ⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation. ⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA). 👇🧵

English

1

24

75

10.5K

Daeun Lee retweetledi

Zun Wang@ZunWang919·6d

✨ Excited to share V-Co, a systematic look and recipe for visual co-denoising in pixel-space diffusion. Instead of loosely injecting pretrained features, we study how pixels and semantics should be jointly denoised and properly aligned. - Dual-stream design for clean interaction - Structural masking for stronger CFG - Hybrid loss for richer supervision - Simple scaling for stable training A simple yet principled recipe, delivering strong and consistent gains. Details👇

Han Lin@hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe: 1️⃣ architecture matters a lot --> fully dual-stream JiT 2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG 3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss 4️⃣ calibration is essential --> RMS-based feature rescaling We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇

English

0

8

11

1.3K

Daeun Lee retweetledi

Han Lin@hanlin_hl·18 Mar

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe: 1️⃣ architecture matters a lot --> fully dual-stream JiT 2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG 3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss 4️⃣ calibration is essential --> RMS-based feature rescaling We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇

English

2

41

129

20.6K

Daeun Lee retweetledi

Yue Zhang@zhan1624·17 Mar

🚨Excited to share our work on VisionCoach! -Video reasoning isn’t failing because models can’t reason —it’s failing because they don’t see correctly. -Instead of adding more tools at inference, we teach models how to look during training. ✨VisionCoach = visual prompting (train-time) + RL with self-distillation -> grounded reasoning with tool-free inference

Daeun Lee@danadaeun

🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation! 🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence. ⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation. ⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA). 👇🧵

English

0

5

17

2.9K

Daeun Lee retweetledi

Shoubin Yu@shoubin621·17 Mar

Check out our new work ⚽️VisionCoach, an RL + self-distillation framework for complex video reasoning. We combine reinforcement learning with dynamic visual prompting, where a visual prompt selector adaptively augments hard training examples based on reward signals. Visual grounding is key to accurate video reasoning. Instead of adding complexity at inference, we use visual prompting during training to guide models toward better spatio-temporal attention—then distill this capability into a simple, single-path model.

Daeun Lee@danadaeun

🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation! 🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence. ⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation. ⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA). 👇🧵

English

0

7

18

2.4K

Daeun Lee@danadaeun·17 Mar

Awesome collaboration with @shoubin621 @zhan1624 @mohitban47 @unc_ai_group @unccs Check the full paper for more details! - ArXiv: arxiv.org/abs/2603.14659 - Code: github.com/daeunni/Vision… - Webpage: visioncoach.github.io - @huggingface page: huggingface.co/papers/2603.14… - @huggingface model: huggingface.co/daeunni/Vision…

English

0

2

11

568

Daeun Lee@danadaeun·17 Mar

✏️ Analysis: Spatio-temporal Attention - Visual prompting improves spatio-temporal grounding by increasing attention on the correct key frame and focusing on the relevant spatial region. - It highlights key visual attributes (e.g., the cowboy’s clothing) while suppressing irrelevant regions. Please refer to the demo video for more qualitative examples!

English

1

0

8

278

Daeun Lee@danadaeun·17 Mar

🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation! 🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence. ⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation. ⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA). 👇🧵

English

1

24

75

10.5K

Daeun Lee retweetledi

AK@_akhaliq·10 Mar

V1 Unifying Generation and Self-Verification for Parallel Reasoners paper: huggingface.co/papers/2603.04…

English

3

21

96

27.9K

Daeun Lee retweetledi

AK@_akhaliq·10 Mar

Holi-Spatial Evolving Video Streams into Holistic 3D Spatial Intelligence paper: huggingface.co/papers/2603.07…

English

1

20

86

12.6K

Daeun Lee retweetledi

AK@_akhaliq·5 Mar

Heterogeneous Agent Collaborative Reinforcement Learning huggingface.co/papers/2603.02…

English

4

5

31

7.5K

Daeun Lee retweetledi

AK@_akhaliq·5 Mar

Proact-VL A Proactive VideoLLM for Real-Time AI Companions huggingface.co/papers/2603.03…

English

1

3

42

7K

Daeun Lee retweetledi

AK@_akhaliq·27 Şub

From Statics to Dynamics Physics-Aware Image Editing with Latent Transition Priors paper: huggingface.co/papers/2602.21…

English

1

6

52

7.2K

Daeun Lee retweetledi

Rohan Paul@rohanpaul_ai·26 Şub

A new RL framework helps AI agents learn more effectively by paying attention to how uncertain they feel while completing a task. Allows AI to improve its own performance by using its internal doubt as a guide for better exploration. Usually, AI agents learn through simple rewards, like getting a point for finishing a job or 0 for failing. The problem is that these simple rewards do not tell the AI which specific steps it got right or where it started to feel confused. SELAUR changes this by looking at token-level uncertainty, which involves checking how confident the AI is in every single word it picks. It uses measures like entropy, which is a way to see how much the AI is guessing, to create a detailed map of the agent's confidence. If the AI is unsure about a decision, the system reshapes the reward to encourage the agent to try new and more effective paths. This approach even finds value in failed attempts because it can identify the exact moment things went wrong during the process. In major tests involving online shopping and home tasks, this technique led to consistently higher success rates than older methods. Makes the learning process much more stable and helps the AI handle complex jobs that require many different steps. By focusing on what it does not know, the AI can evolve into a much more reliable assistant for real-world digital tasks. --- arxiv .org/pdf/2602.21158v1

English

12

20

93

5.7K

Daeun Lee retweetledi

Rohan Paul@rohanpaul_ai·26 Şub

Multi-token prediction via self-distillation delivers 3x inference speedups in model weights on Llama-3.1-8B without draft models, with under 3% math accuracy drop. Standard AI models are usually slow because they have to guess the very next word, wait for it to finish, and then guess the word after that in a long, repetitive chain. This new method, called multi-token prediction via self-distillation, allows a model to guess several words in a single step without needing any extra helper models. The model teaches itself to pick out chunks of text, such as "7 days in a week," as one single unit rather than 4 separate pieces. This process happens during training where the model learns to look ahead and predict future words alongside the very next word. In tests involving math word problems, this technique boosted speed by 300% while keeping accuracy within 3% of the original slower version. One of the biggest advantages is that it works with the same hardware and code as the original model, meaning there is no need for complex new software pipelines. The system uses a confidence-adaptive strategy, meaning it only predicts multiple words when it is very sure about what comes next. If the AI is confused, it simply drops back to predicting 1 word at a time to ensure it does not make mistakes. This approach eliminates the need for speculative decoding, which is a common but complex speed-up method that requires running 2 different models at once. --- arxiv .org/pdf/2602.06019

English

12

35

264

15.5K

Daeun Lee retweetledi

AK@_akhaliq·25 Şub

Learning from Trials and Errors Reflective Test-Time Planning for Embodied LLMs huggingface.co/papers/2602.21…

English

3

9

40

19K

Daeun Lee

Keşfet