Daeun Lee

488 posts


@danadaeun

PhD student @unccs advised by @mohitban47 | Intern @AIatMeta, @AdobeResearch | Multimodal, Video, Embodied AI, Post-training, RL

United States · Joined February 2024
563 Following · 530 Followers
Pinned Tweet
Daeun Lee @danadaeun
🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!

🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.

⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.

⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA). 👇🧵
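The training signal sketched in the thread, rewarding both the answer and where the model looked, can be caricatured as a toy reward. This is only an illustration of the idea, not VisionCoach's actual reward; the names `temporal_iou`, `grounded_reward`, and `alpha` are all hypothetical:

```python
def temporal_iou(pred_span, gt_span):
    # Intersection-over-union of two (start, end) frame spans:
    # how well the model's attended span overlaps the key-frame span.
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def grounded_reward(answer_correct, pred_span, gt_span, alpha=0.5):
    # Toy grounded-reasoning reward: answer correctness plus a bonus
    # for attending to the right temporal region, so the policy is
    # not supervised on the final answer alone.
    return float(answer_correct) + alpha * temporal_iou(pred_span, gt_span)
```

A correct answer with perfect grounding scores higher than a correct answer with poor grounding, which is the core intuition behind rewarding visual attention rather than only outcomes.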
Daeun Lee retweeted
Zun Wang @ZunWang919
✨ Excited to share V-Co, a systematic look and recipe for visual co-denoising in pixel-space diffusion. Instead of loosely injecting pretrained features, we study how pixels and semantics should be jointly denoised and properly aligned.

- Dual-stream design for clean interaction
- Structural masking for stronger CFG
- Hybrid loss for richer supervision
- Simple scaling for stable training

A simple yet principled recipe, delivering strong and consistent gains. Details 👇
Han Lin @hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:

1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling

We conducted a systematic study on V-Co, which is highly competitive at a comparable scale and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵👇

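Points 3️⃣ and 4️⃣ of the recipe above (hybrid semantic supervision plus RMS-based feature calibration) can be caricatured in a few lines. This is a toy sketch under my own naming (`rms_rescale`, `co_denoise_loss`, `lam`), not V-Co's actual objective:

```python
import numpy as np

def rms_rescale(feat, target_rms=1.0, eps=1e-8):
    # Calibrate a feature tensor so its RMS matches a target scale,
    # a stand-in for the "RMS-based feature rescaling" step that keeps
    # the semantic branch on the same footing as the pixel branch.
    rms = np.sqrt(np.mean(feat ** 2) + eps)
    return feat * (target_rms / rms)

def co_denoise_loss(pred_pix, tgt_pix, pred_sem, tgt_sem, lam=0.5):
    # Joint co-denoising objective: pixel-space MSE plus a semantic-feature
    # term computed on rescaled features (a stand-in for the hybrid loss).
    pix_term = np.mean((pred_pix - tgt_pix) ** 2)
    sem_term = np.mean((rms_rescale(pred_sem) - rms_rescale(tgt_sem)) ** 2)
    return pix_term + lam * sem_term
```

With identical predictions and targets the loss is zero, and `lam` trades off pixel fidelity against semantic alignment; the rescaling keeps one branch from dominating the gradient purely because of its feature magnitude.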
Daeun Lee retweeted
Shoubin Yu @shoubin621
Check out our new work ⚽️ VisionCoach, an RL + self-distillation framework for complex video reasoning. We combine reinforcement learning with dynamic visual prompting, where a visual prompt selector adaptively augments hard training examples based on reward signals.

Visual grounding is key to accurate video reasoning. Instead of adding complexity at inference, we use visual prompting during training to guide models toward better spatio-temporal attention, then distill this capability into a simple, single-path model.
Quoting Daeun Lee @danadaeun (the pinned VisionCoach thread above)
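The reward-driven selection of hard examples described above can be sketched as a simple filter: route low-reward rollouts toward visual-prompt augmentation. Purely illustrative, with made-up names and threshold:

```python
def select_for_prompting(examples, rewards, threshold=0.3):
    # Pick the hard training examples (those the policy currently earns
    # little reward on) as candidates for visual-prompt augmentation.
    # `threshold` is an arbitrary cutoff for this sketch.
    return [ex for ex, r in zip(examples, rewards) if r < threshold]
```

Easy examples are left untouched, so the training-time "coach" spends its prompting budget where the reward signal says grounding is failing.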
Daeun Lee @danadaeun
✏️ Analysis: Spatio-temporal Attention

- Visual prompting improves spatio-temporal grounding by increasing attention on the correct key frame and focusing on the relevant spatial region.
- It highlights key visual attributes (e.g., the cowboy’s clothing) while suppressing irrelevant regions.

Please refer to the demo video for more qualitative examples!
Daeun Lee retweeted
AK @_akhaliq
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
paper: huggingface.co/papers/2603.07…
Daeun Lee retweeted
AK @_akhaliq
From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
paper: huggingface.co/papers/2602.21…
Daeun Lee retweeted
Rohan Paul @rohanpaul_ai
A new RL framework helps AI agents learn more effectively by paying attention to how uncertain they feel while completing a task. It allows an AI to improve its own performance by using its internal doubt as a guide for better exploration.

Usually, AI agents learn through simple rewards, like getting a point for finishing a job or 0 for failing. The problem is that these simple rewards do not tell the AI which specific steps it got right or where it started to feel confused.

SELAUR changes this by looking at token-level uncertainty, which involves checking how confident the AI is in every single word it picks. It uses measures like entropy, which is a way to see how much the AI is guessing, to create a detailed map of the agent's confidence. If the AI is unsure about a decision, the system reshapes the reward to encourage the agent to try new and more effective paths.

This approach even finds value in failed attempts, because it can identify the exact moment things went wrong during the process. In major tests involving online shopping and home tasks, this technique led to consistently higher success rates than older methods. It makes the learning process much more stable and helps the AI handle complex jobs that require many different steps.

By focusing on what it does not know, the AI can evolve into a much more reliable assistant for real-world digital tasks.

arxiv.org/pdf/2602.21158v1
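The token-level uncertainty idea above can be sketched as entropy-based reward shaping. This is a minimal illustration assuming a simple additive bonus; the names `token_entropy`, `shaped_reward`, and `beta` are mine, not SELAUR's:

```python
import math

def token_entropy(probs):
    # Shannon entropy of one token's predicted distribution:
    # 0 when the model is certain, larger the more it is guessing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def shaped_reward(outcome_reward, step_entropies, beta=0.1):
    # Reshape a sparse outcome reward with the mean per-step uncertainty,
    # so uncertain trajectories still carry a learning signal that
    # encourages exploration at the confused steps.
    bonus = beta * sum(step_entropies) / max(len(step_entropies), 1)
    return outcome_reward + bonus
```

Even a failed episode (outcome reward 0) gets a nonzero shaped reward when some steps were high-entropy, which is one way to read the claim that the method "finds value in failed attempts".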
Daeun Lee retweeted
Rohan Paul @rohanpaul_ai
Multi-token prediction via self-distillation delivers 3x inference speedups on Llama-3.1-8B without draft models, with under a 3% math accuracy drop.

Standard AI models are usually slow because they have to guess the very next word, wait for it to finish, and then guess the word after that in a long, repetitive chain. This new method, multi-token prediction via self-distillation, allows a model to guess several words in a single step without needing any extra helper models.

The model teaches itself to pick out chunks of text, such as "7 days in a week," as one single unit rather than 4 separate pieces. This happens during training, where the model learns to look ahead and predict future words alongside the very next word. In tests involving math word problems, this technique boosted speed by 300% while keeping accuracy within 3% of the original slower version.

One of the biggest advantages is that it works with the same hardware and code as the original model, meaning there is no need for complex new software pipelines. The system uses a confidence-adaptive strategy, meaning it only predicts multiple words when it is very sure about what comes next. If the AI is confused, it simply drops back to predicting 1 word at a time to ensure it does not make mistakes.

This approach eliminates the need for speculative decoding, a common but complex speed-up method that requires running 2 different models at once.

arxiv.org/pdf/2602.06019
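The confidence-adaptive fallback described above can be sketched in a few lines. A toy illustration with hypothetical names, not the paper's implementation:

```python
def adaptive_decode(proposed_tokens, confidence, threshold=0.9):
    # Confidence-adaptive acceptance: emit the whole multi-token chunk
    # only when the model is sure about it; otherwise fall back to the
    # safe single-token path so low-confidence guesses are never committed.
    if confidence >= threshold:
        return proposed_tokens
    return proposed_tokens[:1]
```

High-confidence steps emit several tokens per forward pass (the source of the speedup), while low-confidence steps degrade gracefully to ordinary one-token-at-a-time decoding.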
Daeun Lee retweeted
AK @_akhaliq
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
huggingface.co/papers/2602.21…