Sangwon Jang

47 posts

Sangwon Jang @jangsangwon7

PhD student, KAIST AI

KAIST · Joined April 2022

257 Following · 183 Followers

Pinned Tweet
Sangwon Jang @jangsangwon7
What if your video generator could refine itself at inference time?
❌ No new models. ❌ No retraining. ❌ No external verifier.
💡 Introducing Self-Refining Video Sampling.
By reinterpreting a pretrained generator (Wan2.2, Cosmos) as a denoising autoencoder, we enable iterative self-refinement at inference time ➡️ dramatically improving physical realism and achieving over 70% human preference! 🧵
3 replies · 26 reposts · 181 likes · 39.3K views
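A minimal sketch of the loop this tweet describes, assuming a generic callable denoiser; the interpolation form, `noise_level`, and `num_rounds` are illustrative choices of mine, not the paper's implementation:

```python
import torch

@torch.no_grad()
def self_refine(video: torch.Tensor, denoiser, noise_level: float = 0.4, num_rounds: int = 3):
    """Iterative self-refinement at inference time (hypothetical sketch).

    Treats the pretrained generator as a denoising autoencoder: partially
    re-noise the current sample, then let the model denoise it back. No new
    models, no retraining, no external verifier; only extra sampling steps.
    """
    for _ in range(num_rounds):
        noise = torch.randn_like(video)
        # "Encode": partially corrupt the sample toward the noise distribution.
        noisy = (1.0 - noise_level) * video + noise_level * noise
        # "Decode": denoise back onto the data manifold; the model's own prior
        # can correct implausible content it generated in earlier rounds.
        video = denoiser(noisy, noise_level)
    return video
```

In a diffusion framing, this resembles an SDEdit-style re-noise-and-denoise step applied repeatedly to the model's own output.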
Sangwon Jang reposted
Rhoda AI @rhoda_ai_
To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA).
16 replies · 38 reposts · 196 likes · 56.6K views
Sangwon Jang reposted
George Bredis @BredisGeorge
Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer.
Project page: corl-team.github.io/nedreamer/
Paper: arxiv.org/pdf/2603.02765
Code: github.com/corl-team/nedr…
12 replies · 58 reposts · 366 likes · 47K views
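A hypothetical sketch of the objective contrast the tweet describes, with made-up module names (`encoder`, `predictor`); this is not NE-Dreamer's code. A pixel-reconstruction world model would decode the prediction back to images, whereas here the target is the next observation's embedding:

```python
import torch
import torch.nn.functional as F

def next_embedding_loss(encoder, predictor, obs_t, action_t, obs_t1):
    """Train the world model in latent space: predict the next embedding
    instead of reconstructing the next observation's pixels (sketch)."""
    z_t = encoder(obs_t)                # embedding of the current observation
    with torch.no_grad():
        z_target = encoder(obs_t1)      # target embedding; stop-gradient on target
    z_pred = predictor(z_t, action_t)   # predicted next embedding
    return F.mse_loss(z_pred, z_target)
```

The stop-gradient on the target is one common choice for latent-prediction objectives; the paper may use a different anti-collapse mechanism.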
Sangwon Jang reposted
Standard Intelligence @si_pbc
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
187 replies · 400 reposts · 3.9K likes · 1.1M views
Sangwon Jang reposted
Yuxuan Kuang @yuxuank02
How can we give dexterous hands manipulation capabilities that work across diverse objects, tasks, scenes, camera views, and external perturbations? Excited to share Dex4D, a method for generalizable sim-to-real dexterous manipulation via a task-agnostic point track policy and video generation planners! NO parallel grippers, NO teleop!
Project page: dex4d.github.io
🧵👇 0/10 #Robotics #EmbodiedAI #Manipulation #AI #ComputerVision
1 reply · 29 reposts · 96 likes · 20.1K views
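Reading the tweet's two components as a pipeline, here is a hypothetical sketch (all names are mine; Dex4D's actual interfaces may differ): a video model plans the future, a task-agnostic tracker turns that plan into point tracks, and the policy conditions on the tracks rather than on any task-specific representation:

```python
def plan_and_act(obs, video_planner, track_extractor, policy):
    """Hypothetical plan-with-video, act-with-tracks step (not Dex4D's code)."""
    plan = video_planner(obs)            # imagined future frames for the task
    tracks = track_extractor(plan)       # task-agnostic point tracks from the plan
    return policy(obs, tracks)           # low-level dexterous-hand actions
```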
Sangwon Jang reposted
Xingjian Bai @SimulatedAnneal
Do causal video diffusers really need dense causal attention at every layer, every denoising step? We looked inside and found: no. Causality is separable from denoising. Here are two surprising observations that hold across architectures, training objectives, and scales.
4 replies · 49 reposts · 332 likes · 66.9K views
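If causality really is separable from denoising, one concrete consequence is architectural: a causal video diffuser could apply the frame-causal mask in only a subset of layers or denoising steps and use full bidirectional attention elsewhere. A small sketch of the two mask variants (the helper name and frame-level masking granularity are my assumptions, not the paper's):

```python
import torch

def build_attention_mask(num_frames: int, tokens_per_frame: int, causal: bool) -> torch.Tensor:
    """Additive attention mask, either frame-causal or fully bidirectional.

    Rows index query tokens, columns index key tokens; masked entries get
    -inf so they vanish after the softmax.
    """
    n = num_frames * tokens_per_frame
    mask = torch.zeros(n, n)
    if causal:
        frame_idx = torch.arange(n) // tokens_per_frame
        # Query token i may attend to key token j only if j's frame <= i's frame.
        future = frame_idx.unsqueeze(0) > frame_idx.unsqueeze(1)
        mask = mask.masked_fill(future, float("-inf"))
    return mask
```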
Sangwon Jang reposted
Seonghyeon Ye @SeonghyeonYe
VLAs (from VLMs) ❌ => WAMs (from Video Models) ✅
Why WAMs?
1️⃣ World Physics: VLMs know the internet, but Video Models implicitly model the physical laws essential for manipulation.
2️⃣ The "GPT Direction": VLAs are like BERT (they rely heavily on task-specific post-training). WAMs are like GPT (pre-train & prompt), unlocking incredible zero-shot transfer!
What I want to see in 2026:
📈 Scaling Laws: We will see much clearer scaling laws for robotics compared to VLAs.
🤝 Human-to-Robot Transfer: Unlocking massive transfer capabilities using video as a shared representation space.
🤖 Zero-Shot Mastery: Moving from short-horizon tasks to long-horizon, dexterous manipulation without task-specific demonstrations.
We recently open-sourced the checkpoints, training, and inference code. Dive into the research! 👇
📄 Paper: arxiv.org/abs/2602.15922
💻 Code: github.com/dreamzero0/dre…
🤗 HF: huggingface.co/GEAR-Dreams/Dr…
5 replies · 65 reposts · 517 likes · 74.2K views
Sangwon Jang reposted
Seonghyeon Ye @SeonghyeonYe
We just gave robots "imagination," and the results are wild. 🤯 This robot wasn't trained to untie shoes or shake hands. It's never seen these tasks before. It simply "dreams" the future outcome, then acts to make it real. 🧵👇
3 replies · 22 reposts · 81 likes · 15.6K views
Sangwon Jang @jangsangwon7
@andrew_n_carr @sainingxie Thank you for waiting. I found that this was caused by an issue in transformers==5.0.0 when loading the Wan checkpoint. Using transformers==4.57.3 fixes the error. Thanks for reporting this, and please feel free to DM me if any issues remain!
1 reply · 0 reposts · 2 likes · 52 views
Andrew Carr 🤸 @andrew_n_carr
@jangsangwon7 @sainingxie I just tried again with a vanilla install of the text-to-video pipeline, with poor results. This is the default prompt from the repository (gymnast on the pommel horse).
1 reply · 0 reposts · 0 likes · 76 views
Sangwon Jang @jangsangwon7
Yes, our method is general, and we also tested it on image generation (see the Appendix). We focused on video since it is more challenging than other modalities, especially in terms of physics. Moreover, a video-specific characteristic we call cross-frame consistency (strong spatio-temporal correlation between frames) makes iterative refinement more stable!
0 replies · 0 reposts · 1 like · 26 views
Anthony Zhang @AnthonyZhang123
@sainingxie I'm curious whether this would work for other modalities. Uncertain pixels make sense for video, and I can see how it would fail for images. Do you think such sampling has potential for action generation, or maybe kinematic trajectory generation?
2 replies · 0 reposts · 0 likes · 414 views
Sangwon Jang @jangsangwon7
@andrew_n_carr @sainingxie Thanks for checking. In that case, it’s likely an issue in my code. I don’t have access to my server right now, but I’ll check and follow up soon — really sorry about that.
1 reply · 0 reposts · 0 likes · 64 views
Andrew Carr 🤸 @andrew_n_carr
@jangsangwon7 @sainingxie As far as I can tell, I only changed the prompt: "Two men fight in the street, brawling back and forth with kicks and punches." I'm going to try a fresh install to make sure it wasn't user error or something cached.
1 reply · 0 reposts · 0 likes · 60 views
Sangwon Jang @jangsangwon7
[Result #4] Beyond generation, our method also improves visual reasoning. We see significant gains on the graph-traversal task from [1], which has motion-like characteristics.
[1] Wiedemer et al., "Video models are zero-shot learners and reasoners," arXiv.
1 reply · 1 repost · 11 likes · 1.5K views