Sangwon Jang

47 posts

Sangwon Jang @jangsangwon7

PhD student, KAIST AI

KAIST · Joined April 2022

257 Following · 183 Followers

Pinned Tweet
Sangwon Jang @jangsangwon7
What if your video generator could refine itself at inference time?
❌ No new models. ❌ No retraining. ❌ No external verifier.
💡 Introducing Self-Refining Video Sampling.
By reinterpreting a pretrained generator (Wan2.2, Cosmos) as a denoising autoencoder, we enable iterative self-refinement at inference time ➡️ dramatically improving physical realism and achieving over 70% human preference! 🧵
3 replies · 26 reposts · 181 likes · 39.3K views
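A minimal sketch of the loop this tweet describes, assuming a generic callable denoiser; the interpolation form, `noise_level`, and `num_rounds` are illustrative choices of mine, not the paper's implementation:

```python
import torch

@torch.no_grad()
def self_refine(video: torch.Tensor, denoiser, noise_level: float = 0.4, num_rounds: int = 3):
    """Iterative self-refinement at inference time (hypothetical sketch).

    Treats the pretrained generator as a denoising autoencoder: partially
    re-noise the current sample, then let the model denoise it back. No new
    models, no retraining, no external verifier; only extra sampling steps.
    """
    for _ in range(num_rounds):
        noise = torch.randn_like(video)
        # "Encode": partially corrupt the sample toward the noise distribution.
        noisy = (1.0 - noise_level) * video + noise_level * noise
        # "Decode": denoise back onto the data manifold; the model's own prior
        # can correct implausible content it generated in earlier rounds.
        video = denoiser(noisy, noise_level)
    return video
```

In a diffusion framing, this resembles an SDEdit-style re-noise-and-denoise step applied repeatedly to the model's own output.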
Sangwon Jang reposted
Rhoda AI @rhoda_ai_
To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA).
16 replies · 38 reposts · 196 likes · 56.6K views
Sangwon Jang reposted
George Bredis @BredisGeorge
Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer.
Project page: corl-team.github.io/nedreamer/
Paper: arxiv.org/pdf/2603.02765
Code: github.com/corl-team/nedr…
12 replies · 58 reposts · 366 likes · 47K views
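A hypothetical sketch of the objective contrast the tweet describes, with made-up module names (`encoder`, `predictor`); this is not NE-Dreamer's code. A pixel-reconstruction world model would decode the prediction back to images, whereas here the target is the next observation's embedding:

```python
import torch
import torch.nn.functional as F

def next_embedding_loss(encoder, predictor, obs_t, action_t, obs_t1):
    """Train the world model in latent space: predict the next embedding
    instead of reconstructing the next observation's pixels (sketch)."""
    z_t = encoder(obs_t)                # embedding of the current observation
    with torch.no_grad():
        z_target = encoder(obs_t1)      # target embedding; stop-gradient on target
    z_pred = predictor(z_t, action_t)   # predicted next embedding
    return F.mse_loss(z_pred, z_target)
```

The stop-gradient on the target is one common choice for latent-prediction objectives; the paper may use a different anti-collapse mechanism.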
Sangwon Jang reposted
Standard Intelligence @si_pbc
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
187 replies · 400 reposts · 3.9K likes · 1.1M views
Sangwon Jang reposted
Yuxuan Kuang @yuxuank02
How can we give dexterous hands manipulation capabilities that work across diverse objects, tasks, scenes, camera views, and external perturbations? Excited to share Dex4D, a method for generalizable sim-to-real dexterous manipulation via a task-agnostic point track policy and video generation planners! NO parallel grippers, NO teleop!
Project page: dex4d.github.io
🧵👇 0/10 #Robotics #EmbodiedAI #Manipulation #AI #ComputerVision
1 reply · 29 reposts · 96 likes · 20.1K views
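Reading the tweet's two components as a pipeline, here is a hypothetical sketch (all names are mine; Dex4D's actual interfaces may differ): a video model plans the future, a task-agnostic tracker turns that plan into point tracks, and the policy conditions on the tracks rather than on any task-specific representation:

```python
def plan_and_act(obs, video_planner, track_extractor, policy):
    """Hypothetical plan-with-video, act-with-tracks step (not Dex4D's code)."""
    plan = video_planner(obs)            # imagined future frames for the task
    tracks = track_extractor(plan)       # task-agnostic point tracks from the plan
    return policy(obs, tracks)           # low-level dexterous-hand actions
```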
Sangwon Jang reposted
Xingjian Bai @SimulatedAnneal
Do causal video diffusers really need dense causal attention at every layer, every denoising step? We looked inside and found: no. Causality is separable from denoising. Here are two surprising observations that hold across architectures, training objectives, and scales.
4 replies · 49 reposts · 332 likes · 66.9K views
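If causality really is separable from denoising, one concrete consequence is architectural: a causal video diffuser could apply the frame-causal mask in only a subset of layers or denoising steps and use full bidirectional attention elsewhere. A small sketch of the two mask variants (the helper name and frame-level masking granularity are my assumptions, not the paper's):

```python
import torch

def build_attention_mask(num_frames: int, tokens_per_frame: int, causal: bool) -> torch.Tensor:
    """Additive attention mask, either frame-causal or fully bidirectional.

    Rows index query tokens, columns index key tokens; masked entries get
    -inf so they vanish after the softmax.
    """
    n = num_frames * tokens_per_frame
    mask = torch.zeros(n, n)
    if causal:
        frame_idx = torch.arange(n) // tokens_per_frame
        # Query token i may attend to key token j only if j's frame <= i's frame.
        future = frame_idx.unsqueeze(0) > frame_idx.unsqueeze(1)
        mask = mask.masked_fill(future, float("-inf"))
    return mask
```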
Sangwon Jang reposted
Seonghyeon Ye @SeonghyeonYe
VLAs (from VLMs) ❌ => WAMs (from Video Models) ✅
Why WAMs?
1️⃣ World Physics: VLMs know the internet, but Video Models implicitly model the physical laws essential for manipulation.
2️⃣ The "GPT Direction": VLAs are like BERT (they rely heavily on task-specific post-training). WAMs are like GPT (pre-train & prompt), unlocking incredible zero-shot transfer!
What I want to see in 2026:
📈 Scaling Laws: We will see much clearer scaling laws for robotics compared to VLAs.
🤝 Human-to-Robot Transfer: Unlocking massive transfer capabilities using video as a shared representation space.
🤖 Zero-Shot Mastery: Moving from short-horizon tasks to long-horizon, dexterous manipulation without task-specific demonstrations.
We recently open-sourced the checkpoints, training, and inference code. Dive into the research! 👇
📄 Paper: arxiv.org/abs/2602.15922
💻 Code: github.com/dreamzero0/dre…
🤗 HF: huggingface.co/GEAR-Dreams/Dr…
5 replies · 65 reposts · 517 likes · 74.2K views
Sangwon Jang reposted
Seonghyeon Ye @SeonghyeonYe
We just gave robots "imagination," and the results are wild. 🤯 This robot wasn't trained to untie shoes or shake hands. It's never seen these tasks before. It simply "dreams" the future outcome, then acts to make it real. 🧵👇
3 replies · 22 reposts · 81 likes · 15.6K views
Sangwon Jang @jangsangwon7
@andrew_n_carr @sainingxie Thank you for waiting. I found that this was caused by an issue in transformers==5.0.0 when loading the Wan checkpoint. Using transformers==4.57.3 fixes the error. Thanks for reporting this, and please feel free to DM me if any issues remain!
1 reply · 0 reposts · 2 likes · 52 views
Andrew Carr 🤸 @andrew_n_carr
@jangsangwon7 @sainingxie I just tried again with a vanilla install of the text-to-video pipeline, with poor results. This is the default prompt from the repository (gymnast on the pommel horse).
1 reply · 0 reposts · 0 likes · 76 views
Sangwon Jang @jangsangwon7
Yes, our method is general, and we also tested it on image generation (see the Appendix). We focused on video since it is more challenging than other modalities, especially in terms of physics. Moreover, a video-specific characteristic we call cross-frame consistency (strong spatio-temporal correlation between frames) makes iterative refinement more stable!
0 replies · 0 reposts · 1 like · 26 views
Anthony Zhang @AnthonyZhang123
@sainingxie I'm curious whether this would work for other modalities. Uncertain pixels make sense for video, and I can see how it would fail for images. Do you think such sampling has potential for action generation, or maybe kinematic trajectory generation?
2 replies · 0 reposts · 0 likes · 414 views
Sangwon Jang @jangsangwon7
@andrew_n_carr @sainingxie Thanks for checking. In that case, it’s likely an issue in my code. I don’t have access to my server right now, but I’ll check and follow up soon — really sorry about that.
1 reply · 0 reposts · 0 likes · 64 views
Andrew Carr 🤸 @andrew_n_carr
@jangsangwon7 @sainingxie As far as I can tell, I only changed the prompt: "Two men fight in the street, brawling back and forth with kicks and punches." I'm going to try a fresh install to make sure it wasn't user error or something cached.
1 reply · 0 reposts · 0 likes · 60 views
Sangwon Jang @jangsangwon7
[Result #4] Beyond generation, our method also improves visual reasoning. We see significant gains on the graph-traversal task from [1], which has motion-like characteristics.
[1] Wiedemer et al., "Video models are zero-shot learners and reasoners," arXiv.
1 reply · 1 repost · 11 likes · 1.5K views