

Ziqi Ma
@ziqi__ma
PhD student @caltech, research intern @AIatMeta, previously @microsoft. https://t.co/S1J9LItcXw

Today’s video world models “simulate” the world by generating pixel-frame observations🖼️. Can they continue to simulate the world when observations are interrupted, such as by occlusion, illumination dimming, or camera lookaway? To probe this question, we release STEVO-Bench, which holistically evaluates whether image-/text-to-video models and camera-controlled video models can correctly evolve world states under observation control. Check out our website, blog, and paper to see how they fail!


(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels.


Introducing Large Video Planner (LVP-14B) — a robot foundation model that actually generalizes. LVP is built on video generation, not VLA. In my final work at @MIT, all of LVP's eval tasks were proposed by third parties as a maximum stress test, and it excels!🤗 boyuan.space/large-video-pl…
