Ziqi Ma
@ziqi__ma
68 posts

PhD student @caltech, research intern @AIatMeta, previously @microsoft. https://t.co/S1J9LItcXw

Pasadena, CA · Joined May 2019
274 Following · 325 Followers
Pinned Tweet
Ziqi Ma@ziqi__ma·
Generative models shouldn’t just generate. They should be steerable by your commands. Meet Steer3D🕹️: edit generated 3D assets with text📝 in one forward pass. Trained on only 100k synthetic examples, it shows that we can make generative models responsive to signals from another modality🎛️. Check out: glab-caltech.github.io/steer3d/
8 replies · 55 reposts · 403 likes · 32.6K views
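The "one forward pass" claim is the interesting part: the edit is a feed-forward function of the asset and the text command, not a per-edit optimization loop. A minimal sketch of one way such cross-modal steering can be wired (a residual cross-attention head; the module names, dimensions, and wiring below are illustrative assumptions, not the actual Steer3D architecture):

```python
import torch
import torch.nn as nn

class EditHead(nn.Module):
    def __init__(self, latent_dim=512, text_dim=768, heads=8):
        super().__init__()
        # Cross-attention: 3D asset tokens (queries) attend to text tokens.
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, asset_tokens, text_tokens):
        # Residual edit: predict a delta on the asset latent from the command.
        delta, _ = self.attn(asset_tokens, text_tokens, text_tokens)
        return self.norm(asset_tokens + delta)

# One forward pass; the edited latent would then be decoded by the 3D generator.
asset = torch.randn(1, 1024, 512)    # tokens of a generated 3D asset (assumed shape)
command = torch.randn(1, 16, 768)    # embedded text command (assumed shape)
print(EditHead()(asset, command).shape)  # torch.Size([1, 1024, 512])
```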
Ziqi Ma@ziqi__ma·
@Haotianxue_GT Even with memory, static memory modules encourage remembering the scene “verbatim”, whereas processes should evolve (e.g. a rising water level) rather than always stay exactly as they were when last seen.
1 reply · 0 reposts · 0 likes · 39 views
Ziqi Ma@ziqi__ma·
Joint work with @JhanLiufu (co-1st) and @georgiagkioxari. This is our attempt to "distill" many fun, philosophical conversations about world models into something quantifiable. We are sharing this benchmark and our early thoughts in the hope of sparking more discussion on this topic. Let us know what you think!
0 replies · 0 reposts · 2 likes · 352 views
Ziqi Ma@ziqi__ma·
Today’s video world models “simulate” the world by generating pixel frame observations🖼️. Can they continue to simulate the world when observations are interrupted, such as by occlusion, illumination dimming, or camera lookaway? To probe this question, we release STEVO-Bench, which holistically evaluates whether image-/text-to-video models and camera-controlled video models can correctly evolve states under observation control. Check out our website, blog and paper for how they fail!
5 replies · 14 reposts · 87 likes · 8.6K views
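A toy, self-contained illustration of the failure mode the benchmark probes (also echoed in the water-level reply above): during an observation gap, a process should keep evolving rather than be replayed verbatim. The two stub "models" below are caricatures for illustration, not real video models:

```python
def true_water_level(t, rate=0.1):
    return rate * t  # the ground-truth process keeps evolving

def frozen_model(last_seen_level, last_seen_t, t):
    # Failure mode: replays the scene "verbatim" across the gap.
    return last_seen_level

def evolving_model(last_seen_level, last_seen_t, t, rate=0.1):
    # Desired behavior: evolves the state through the occlusion.
    return last_seen_level + rate * (t - last_seen_t)

t_occlude, t_reveal = 10, 30  # observation is interrupted in between
seen = true_water_level(t_occlude)
for name, model in [("frozen", frozen_model), ("evolving", evolving_model)]:
    err = abs(model(seen, t_occlude, t_reveal) - true_water_level(t_reveal))
    print(f"{name:8s} error at reveal: {err:.2f}")
# frozen   error at reveal: 2.00
# evolving error at reveal: 0.00
```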
Ziqi Ma retweeted
Aadarsh Sahoo@SahooAadarsh·
Perception is actionable. Humans don't just see objects, we see affordances and constraints. "Something to sit on." "Region unsafe to walk." "Something that will tip if I bump it." But today’s vision models mostly see… labels. So we built ConverSeg: Conversational Image Segmentation 🧵 glab-caltech.github.io/converseg/
7 replies · 21 reposts · 95 likes · 12.6K views
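The shift described here is in the query interface: from fixed class labels to free-form affordance and constraint phrases. A hypothetical interface sketch (the class and method names are assumptions; the real ConverSeg API may look entirely different):

```python
import numpy as np

class ConversationalSegmenter:
    """Hypothetical interface: mask the pixels answering a free-form query."""

    def segment(self, image: np.ndarray, query: str) -> np.ndarray:
        # Stub: a real model would ground the phrase with a VLM and return
        # the matching region; here we just return an empty boolean mask.
        h, w = image.shape[:2]
        return np.zeros((h, w), dtype=bool)

img = np.zeros((480, 640, 3), dtype=np.uint8)
seg = ConversationalSegmenter()
for q in ["something to sit on", "region unsafe to walk",
          "something that will tip if I bump it"]:
    print(q, "->", seg.segment(img, q).sum(), "masked pixels")
```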
Ziqi Ma retweeted
Damiano Marsili@marsilidamiano·
Our paper, VALOR, got accepted at #ICLR2026! We explore improving visual reasoning using multimodal verifiers - all without any ground truth annotations! More details below 👇 Excited to see everyone in Rio!
Damiano Marsili@marsilidamiano

(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels

2 replies · 6 reposts · 29 likes · 4.8K views
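The core idea in the quoted thread is to swap human annotations for a verifier's score as the training signal. A minimal REINFORCE-style sketch of that substitution, with toy stand-ins for the policy and the multimodal verifier (this is not VALOR's actual framework, just the shape of the idea):

```python
import torch
import torch.nn.functional as F

def verifier_score(answers: torch.Tensor) -> torch.Tensor:
    # Stand-in "multimodal verifier": returns a scalar reward per sample.
    return torch.sigmoid(answers.sum(dim=-1))

policy = torch.nn.Linear(16, 4)               # toy visual "reasoner"
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

x = torch.randn(8, 16)                        # toy visual features
dist = torch.distributions.Categorical(logits=policy(x))
action = dist.sample()                        # sampled "answer"
reward = verifier_score(F.one_hot(action, 4).float())
# REINFORCE update: the only supervision is the verifier's reward;
# no ground-truth annotation appears anywhere in the loop.
loss = -(dist.log_prob(action) * (reward - reward.mean())).mean()
opt.zero_grad(); loss.backward(); opt.step()
```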
Ziqi Ma retweeted
Jiacheng Liu@liujc1998·
Calling on behalf of infini-gram: does anyone know where I can get / apply for AWS credits? 💸💸 Keeping infini-gram alive costs quite some money, mostly SSD rental. If you're a fan of keeping open LLM training data readily inspectable, please reply / DM me some pointers! 🧵1/4
[image attached]
3 replies · 15 reposts · 24 likes · 3.2K views
Ziqi Ma retweeted
Raphi Kang@RaphiKang·
🤓 How do LVLMs/LMMs reason about space and time? This was the central question of our #ICLR2026 paper, “Linear Mechanisms For Spatiotemporal Reasoning In Vision Language Models”. I’m very excited to finally share it :D 🥳🥳 A thread: [1/7]
[image attached]
2 replies · 12 reposts · 63 likes · 3.5K views
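The tweet doesn't spell out the method, but "linear mechanisms" in interpretability work typically means relations that are linearly decodable from (or steerable in) hidden states. A generic linear-probe sketch on stand-in activations, showing what that claim usually cashes out to (not this paper's specific recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 256))       # stand-in hidden states from a VLM
w_true = rng.normal(size=256)
y = (H @ w_true > 0).astype(int)       # stand-in relation label, e.g. "left of"

probe = LogisticRegression(max_iter=1000).fit(H[:800], y[:800])
print("probe accuracy:", probe.score(H[800:], y[800:]))
# High held-out accuracy => the relation is linearly decodable from the states.
```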
Hyper3D by Deemos@DeemosTech·
💥🍌3D Nano Banana just dropped! ✨We just launched #Rodin Gen-2 "Edit", upload ANY model and edit like magic: 1️⃣ Smart Low-poly → artist-style topology 2️⃣ Local edits via prompt (Beta) 3️⃣ BANG to Parts 4️⃣+ more #Hyper3D is now the FIRST true #3D GenAI editing platform! 🚀
27 replies · 107 reposts · 673 likes · 126.8K views
Ziqi Ma retweeted
Julius Berner@julberner·
🚀🎬We introduce TMD (Transition Matching Distillation): 480p videos generated from text prompts in < 3 NFEs! 1️⃣Main backbone for feature extraction and lightweight head for iterative refinement 2️⃣Distilled from Wan2.1 14B T2V combining MeanFlow & DMD2 🔗research.nvidia.com/labs/genair/tmd
3 replies · 17 reposts · 64 likes · 13.5K views
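The architectural split in point 1 (one heavy backbone pass for features, a lightweight head iterated a few times) is what keeps the NFE count low. A shape-level sketch under that reading; the modules and wiring are assumptions, not the actual TMD design:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(64, 64)    # stand-in for the large feature backbone
head = nn.Linear(64 + 64, 64)   # stand-in lightweight refinement head

@torch.no_grad()
def sample(z, cond, steps=2):
    feats = backbone(cond)      # one expensive pass for features
    for _ in range(steps):      # a few cheap refinement steps
        z = z + head(torch.cat([z, feats], dim=-1))
    return z

z0 = torch.randn(1, 64)               # initial noise latent (toy-sized)
out = sample(z0, torch.randn(1, 64))  # conditioning, e.g. a text embedding
print(out.shape)  # torch.Size([1, 64])
```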
Ziqi Ma@ziqi__ma·
Really impressive generalization! The video gen formulation unlocks large-scale, diverse human video data. Amazing to see this working on a foundation-model scale!
Boyuan Chen@BoyuanChen0

Introducing Large Video Planner (LVP-14B) — a robot foundation model that actually generalizes. LVP is built on video gen, not VLA. As my final work at @MIT, LVP has all its eval tasks proposed by third parties as a maximum stress test, but it excels!🤗 boyuan.space/large-video-pl…

0 replies · 0 reposts · 1 like · 221 views
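"Built on video gen, not VLA" usually means the plan lives in pixel/feature space and actions are recovered afterwards, e.g. by an inverse-dynamics model. A hedged sketch of that recipe with stand-in components (not LVP's actual design):

```python
import torch
import torch.nn as nn

# Stand-in planner: "generates" 16 frames of 128-dim visual features.
def video_planner(goal: str) -> torch.Tensor:
    return torch.randn(16, 128)

inv_dynamics = nn.Linear(2 * 128, 7)  # (o_t, o_t+1) -> 7-DoF action

plan = video_planner("put the mug in the sink")
actions = [inv_dynamics(torch.cat([plan[t], plan[t + 1]]))
           for t in range(len(plan) - 1)]
print(len(actions), actions[0].shape)  # 15 torch.Size([7])
```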
Ziqi Ma retweeted
Damiano Marsili@marsilidamiano·
(1/6) Do these images show the same vacuum cleaner? They are certainly similar, but a human will notice the differences in dustbin geometry, design, and color accents. In contrast, open-source VLMs struggle at this task. Our recent work TWIN poses the question: Can we fix this?
[image attached]
1 reply · 9 reposts · 24 likes · 9.2K views
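One way to operationalize the question TWIN poses is same-instance verification over an image pair. A hypothetical sketch with a stub VLM (the paper's actual evaluation protocol may differ):

```python
class StubVLM:
    def ask(self, images, prompt):
        return "no"  # stand-in; a real VLM would reason over both images

def same_instance(vlm, image_a, image_b) -> bool:
    prompt = ("Do these two photos show the exact same product instance? "
              "Attend to part geometry, design, and color accents. "
              "Answer yes or no.")
    answer = vlm.ask(images=[image_a, image_b], prompt=prompt)
    return answer.strip().lower().startswith("yes")

print(same_instance(StubVLM(), "img_a.jpg", "img_b.jpg"))  # False
```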
Ziqi Ma@ziqi__ma·
@Hongyu_Lii Definitely! I lean towards latent for more generality, but input action mapping (at inference time) might still be tricky, especially if you want to keep embodiment flexible while maintaining physical precision.
1 reply · 0 reposts · 1 like · 52 views
Hongyu Li@Hongyu_Lii·
Great observations on action representations! Since the underlying principle is physics, we see attempts to capture it either explicitly (e.g., particle models) or implicitly (e.g., latent actions like LAPA). I feel it remains an open question which approach is the ultimate answer.
1 reply · 0 reposts · 3 likes · 278 views
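For reference, the "implicit" option discussed in this exchange (LAPA-style latent actions) is learnable without any action labels: an inverse model infers a low-dimensional action from consecutive observations, and a forward model is trained to reproduce the transition from it. A minimal sketch, illustrative only:

```python
import torch
import torch.nn as nn

D = 128  # toy observation-feature size
inverse = nn.Linear(2 * D, 8)    # (o_t, o_t+1) -> 8-dim latent action
dynamics = nn.Linear(D + 8, D)   # (o_t, a_t)   -> predicted o_t+1

o_t, o_next = torch.randn(32, D), torch.randn(32, D)
a = inverse(torch.cat([o_t, o_next], dim=-1))
pred = dynamics(torch.cat([o_t, a], dim=-1))
# No action labels anywhere: the latent action is whatever best explains
# the observed transition, which is what makes it embodiment-flexible.
loss = ((pred - o_next) ** 2).mean()
loss.backward()
print(loss.item())
```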
Ziqi Ma@ziqi__ma·
It was incredibly fun to write about world models and Tolstoy in the same blog post :) My new blog: “Rodney Brooks, Tolstoy, and World Models” 👉ziqi-ma.github.io/blog/2025/tols… Check it out as a light holiday read🎄☕️
1 reply · 3 reposts · 12 likes · 2.3K views
Ziqi Ma@ziqi__ma·
@JobyOtero Thanks! Exactly - we should build models that are more interactive for users and better integrated with different types of control!
0 replies · 0 reposts · 1 like · 208 views
Joby Otero@JobyOtero·
@ziqi__ma Nice work! Fwiw, what I’d love to see as someone who’s been doing 3D for 40+ yrs: AI that lets me model/texture/etc. in rough or partial form, and the AI fills in or completes. Ideally with sketches, text, or other content as guides for the AI.
1 reply · 0 reposts · 4 likes · 220 views