Junyi Zhang

206 posts

@junyi42

CS Ph.D. Student @Berkeley_AI. B.Eng. @SJTU1896 CS. Previously with @GoogleDeepMind, @MSFTResearch. Vision, generative models, robotics.

Joined July 2022
540 Following · 2.7K Followers
Pinned Tweet
Junyi Zhang @junyi42
𝗢𝗻𝗲 𝗺𝗲𝗺𝗼𝗿𝘆 𝗰𝗮𝗻’𝘁 𝗿𝘂𝗹𝗲 𝘁𝗵𝗲𝗺 𝗮𝗹𝗹. We present 𝗟𝗼𝗚𝗲𝗥, a new 𝗵𝘆𝗯𝗿𝗶𝗱 𝗺𝗲𝗺𝗼𝗿𝘆 architecture for long-context geometric reconstruction. LoGeR enables stable reconstruction over up to 𝟭𝟬𝗸 𝗳𝗿𝗮𝗺𝗲𝘀 / 𝗸𝗶𝗹𝗼𝗺𝗲𝘁𝗲𝗿 𝘀𝗰𝗮𝗹𝗲, with 𝗹𝗶𝗻𝗲𝗮𝗿-𝘁𝗶𝗺𝗲 𝘀𝗰𝗮𝗹𝗶𝗻𝗴 in sequence length, 𝗳𝘂𝗹𝗹𝘆 𝗳𝗲𝗲𝗱𝗳𝗼𝗿𝘄𝗮𝗿𝗱 inference, and 𝗻𝗼 𝗽𝗼𝘀𝘁-𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻. Yet it matches or surpasses strong optimization-based pipelines. (1/5) @GoogleDeepMind @Berkeley_AI
64 replies · 446 retweets · 3.4K likes · 559.1K views
Junyi Zhang retweeted
Jianyuan @jianyuan_wang
Introducing VGGT-Ω: scaling feed-forward reconstruction across static and dynamic scenes, and studying whether the learned geometric representations transfer beyond reconstruction.
13 replies · 142 retweets · 1K likes · 755K views
Junyi Zhang retweeted
Jiawei Yang @JiaweiYang118
Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.
55 replies · 156 retweets · 922 likes · 212.2K views
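For readers unfamiliar with the metric, the standard definition of the Fréchet Inception Distance (FID) is reproduced below. The tweet does not spell out how FD-loss is constructed, so this is only the usual metric, not the authors' loss. The symbols are the conventional ones and are assumptions of this note: \mu_r, \Sigma_r are the mean and covariance of Inception features of real images, and \mu_g, \Sigma_g are those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Every term is a differentiable function of the feature statistics, which is presumably what makes the question of optimizing it directly a natural one.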
Junyi Zhang retweeted
Zirui "Colin" Wang @zwcolin
🏆 Our VisGym just got the ✨best paper award✨ at the multimodal intelligence workshop in ICLR :)
4 replies · 7 retweets · 66 likes · 4.3K views
Junyi Zhang retweeted
Neerja Thakkar @neerjathakkar
What’s the right representation for a world model? 3D, pixels, or something else? Excited to release our new paper “Forecasting Motion in the Wild”, where we propose point tracks as tokens for generating complex non-rigid motion and behavior. From @GoogleDeepmind @Berkeley_AI @TTIC_Connect
7 replies · 74 retweets · 465 likes · 78.4K views
Junyi Zhang retweeted
Max Fu @letian_fu
Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵
20 replies · 127 retweets · 632 likes · 165.5K views
Songyou Peng @songyoupeng
Huge thanks for the acknowledgment! It is really an honor to receive an honorable mention twice at @3DVconf, the first being a best paper honorable mention 2 years ago :) Also, big congrats to my dear colleague and friend @Mi_Niemeyer on the award, so well deserved!
International Conference on 3D Vision (3DV) @3DVconf

3DV Outstanding Doctoral Dissertation Award Honorable Mention goes to Songyou Peng! @songyoupeng Thesis title: "Neural Scene Representations for 3D Reconstruction and Scene Understanding" #3DV2026

18 replies · 4 retweets · 141 likes · 10K views
Junyi Zhang retweeted
Haocheng Xi @HaochengXiUCB
𝗞-𝗺𝗲𝗮𝗻𝘀 𝗶𝘀 𝘀𝗶𝗺𝗽𝗹𝗲. 𝗠𝗮𝗸𝗶𝗻𝗴 𝗶𝘁 𝗳𝗮𝘀𝘁 𝗼𝗻 𝗚𝗣𝗨𝘀 𝗶𝘀𝗻’𝘁. That’s why we built Flash-KMeans — an IO-aware implementation of exact k-means that rethinks the algorithm around modern GPU bottlenecks. By attacking the memory bottlenecks directly, Flash-KMeans achieves 30x speedup over cuML and 200x speedup over FAISS — with the same exact algorithm, just engineered for today’s hardware. At the million-scale, Flash-KMeans can complete a k-means iteration in milliseconds. A classic algorithm — redesigned for modern GPUs. Paper: arxiv.org/abs/2603.09229 Code: github.com/svg-project/fl…
36 replies · 201 retweets · 1.8K likes · 306.8K views
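As background for the tweet above, here is a minimal sketch of one iteration of exact (Lloyd's) k-means, the algorithm that Flash-KMeans is described as accelerating. It is the textbook procedure in plain Python, not the Flash-KMeans implementation or its API; the function names `squared_distance` and `kmeans_iteration` are hypothetical, as is the choice to leave an empty cluster's centroid unchanged.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_iteration(
    points: Sequence[Point], centroids: Sequence[Point]
) -> Tuple[List[int], List[List[float]]]:
    """Run one exact Lloyd iteration: assign points, then recompute centroids."""
    k = len(centroids)
    dim = len(centroids[0])

    # Assignment step: each point goes to its nearest centroid.
    assignments = [
        min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

    # Update step: each centroid becomes the mean of its assigned points.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, assignments):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]

    new_centroids = []
    for j in range(k):
        if counts[j] == 0:
            # Assumption of this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(list(centroids[j]))
        else:
            new_centroids.append([s / counts[j] for s in sums[j]])
    return assignments, new_centroids
```

The assignment step reads every point against every centroid, which hints at why memory traffic, rather than arithmetic, can dominate on a GPU.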
Junyi Zhang retweeted
Junyi Zhang @junyi42
LoGeR breaks both walls with 𝗰𝗵𝘂𝗻𝗸-𝘄𝗶𝘀𝗲 𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 + 𝗵𝘆𝗯𝗿𝗶𝗱 𝗺𝗲𝗺𝗼𝗿𝘆: 🔹 Local Memory (SWA): non-parametric, lossless sliding-window attention preserves high-fidelity adjacent alignment. 🔹 Global Memory (TTT): compressed fast weights propagate long-range structure and stabilize scale over kilometer-scale trajectories.
1 reply · 0 retweets · 54 likes · 11K views
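To illustrate the local-memory idea in the tweet above, the sketch below builds a sliding-window attention mask over chunks and counts how many (query, key) pairs it allows. It is not LoGeR's code: the function names are hypothetical, and the assumption that each chunk attends only to itself and the `window` chunks before it is a choice made for this example. The TTT global memory is not modelled here.

```python
from typing import List


def sliding_window_mask(num_chunks: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when chunk i may attend to chunk j.

    Assumption of this sketch: chunk i sees itself and up to `window`
    preceding chunks, and nothing further back or ahead.
    """
    return [
        [0 <= i - j <= window for j in range(num_chunks)]
        for i in range(num_chunks)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the allowed (query, key) pairs in the mask."""
    return sum(sum(row) for row in mask)


# The dense mask is materialized only for illustration; a practical
# implementation would iterate over the window directly.
small = sliding_window_mask(4, 1)
large = sliding_window_mask(100, 1)
```

With a fixed window, the number of allowed pairs grows in proportion to the number of chunks rather than its square, which is consistent with the linear-time scaling claimed for the full model.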
Junyi Zhang @junyi42
Very excited to share our year-long work with an amazing team, @zwcolin, @aomaru_21490, and all! Everything is open-sourced (code, benchmark, trajectories, datasets, and models) at VisGym.github.io. We hope this can be a step towards developing general-purpose VLM agents.
1 reply · 0 retweets · 5 likes · 574 views
Junyi Zhang @junyi42
Luckily, the same environment we designed for evaluation can also be used for training, as a diverse, customizable, and scalable data source. The paper includes an early exploration of how to generate more effective SFT data, and we believe more potential lies ahead (e.g., with RL). 🧵
1 reply · 0 retweets · 3 likes · 528 views
Junyi Zhang @junyi42
We rigorously test VLM agents across diverse domains: symbolic, 2D, 3D, and embodied. Even frontier models face key gaps on the way to general VLM agents: 1. Memory: more history context hurts performance. 2. Perception: still a major bottleneck. 3. Partial observation is very hard. 🧵
Zirui "Colin" Wang @zwcolin

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490) 🌐 With 17 environments across multiple domains, we show systematically the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]

1 reply · 4 retweets · 37 likes · 3.3K views