Jikai Wang
9 posts

🤖 Robots don't fail in the lab. They fail in the wild — clutter, occlusion, constantly changing environments. The real question: Can robots learn directly from these failures during deployment? How about teaching robots the way we'd teach a child — by showing them where they went wrong? 🧵👇

COLMAP 4.0 was recently released, which inspired me to dig into it and its new capabilities with @rerundotio. I want to really understand how COLMAP, and in particular pycolmap, works beyond just calling it via the CLI. So my goal is to use the low-level pycolmap API to log every part of the pipeline.

The explicit goal is to have an alternative to the SQLite database. Instead of SQLite, I want to try logging everything directly to rerun and use RRD. This gives me deep inspectability while still saving the features/matches/two-view geometry, and I can view it all directly in rerun. I think this is one of the superpowers rerun provides: data and visualizations are deeply integrated.

As I'm often working with sequential data (videos), I'm going to specifically focus on four things:

1. Monocular Video Simple: calls high-level APIs such as pycolmap.extract_features, pycolmap.match_sequential, and pycolmap.incremental_mapping. These are basically identical to the CLI options and provide a good baseline (sketched below).
2. Monocular Video Streamed: takes the above high-level APIs and breaks them down into their iterator versions, logging each component in a streamed manner. This way, I can stream the intermediate features to rerun while the extraction/matching/mapping is happening.
3. Rig with unknown calibration (this is what the video shows): probably the most interesting version and the first one I've been working on. It lets you define a rig between known sensors, such as in VR/AR devices, leading to much better reconstructions with multiple cameras. Since we don't know the calibration a priori, we have to run the reconstruction twice: once as a normal COLMAP reconstruction with no rig constraints, use that to generate the constraints, and then run it again with the newly found rig.
4. Rig with known calibration: this is the RoboCap example, where we have a pre-calibrated set of sensors, so we don't need the two reconstructions and also gain better matching between cameras, both spatially and temporally. Again, this leads to a much better reconstruction!

Along with all this, GLOMAP has become a first-class global mapper, making it super easy to use directly within pycolmap! I'm excited to do more with this and compare it to things like pycuvslam, vipe, and other alternatives.
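
For reference, here's a minimal sketch of the "Monocular Video Simple" variant: the three high-level pycolmap calls named above, with the sparse result logged to a rerun .rrd instead of relying on SQLite alone. This is my own illustration, not the author's actual code; the file paths and entity names are made up, and exact signatures may differ between pycolmap/rerun-sdk versions.

```python
# Minimal sketch (assumed paths/names; APIs per pycolmap and rerun-sdk docs).
import numpy as np
import pycolmap
import rerun as rr

database_path, image_dir, output_dir = "database.db", "images/", "sparse/"

rr.init("colmap_monocular_simple")
rr.save("reconstruction.rrd")  # everything logged below is written to this RRD

# The three high-level calls, basically mirroring the CLI defaults.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_sequential(database_path)
reconstructions = pycolmap.incremental_mapping(database_path, image_dir, output_dir)

# Log the sparse point cloud of the first reconstruction to rerun.
rec = reconstructions[0]
xyz = np.array([p.xyz for p in rec.points3D.values()])
rgb = np.array([p.color for p in rec.points3D.values()])
rr.log("sparse/points", rr.Points3D(xyz, colors=rgb))
```

The streamed variant (item 2) would replace the three one-shot calls with their iterator forms and log inside the loop, so features and matches show up in the viewer while the pipeline is still running.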

Reality of robotics: humanoid kung fu is solved before robots can open doors with RGB. Here we are. Introducing the frontier of sim2real at NVIDIA GEAR. 100% sim data. RGB input only. Code name: 𝗗𝗼𝗼𝗿𝗠𝗮𝗻. We are opening the sim-to-real door. doorman-humanoid.github.io 🧵

Jikai Wang @JwRobotics (jwroboticsvision.github.io), who led our HO-Cap project, will be graduating next year. He’s looking for full-time roles in industry. If your team needs an expert in 3D hand-object interaction and robot simulation, please reach out to Jikai!

Announcing Kaputt: a large-scale dataset for visual defect detection in retail logistics with 238,421 images across 48,376 unique items – 40x as large as current benchmarks:

If you're not labeling your own data, you're NGMI. I take this seriously, so I finished building the first version of my hand-tracking annotation app using @rerundotio and @Gradio. The combination of Rerun's callback system and Gradio integration enables a highly customizable and powerful labeling app. It supports multiple views, 2D and 3D, and keeps them time-synchronized!

The only input required is a zip file containing two or more multiview MP4 files; I handle everything else automatically. The app works with both egocentric (first-person) and exocentric (third-person) videos. Networks will occasionally make mistakes, so having the ability to correct them manually is crucial. This is a significant step towards robust and powerful hand tracking, which will provide excellent training data for dexterous robot manipulation.

The next step involves leveraging Rerun's recent updates, particularly the multi-sink support. Changes are saved directly to a file in .rrd format, which is easy to extract from since the underlying representation is PyArrow: it converts straight to Pandas, Polars, or DuckDB (sketched below). This tight integration between visuals, predictions, and data is crucial to ensure your data is precisely what you expect it to be.
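
To illustrate that last point, here's a minimal sketch of pulling saved annotations back out of an .rrd as a table via rerun's dataframe API (rerun-sdk 0.19+). The file name "annotations.rrd", the "frame" timeline, and the "/hands/**" entity path are hypothetical stand-ins for whatever the app actually logs.

```python
# Minimal sketch (illustrative names, not the app's actual code): read a saved
# .rrd recording back as tabular data; rerun stores it as PyArrow underneath.
import rerun as rr

recording = rr.dataframe.load_recording("annotations.rrd")

# Select the hand-annotation entities, indexed by the "frame" timeline.
view = recording.view(index="frame", contents="/hands/**")
reader = view.select()       # yields a pyarrow.RecordBatchReader

df = reader.read_pandas()    # Arrow -> Pandas; Polars/DuckDB ingest Arrow the same way
print(df.head())
```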
