Daniel DeTone

513 posts

@ddetone

Deep Nets and Geometry — what could go wrong?

Long Beach, CA · Joined June 2009

675 Following · 2.2K Followers

Daniel DeTone@ddetone·6d

@weikaih04 @allen_ai @uwcse @AIatMeta Very cool! Is this conditioning on the semi dense depth point cloud or RGB only?

Weikai Huang@weikaih04·19 May

We updated @allen_ai @uwcse 's WildDet3D for end-to-end 3D localization on Project Aria using the great codebase from @ddetone @AIatMeta's Boxer. Same AriaLoader, same CSV schema, same view_fusion / view_tracker - try WildDet3D in Proj Aria and do world coordinates-based detection! Code: github.com/allenai/WildDe…

Daniel DeTone@ddetone·26 Apr

The biggest impact of the Segment Anything line of work was not the actual image segmentation, but rather the flood of paper titles with the name “Any” in them. Cmon folks, let’s just call this generalization and move on!!

Daniel DeTone@ddetone·26 Apr

@MattNiessner I agree that metric 3D is critical! It’s a compressed, minimal representation. Playing devil’s advocate: humans also operate on projections of the 3D world, and we are able to operate pretty well

Matthias Niessner@MattNiessner·26 Apr

Large foundation models have made enormous progress in modeling language, images, and video. These systems can generate highly realistic outputs and capture complex statistical structure in data. However, they still operate on projections of the world, text sequences and 2D pixel grids, rather than the world itself.

The real world is not a sequence of text tokens or frames; the real world is inherently anchored in 3D metric space, and dynamics across time. Objects occupy space and persist over time. They interact according to physical laws. Any model that aims to support real-world intelligence, e.g., for robotics, simulation, design, or spatial computing, must capture this structure.

This is where current approaches fall short. While most video models can generate visually plausible frames, they often lack a consistent notion of the underlying scene due to limited context windows. As a result, geometry drifts, scale is ambiguous, objects appear and disappear, and interactions are not physically grounded. The model produces superficial appearance without a persistent world representation. For many downstream applications, this is not enough.

The first step toward addressing this is modeling 3D space and keeping it consistent. A model should recover a coherent spatial representation of the scene, including layout, geometry, and scale. This not only allows the environment to be rendered from new viewpoints but also, more critically, reasoned about in metric space. If a model cannot produce a stable 3D representation, it is not grounded in the physical world, and it will fail to model the world due to its inefficient contextual memory.

However, 3D is only the beginning. A truly useful world model must also be temporally and physically consistent. It should not only reconstruct a scene, but also simulate it, predicting how it evolves, how objects interact, and what happens under intervention. Eventually this requires moving beyond static representations toward models that capture dynamics and causality.

I believe that generative approaches are highly compelling in this context, as they can be trained on large-scale data in a self-supervised fashion. In particular, comprehensive 3D world modeling is a highly-promising path forward, since richer environmental representations directly enable deeper and more effective learning of physical reality. Crucially, such generation enforces consistency: for instance, to generate a scene across viewpoints, a model must implicitly recover its underlying 3D structure. To generate it over time, it must capture its dynamics. This forces the model to internalize the latent state of the world, including geometry, scale, materials, motion, and physical behavior.

This also highlights a limitation of purely abstract representations. High-level embeddings or action-centric models can be effective for specific tasks, but without the ability to model and simulate the world, they will eventually remain incomplete. They compress observations, but do not fully model the underlying process that generates them.

The next generation of AI systems should therefore move beyond text and pixels, and toward physically-grounded world models: models that represent space, maintain consistency over time, and enable simulation and interaction. This is the missing layer between the physical and digital world, which will ultimately enable AI systems not just to observe the world, but to understand and operate within it.

Daniel DeTone@ddetone·23 Apr

New blog post about Boxer is live on the Project Aria website

Project Aria @Meta@meta_aria

How do you decompose a 2D image into accurate 3D object detections? You use🥊Boxer. A new model from Reality Labs Research enables robust 3D object detection by "lifting" 2D proposals from off-the-shelf detectors like OWL-ViT and SAM into metric 3D space. No more "flat" AI—this is about spatial intelligence for the next generation of wearables. Blog🔗 projectaria.com/news/introduci… Website with links to download: facebookresearch.github.io/boxer/ 👉@ddetone

Daniel DeTone@ddetone·12 Apr

@neural_avb I was there in Barcelona! Epic

AVB@neural_avb·11 Apr

My favourite piece of Schmidhuber lore is when he challenged Ian Goodfellow during a NIPS presentation on GANs, live and in public. Deep learning drama peaked here. You have seen nothing like this.

Yuntian Deng@yuntiandeng

Glad to see followups to neural-os.com, but disappointed that neither the blog (with 34 refs) nor the code repo acknowledged NeuralOS, even tho the released data code appears to build directly on top of ours. That omission is hard to understand given our shared vision.

Daniel DeTone@ddetone·11 Apr

@Capsbrr Ah sorry, I thought you meant on Quest cameras, not running the ML model on Quest hardware. I don’t think this model can run in realtime on Quest. Though it could probably be distilled significantly with further effort and maybe work

Carlos Pinheiro@Capsbrr·11 Apr

@ddetone How does Boxer run in real-time on the Quest 3, when it takes 20 ms to run on a *RTX 4090*? The hardware specs are worlds apart. I'm genuinely curious.

Daniel DeTone@ddetone·8 Apr

Today we release Boxer, a new lightweight approach that lifts open-world 2D bounding boxes to *metric* 3D: facebookresearch.github.io/boxer/ Here we show Boxer in action on an egocentric sequence captured from smart glasses:

Daniel DeTone@ddetone·11 Apr

@Capsbrr yes

Carlos Pinheiro@Capsbrr·10 Apr

@ddetone I would also be curious: does it run in real-time on the Meta Quest 3?

Daniel DeTone@ddetone·11 Apr

@weikaih04 that would be great! we didn't train on much outdoor data, so I would expect a big boost outdoors from training on the WildDet3D dataset

Weikai Huang@weikaih04·11 Apr

Just checked out the Boxer model: the latency (20–40ms) and the generalization are also pretty impressive. Huge congrats to Daniel and the other authors! If you're interested in open-world 3D detection for outdoor/in-the-wild scenarios, also check out our WildDet3D 👇 github.com/allenai/WildDe…. Thinking about training Boxer with WildDet3D data to do 30fps in-the-wild 3D tracking.

Daniel DeTone@ddetone·11 Apr

Cool showcase from @_satyam_ai running Boxer on RGB video using COLMAP for poses + pointcloud and GeoCalib for gravity estimation

Satyam Kumar@_satyam_ai

I implemented it and the ~8 degree gravity correction from GeoCalib made a real difference. Look at the monitor - on the left (pose heuristic) the box is tilted and doesn't match the screen edges, on the right (GeoCalib) it wraps the monitor much more tightly. The shelf boxes at the top are also cleaner, less overshoot. Yeah, the improvement is clear.

Daniel DeTone@ddetone·10 Apr

@_satyam_ai @pesarlin GeoCalib looking solid 💪

Daniel DeTone@ddetone·10 Apr

@_satyam_ai Amazing!

Satyam Kumar@_satyam_ai·10 Apr

Meta recently open-sourced Boxer, a model that lifts 2D bounding boxes into 3D oriented bounding boxes (OBBs) for scene understanding. The catch? It was designed for Aria AR glasses, not regular cameras. So I built a pipeline to make it work with any phone video. The hard part: Boxer expects gravity from Aria's IMU. COLMAP doesn't know "up" from "down." Had to estimate gravity from camera poses and rotate the entire reconstruction. @Meta #ComputerVision #3DReconstruction #MetaAI #SceneUnderstanding

Daniel DeTone@ddetone·10 Apr

@_satyam_ai the gravity estimate looks a little bit off. another idea could be to run this per frame and take the global 3D average: github.com/cvg/GeoCalib
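
The "run this per frame and take the global 3D average" idea above can be sketched roughly as follows. This is a minimal illustration, not GeoCalib's or Boxer's API: it assumes each frame already has a gravity direction in camera coordinates (for example from a per-frame estimator) and a camera-to-world rotation. COLMAP stores world-to-camera rotations, so those would need to be transposed first. The function name, the input layout, and the plain normalized-sum average are all hypothetical choices.

```python
import math


def average_gravity_world(gravity_cam, rotations_cam_to_world):
    """Average per-frame gravity directions in a shared world frame.

    gravity_cam: list of 3-vectors, one gravity direction per camera frame
        (hypothetical per-frame estimates; need not be unit length).
    rotations_cam_to_world: list of 3x3 row-major rotation matrices.
    Returns a unit 3-vector in world coordinates.
    """
    total = [0.0, 0.0, 0.0]
    for g, R in zip(gravity_cam, rotations_cam_to_world):
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            continue  # skip degenerate estimates
        # Rotate the normalized camera-frame direction into the world frame
        # and accumulate it.
        for i in range(3):
            total[i] += sum(R[i][j] * g[j] / norm for j in range(3))
    length = math.sqrt(sum(c * c for c in total))
    if length == 0.0:
        raise ValueError("gravity estimates are empty or cancel out")
    # Renormalize the summed direction to get the mean direction.
    return [c / length for c in total]
```

Summing unit vectors and renormalizing is a simple mean direction; a median or outlier rejection step may be more robust if some frames give poor estimates.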

Satyam Kumar@_satyam_ai·10 Apr

@ddetone Yes, feeding in the COLMAP sparse point cloud as depth input along with RGB frames. For the example video, it had around 14K points.

Daniel DeTone@ddetone·9 Apr

@yesitsarmin yes, the main limitation is the 2D detector here, but there are tons of better models (SAM3, VLMs) if you have the compute. for very cluttered scenes it doesn't work as well

𓅋 𐎫𐎤𐎶𐏀 ‎ﷺ@yesitsarmin·9 Apr

@ddetone this is very cool, can it work for any arbitrary objects, like stuff on a table, and can it work with stereo camera?

Daniel DeTone@ddetone·9 Apr

@ElioenaiSiqCst Yes, I didn't show any examples of that but we trained on a massive internal-only Quest3 dataset

Elioenai Siqueira Costa@ElioenaiSiqCst·9 Apr

@ddetone Works In meta quest?

Daniel DeTone@ddetone·9 Apr

@BlueAquilae great question! I would not expect it to work well here, we would need to re-train it with a full 9 DoF representation. but feel free to try it out anyway, I'd be curious

Daniel DeTone@ddetone·9 Apr

@CleverBetTips The National?

CleverBet@CleverBetTips·9 Apr

@ddetone Your name is competing with one of the great albums of the last 30 years 😀

Daniel DeTone@ddetone·9 Apr

@haodongli00 One limitation I found using both of those models is the runtime. For detecting 1000+ text prompts with SAM3 it takes 20+ sec per image. SAM3D also takes ~15 sec per object, so running on large datasets can be expensive. OWLv2 runs at ~30ms and Boxer takes ~20ms

Haodong Li@haodongli00·9 Apr

@ddetone Great work! Also doing something very similar, using SAM3, SAM3D and many other powerful tools! 😎

Daniel DeTone@ddetone·8 Apr

@nickkarpov Feel free to file a GitHub issue if you have any problems! Will do my best to answer them quickly

Nick Karpov@nickkarpov·8 Apr

@ddetone Awesome work, going to use this

Daniel DeTone@ddetone·8 Apr

For more details, check out the arxiv paper here: arxiv.org/abs/2604.05212

Daniel DeTone@ddetone·8 Apr

BoxerNet runs FAST 🔥🔥, taking roughly 20ms on a 4090 with bfloat16 for ALL prompts in an image (e.g. 30 boxes in parallel)
