Daniel DeTone

513 posts

@ddetone

Deep Nets and Geometry — what could go wrong?

Long Beach, CA · Joined June 2009
675 Following · 2.2K Followers
Weikai Huang
Weikai Huang@weikaih04·
We updated @allen_ai @uwcse's WildDet3D for end-to-end 3D localization on Project Aria using the great codebase from @ddetone @AIatMeta's Boxer. Same AriaLoader, same CSV schema, same view_fusion / view_tracker. Try WildDet3D on Project Aria for world-coordinate-based detection! Code: github.com/allenai/WildDe…
1
5
44
28K
Daniel DeTone
Daniel DeTone@ddetone·
The biggest impact of the Segment Anything line of work was not the actual image segmentation, but rather the flood of paper titles with the word “Any” in them. C'mon folks, let’s just call this generalization and move on!!
0
0
6
520
Daniel DeTone
Daniel DeTone@ddetone·
@MattNiessner I agree that metric 3D is critical! It’s a compressed, minimal representation. Playing devil's advocate: humans also operate on projections of the 3D world, and we manage pretty well
1
0
9
661
Matthias Niessner
Matthias Niessner@MattNiessner·
Large foundation models have made enormous progress in modeling language, images, and video. These systems can generate highly realistic outputs and capture complex statistical structure in data. However, they still operate on projections of the world, text sequences and 2D pixel grids, rather than the world itself. The real world is not a sequence of text tokens or frames; the real world is inherently anchored in 3D metric space, and dynamics across time. Objects occupy space and persist over time. They interact according to physical laws. Any model that aims to support real-world intelligence, e.g., for robotics, simulation, design, or spatial computing, must capture this structure.

This is where current approaches fall short. While most video models can generate visually plausible frames, they often lack a consistent notion of the underlying scene due to limited context windows. As a result, geometry drifts, scale is ambiguous, objects appear and disappear, and interactions are not physically grounded. The model produces superficial appearance without a persistent world representation. For many downstream applications, this is not enough.

The first step toward addressing this is modeling 3D space and keeping it consistent. A model should recover a coherent spatial representation of the scene, including layout, geometry, and scale. This not only allows the environment to be rendered from new viewpoints but also, more critically, reasoned about in metric space. If a model cannot produce a stable 3D representation, it is not grounded in the physical world, and it will fail to model the world due to its inefficient contextual memory.

However, 3D is only the beginning. A truly useful world model must also be temporally and physically consistent. It should not only reconstruct a scene, but also simulate it, predicting how it evolves, how objects interact, and what happens under intervention. Eventually this requires moving beyond static representations toward models that capture dynamics and causality.

I believe that generative approaches are highly compelling in this context, as they can be trained on large-scale data in a self-supervised fashion. In particular, comprehensive 3D world modeling is a highly-promising path forward, since richer environmental representations directly enable deeper and more effective learning of physical reality. Crucially, such generation enforces consistency: for instance, to generate a scene across viewpoints, a model must implicitly recover its underlying 3D structure. To generate it over time, it must capture its dynamics. This forces the model to internalize the latent state of the world, including geometry, scale, materials, motion, and physical behavior.

This also highlights a limitation of purely abstract representations. High-level embeddings or action-centric models can be effective for specific tasks, but without the ability to model and simulate the world, they will eventually remain incomplete. They compress observations, but do not fully model the underlying process that generates them.

The next generation of AI systems should therefore move beyond text and pixels, and toward physically-grounded world models: models that represent space, maintain consistency over time, and enable simulation and interaction. This is the missing layer between the physical and digital world, which will ultimately enable AI systems not just to observe the world, but to understand and operate within it.
11
25
142
14.4K
AVB
AVB@neural_avb·
My favourite piece of Schmidhuber lore is when he challenged Ian Goodfellow during a NIPS presentation on GANs. Live, in public. Deep Learning drama peaked here. You have seen nothing like this.
Yuntian Deng@yuntiandeng

Glad to see followups to neural-os.com, but disappointed that neither the blog (with 34 refs) nor the code repo acknowledged NeuralOS, even tho the released data code appears to build directly on top of ours. That omission is hard to understand given our shared vision.

14
33
615
106.5K
Daniel DeTone
Daniel DeTone@ddetone·
@Capsbrr Ah sorry, I thought you meant running on Quest cameras, not running the ML model on Quest hardware. I don’t think this model can run in realtime on Quest. Though it could probably be distilled significantly with further effort, and then it might work
1
0
2
22
Carlos Pinheiro
Carlos Pinheiro@Capsbrr·
@ddetone How does Boxer run in real-time on the Quest 3, when it takes 20 ms to run on an *RTX 4090*? The hardware specs are worlds apart. I'm genuinely curious.
1
0
1
26
Daniel DeTone
Daniel DeTone@ddetone·
Today we release Boxer, a new lightweight approach that lifts open-world 2D bounding boxes to *metric* 3D: facebookresearch.github.io/boxer/ Here we show Boxer in action on an egocentric sequence captured from smart glasses:
22
168
1.3K
79.2K
Carlos Pinheiro
Carlos Pinheiro@Capsbrr·
@ddetone I would also be curious: does it run in real-time on the Meta Quest 3?
1
0
1
70
Daniel DeTone
Daniel DeTone@ddetone·
@weikaih04 that would be great! we didn't train on much outdoor data, so I would expect a big boost outdoors from training on the WildDet3D dataset
0
0
1
171
Weikai Huang
Weikai Huang@weikaih04·
Just checked out the Boxer model. The latency (20–40ms) and the generalization are both pretty impressive. Huge congrats to Daniel and the other authors! If you're interested in open-world 3D detection for outdoor/in-the-wild scenarios, also check out our WildDet3D 👇 github.com/allenai/WildDe…. Thinking about training Boxer with WildDet3D data to do 30fps in-the-wild 3D tracking.
Daniel DeTone@ddetone

Today we release Boxer, a new lightweight approach that lifts open-world 2D bounding boxes to *metric* 3D: facebookresearch.github.io/boxer/ Here we show Boxer in action on an egocentric sequence captured from smart glasses:

3
9
83
12.9K
Satyam Kumar
Satyam Kumar@_satyam_ai·
Meta recently open-sourced Boxer, a model that lifts 2D bounding boxes into 3D oriented bounding boxes (OBBs) for scene understanding. The catch? It was designed for Aria AR glasses, not regular cameras. So I built a pipeline to make it work with any phone video. The hard part: Boxer expects gravity from Aria's IMU. COLMAP doesn't know "up" from "down." Had to estimate gravity from camera poses and rotate the entire reconstruction. @Meta #ComputerVision #3DReconstruction #MetaAI #SceneUnderstanding
Daniel DeTone@ddetone

Today we release Boxer, a new lightweight approach that lifts open-world 2D bounding boxes to *metric* 3D: facebookresearch.github.io/boxer/ Here we show Boxer in action on an egocentric sequence captured from smart glasses:

2
1
14
668
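A rough sketch of the gravity step described in the post above, not the author's actual pipeline. It assumes COLMAP's camera convention (+Y points down in the image), that the phone was held roughly upright so the average image-down direction approximates gravity, and that the target gravity direction is (0, 0, -1). That target and all function names are hypothetical; Boxer's own convention should be checked in its repository.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def estimate_gravity(cam_to_world_rotations):
    """Average the world-frame direction of every camera's +Y (image-down) axis."""
    total = [0.0, 0.0, 0.0]
    for rot in cam_to_world_rotations:
        for i in range(3):
            total[i] += rot[i][1]  # column 1 = camera +Y axis in world coordinates
    return normalize(total)

def rotation_between(src, dst):
    """Rotation matrix that maps direction src onto direction dst (Rodrigues form)."""
    a, b = normalize(src), normalize(dst)
    cos_angle = sum(x * y for x, y in zip(a, b))
    if cos_angle < -1.0 + 1e-9:
        # Opposite directions: rotate 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        axis = normalize(cross(a, helper))
        return [[2.0 * axis[i] * axis[j] - (1.0 if i == j else 0.0)
                 for j in range(3)] for i in range(3)]
    v = cross(a, b)
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = matmul(k, k)
    s = 1.0 / (1.0 + cos_angle)
    return [[(1.0 if i == j else 0.0) + k[i][j] + s * k2[i][j]
             for j in range(3)] for i in range(3)]

def align_reconstruction(points, cam_to_world_rotations,
                         target_gravity=(0.0, 0.0, -1.0)):
    """Rotate points and camera orientations so estimated gravity hits the target."""
    gravity = estimate_gravity(cam_to_world_rotations)
    align = rotation_between(gravity, list(target_gravity))
    new_points = [matvec(align, p) for p in points]
    new_rotations = [matmul(align, r) for r in cam_to_world_rotations]
    return align, new_points, new_rotations
```

COLMAP stores world-to-camera poses, so the rotations would need to be transposed before being passed in, and camera centers rotate the same way as the points. Averaging image-down axes is only one heuristic; a ground-plane fit to the sparse points would be another option.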
Satyam Kumar
Satyam Kumar@_satyam_ai·
@ddetone Yes, feeding in the COLMAP sparse point cloud as depth input along with RGB frames. For the example video, it had around 14K points.
1
0
2
112
Daniel DeTone
Daniel DeTone@ddetone·
@yesitsarmin yes, the main limitation is the 2D detector here, but there are tons of better models (SAM3, VLMs) if you have the compute. for very cluttered scenes it doesn't work as well
0
0
0
71
𓅋 𐎫𐎤𐎶𐏀 ‎ﷺ
𓅋 𐎫𐎤𐎶𐏀 ‎ﷺ@yesitsarmin·
@ddetone this is very cool, can it work for any arbitrary objects, like stuff on a table, and can it work with stereo camera?
1
0
0
166
Daniel DeTone
Daniel DeTone@ddetone·
@ElioenaiSiqCst Yes, I didn't show any examples of that but we trained on a massive internal-only Quest3 dataset
0
0
0
120
Daniel DeTone
Daniel DeTone@ddetone·
@BlueAquilae great question! I would not expect it to work well here, we would need to re-train it with a full 9 DoF representation. but feel free to try it out anyway, I'd be curious
0
0
1
99
CleverBet
CleverBet@CleverBetTips·
@ddetone Your name is competing with one of the great albums of the last 30 years 😀
1
0
0
218
Daniel DeTone
Daniel DeTone@ddetone·
@haodongli00 One limitation I found using both of those models is the runtime. For detecting 1000+ text prompts with SAM3 it takes 20+ sec per image. SAM3D also takes ~15 sec per object, so running on large datasets can be expensive. OWLv2 runs at ~30ms and Boxer takes ~20ms
0
0
1
128
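To put the runtimes quoted in the post above in perspective, here is a back-of-the-envelope sketch. The dataset size (100,000 images) and the 10 objects per image are made-up illustration values, the function name is hypothetical, and the per-step timings are the approximate figures from the post, treated as exact.

```python
def pipeline_hours(num_images, objects_per_image, detect_s, lift_s, lift_is_per_object):
    """Total processing time in hours for a detect-then-lift pipeline."""
    lift_per_image = lift_s * objects_per_image if lift_is_per_object else lift_s
    return num_images * (detect_s + lift_per_image) / 3600.0

# Hypothetical workload, chosen only for illustration.
NUM_IMAGES = 100_000
OBJECTS_PER_IMAGE = 10

# SAM3 (~20 s per image) followed by SAM3D (~15 s per object).
slow_hours = pipeline_hours(NUM_IMAGES, OBJECTS_PER_IMAGE, 20.0, 15.0, True)

# OWLv2 (~30 ms per image) followed by Boxer (~20 ms per image, all boxes at once).
fast_hours = pipeline_hours(NUM_IMAGES, OBJECTS_PER_IMAGE, 0.030, 0.020, False)

print(f"SAM3 + SAM3D:  {slow_hours:.1f} h")
print(f"OWLv2 + Boxer: {fast_hours:.1f} h")
```

Under these assumptions the first pipeline needs roughly 4,700 GPU-hours and the second about 1.4, which is the gap the post is pointing at. The comparison ignores differences in output quality and hardware.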
Haodong Li
Haodong Li@haodongli00·
@ddetone Great work! Also doing something very similar, using SAM3, SAM3D and many other powerful tools! 😎
1
0
3
352
Daniel DeTone
Daniel DeTone@ddetone·
@nickkarpov Feel free to file a GitHub issue if you have any problems! Will do my best to answer them quickly
0
0
3
639
Daniel DeTone
Daniel DeTone@ddetone·
BoxerNet runs FAST 🔥🔥, taking roughly 20ms on a 4090 with bfloat16 for ALL prompts in an image (e.g. 30 boxes in parallel)
2
0
14
1.1K