Antonio Oroz
@antonio_oroz

Germany · Joined May 2012
36 Following · 51 Followers
24 posts

Antonio Oroz retweeted
Angela Dai @angelaqdai
Excited to share HOI-PAGE, to appear at #ICML2026! 🚀 @craigleili generates 4D human-object interactions zero-shot from text. A part-affordance graph grounds interactions via LLM+video priors, enabling complex multi-person, multi-object interactions. 👉craigleili.github.io/projects/hoipa…
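
As a rough illustration of what a part-affordance graph might look like as a data structure (the class and field names below are assumptions for illustration, not the HOI-PAGE schema): object parts are nodes annotated with affordances, and edges link human body parts to the object parts they act on.

```python
# Illustrative sketch only; not the HOI-PAGE code or schema.
from dataclasses import dataclass, field

@dataclass
class PartNode:
    object_id: str
    part: str               # e.g. "door_handle", "lid", "seat"
    affordances: list[str]  # e.g. ["grasp", "pull"]

@dataclass
class InteractionEdge:
    body_part: str          # e.g. "right_hand"
    target: PartNode
    action: str             # e.g. "pull"

@dataclass
class PartAffordanceGraph:
    nodes: list[PartNode] = field(default_factory=list)
    edges: list[InteractionEdge] = field(default_factory=list)

# Example: "a person opens the microwave with the right hand"
handle = PartNode("microwave", "door_handle", ["grasp", "pull"])
graph = PartAffordanceGraph(nodes=[handle],
                            edges=[InteractionEdge("right_hand", handle, "pull")])
```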

Antonio Oroz retweeted
Wojciech Zielonka @w_zielonka
I am happy to share that our STAR (state-of-the-art report) has been accepted to Eurographics 2026: “How to Build Digital Humans?” It introduces a novel taxonomy and a concise overview of the full creation pipeline, from face and body to hands, garments, and hair. tinyurl.com/5f6u7rks

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢𝐁𝐈𝐆 𝐍𝐄𝐖𝐒: 𝐋𝐚𝐮𝐧𝐜𝐡𝐢𝐧𝐠 𝐄𝐜𝐡𝐨-𝟐 𝐓𝐨𝐝𝐚𝐲📢 My obsession with virtual environments started with childhood video games. But after years of research in 3D reconstruction and neural rendering, the bottleneck became obvious: we don't just need to generate better pixels; we need a foundation model that natively understands space and the underlying physics. That spatial grounding is exactly what you are seeing in the thread below. Echo-2 enables a two-way flow of knowledge between reality and simulation. It is the bridge between capturing the physical world and building the high-fidelity simulations required to train tomorrow's robots. "What I cannot create, I do not understand." — Richard Feynman.
[Quoted tweet from SpAItial AI (@SpAItial_AI); full text appears in the retweet below.]

Antonio Oroz retweeted
SpAItial AI @SpAItial_AI
🚀Echo-2 is here - our new world model!
These aren’t videos. These are 𝟑𝐃 𝐬𝐜𝐞𝐧𝐞𝐬. Generated from a single image.
- Stunning visual quality.
- Real-time rendering.
- Interactive camera control.
- Physically grounded.
🧵More details👇

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢Face Anything: 4D Face Reconstruction from Any Image Sequence
Transformer model for 4D face reconstruction and dense tracking:
- predict canonical facial coordinates per pixel
- tracking as reconstruction in canonical space
- geometry + correspondences in one forward pass
Key idea: a shared canonical space across frames
- correspondences as nearest neighbors
- no motion or deformation estimation
Stable geometry and tracking, even under large expressions and viewpoint changes - check out our results!
🌐 kocasariumut.github.io/FaceAnything
▶️ youtu.be/wSGHpAscp0Y
Great work by @UmutKocasa4344, @SGiebenhain, @richard_o_shaw
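
A minimal sketch of the nearest-neighbor correspondence idea described above (illustrative only, not the FaceAnything code): once every pixel of every frame carries a 3D coordinate in a shared canonical space, dense tracking between two frames reduces to a nearest-neighbor lookup in that space. Shapes, names, and the distance threshold are assumptions.

```python
# Illustrative sketch: dense correspondence via nearest neighbors
# in a shared canonical space.
import numpy as np
from scipy.spatial import cKDTree

def match_frames(canon_a: np.ndarray, canon_b: np.ndarray, max_dist: float = 5e-3):
    """canon_a, canon_b: (H*W, 3) per-pixel canonical coordinates of two frames.
    Returns index pairs (i, j): pixel i in frame A corresponds to pixel j in frame B."""
    tree = cKDTree(canon_b)
    dist, idx = tree.query(canon_a, k=1)   # nearest canonical point in frame B
    valid = dist < max_dist                # reject matches that are too far apart
    return np.stack([np.nonzero(valid)[0], idx[valid]], axis=1)
```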

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
Congrats to @Normanisation for his successful PhD defense 🥳🎓 Norman's thesis about 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 𝐨𝐧 𝟑𝐃 𝐑𝐞𝐩𝐫𝐞𝐬𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧𝐬 makes important contributions to the 3D vision community. For instance, DiffRF, a generative approach operating directly in 3D space, was among the first diffusion techniques for neural radiance fields. This led to many follow-up works in this area and sparked interest across the computer vision community, establishing generative approaches as a cornerstone of the 3D domain. After his PhD, Norman continues to work at the forefront of computer vision, as seen in his contributions to MapAnything, a universal feedforward approach for 3D reconstruction. Check out Norman's amazing work: normanm.de Congratulations Dr. Mueller - super proud!

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
🚀Announcing NeRSemble 3D Head Avatar Benchmark v2
Version 2 of the NeRSemble 3D Head Avatar Benchmark systematically evaluates several aspects of 3D head avatar creation. Our goal is to drive progress toward more realistic, robust, and generalizable avatar methods.
🔬Benchmark Tasks
The NeRSemble Benchmark v2 features three core challenges:
- Dynamic Novel View Synthesis
- Monocular FLAME-driven Avatar Creation (updated)
- Single-view 3D Face Reconstruction (new)
👉Explore the online leaderboard and submission system: kaldir.vc.cit.tum.de/nersemble_benc…
🆕What's new?
1. New Task: Single-view 3D Face Reconstruction
Given a single portrait image, reconstruct an accurate 3D mesh showing either the input expression or a fully neutral one. Unlike prior benchmarks, the NeRSemble benchmark emphasizes diverse and challenging facial expressions, better reflecting real scenarios. For technical details, see the Pixel3DMM paper.
2. Updated Task: Monocular FLAME-driven Avatar Creation
We have improved the FLAME tracking used both for avatar creation from the monocular videos and for avatar driving on the hidden test sequences. The updated benchmark task has:
- more stable torso tracking
- more expressive lip closures during speech
- improved mouth tracking for challenging facial expressions
We hope these improvements to the benchmark help drive the field forward.
🏆 CVPR 2026 Workshop & Prizes
The NeRSemble benchmark will be featured at the CVPR 2026 Workshop on Photo-realistic 3D Head Avatars. Participants in the new and updated tasks have the opportunity to win:
- 🎁 RTX 5080 GPUs (sponsored by NVIDIA)
- 🎤 a 15-minute oral presentation at the workshop
⏰ Submission deadline: May 26, 2026
Reach out to the amazing @TobiasKirschst1 and @SGiebenhain for more details :)

Antonio Oroz retweeted
Angela Dai @angelaqdai
📢Diff3r: fast feed-forward 3DGS + per-scene optimization. @liuyuehcheng predicts an optimization-ready 3DGS init end to end, computing implicit gradients via the Implicit Function Theorem + a Gauss-Newton approximation for fast & stable results. Check it out: liu115.github.io/diff3r
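
For readers unfamiliar with the trick named in the tweet, here is a toy sketch of implicit differentiation with a Gauss-Newton Hessian approximation on a small least-squares problem. This is not the Diff3r implementation; the residual and all names are made up to show the mechanics: at the inner optimum, the sensitivity of the solution to outer parameters follows from the Implicit Function Theorem, with the Hessian approximated as JᵀJ.

```python
# Toy sketch of IFT gradients with a Gauss-Newton Hessian approximation.
import numpy as np

def residuals(theta, phi, A, b):
    # inner least-squares problem: r(theta, phi) = A @ theta - (b + phi)
    return A @ theta - (b + phi)

def solve_inner(phi, A, b):
    # theta*(phi) = argmin_theta 0.5 * ||r||^2  (closed form for this toy case)
    return np.linalg.lstsq(A, b + phi, rcond=None)[0]

def implicit_grad(theta_star, phi, A, b):
    # IFT at the inner optimum:  d theta*/d phi = -H^{-1} d^2L/(dtheta dphi)
    # Gauss-Newton: H ~= J^T J with J = dr/dtheta = A;  dr/dphi = -I here.
    J = A
    dr_dphi = -np.eye(len(b))
    H = J.T @ J
    return -np.linalg.solve(H, J.T @ dr_dphi)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)); b = rng.standard_normal(8); phi = np.zeros(8)
theta_star = solve_inner(phi, A, b)
print(implicit_grad(theta_star, phi, A, b).shape)  # (3, 8): sensitivity of theta* to phi
```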

Antonio Oroz retweeted
Angela Dai @angelaqdai
📢Seen2Scene: Real-world 3D data is incomplete, typically requiring training on synthetic scene data. @QTDSMQ introduces visibility-guided flow matching, enabling training on real partial scans for scan completion & text-to-3D scene generation! Check it out: quan-meng.github.io/projects/seen2…
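
One plausible reading of "visibility-guided flow matching", sketched as a training step (my own loose interpretation, not the Seen2Scene code): a standard conditional flow-matching loss is computed only on regions that were actually observed in the partial scan, so real, incomplete data can supervise the model. All shapes and names are assumptions.

```python
# Illustrative sketch: flow-matching regression restricted to visible regions.
import torch

def visibility_guided_fm_loss(model, x1, visible, sigma_min=1e-4):
    """x1: (B, N, 3) partial target samples; visible: (B, N) bool mask of observed regions."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1     # linear interpolation path
    target_v = x1 - (1 - sigma_min) * x0             # conditional flow-matching target velocity
    pred_v = model(xt, t)                            # network predicts the velocity field
    per_point = ((pred_v - target_v) ** 2).sum(-1)   # (B, N)
    return (per_point * visible).sum() / visible.sum().clamp(min=1)
```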

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢GaussianGPT: autoregressive 3D Gaussian scene generation
We introduce a GPT-style model that directly generates 3D Gaussian scenes, token by token, in a series of small, discrete decision steps. Generation, completion, and large-scale outpainting in a single pipeline.
Unlike diffusion-based approaches, GaussianGPT explicitly models the scene distribution at every step, allowing for quite flexible scene synthesis.
🌐 nicolasvonluetzow.github.io/GaussianGPT/
▶️ youtu.be/zVnMHkFzHDg
Great work by @nicolasvluetzow, @barbara_roessle, @katha_schmid
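
The token-by-token generation described above follows the standard autoregressive decode loop; below is a generic sketch of that loop (not the GaussianGPT code; how Gaussian attributes are quantized into discrete tokens is an assumption here, and the special token ids are placeholders).

```python
# Generic GPT-style decode loop over a discrete scene-token vocabulary.
import torch

@torch.no_grad()
def sample_scene(model, bos_id, eos_id, max_tokens=4096, temperature=1.0):
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_tokens):
        logits = model(tokens)[:, -1, :]                  # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)                 # one small, discrete decision step
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == eos_id:
            break
    return tokens[0, 1:]  # token sequence, to be de-quantized into Gaussian attributes
```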

Antonio Oroz retweeted
Angela Dai @angelaqdai
📢Lookalike3D: Seeing Double in 3D. @chandan__yes enables holistic, instance-consistent 3D object reconstruction & part segmentation by detecting identical and near-identical objects from multiview images. Built on a dataset of 76k curated object pairs. cy94.github.io/lookalike3d/

Antonio Oroz retweeted
Angela Dai @angelaqdai
Image & video synthesis struggles with the scale of truly large 3D scenes. @mschneider456 presents a geometry-first approach:
- structure first: a mesh scaffold defining the scene
- then appearance: mesh-conditioned image synthesis
Check it out: mschneider456.github.io/world-mesh/
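
A minimal sketch of the geometry-first recipe as described (the stage implementations are placeholder callables, not the authors' code): build the mesh scaffold first, then condition image synthesis on renderings of that scaffold.

```python
# Illustrative two-stage pipeline; all stage functions are placeholders.
def generate_scene(prompt, build_scaffold, render_geometry_buffers, synthesize_image, cameras):
    mesh = build_scaffold(prompt)                        # stage 1: structure
    frames = []
    for cam in cameras:                                  # stage 2: appearance, conditioned on geometry
        gbuffers = render_geometry_buffers(mesh, cam)    # e.g. depth / normals of the scaffold
        frames.append(synthesize_image(prompt, gbuffers))
    return mesh, frames
```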

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢WorldAgents: 3D worlds only from 2D image models - without any training!
We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt.
What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them!
ziyaerkoc.com/worldagents
youtu.be/Mj2FqqhurdI
Great work by @ErkocZiya @angelaqdai
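
The Director/Generator/Verifier split maps onto a simple propose-and-select loop. The sketch below is a generic version of that pattern, not the WorldAgents code; the three callables stand in for a planning VLM, an image generator, and a consistency-checking VLM, and the 0.5 acceptance threshold is an arbitrary placeholder.

```python
# Generic director / generator / verifier loop (illustrative only).
def expand_world(prompt, director, generator, verifier, num_views=8, candidates_per_view=4):
    views = []
    plan = director(prompt)                                   # VLM proposes where to look next
    for viewpoint in plan[:num_views]:
        cands = [generator(prompt, viewpoint, views) for _ in range(candidates_per_view)]
        scored = [(verifier(c, views), c) for c in cands]     # VLM scores 3D consistency
        best_score, best = max(scored, key=lambda s: s[0])
        if best_score > 0.5:                                  # keep only views judged consistent
            views.append(best)
    return views
```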

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output. Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally consistent 3DGS representation. -> sharp visuals on top of any VDM! lukashoel.github.io/video_to_world
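
To make the alignment idea concrete, here is a deliberately simplified sketch that aligns each frame rigidly (Kabsch/Procrustes) to a global reference before fusing. The method described above performs a non-rigid alignment into a global 3DGS representation, so treat this only as the simplest instance of the general recipe; correspondences and names are assumed.

```python
# Simplified per-frame alignment sketch (rigid Kabsch, not the non-rigid method above).
import numpy as np

def kabsch_align(src: np.ndarray, ref: np.ndarray):
    """src, ref: (N, 3) corresponding points. Returns R, t with R @ src[i] + t ~= ref[i]."""
    src_c, ref_c = src - src.mean(0), ref - ref.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ ref_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref.mean(0) - R @ src.mean(0)
    return R, t

def fuse_frames(frames, ref):
    """frames: list of (N, 3) per-frame points in correspondence with ref; returns fused cloud."""
    aligned = []
    for pts in frames:
        R, t = kabsch_align(pts, ref)
        aligned.append(pts @ R.T + t)
    return np.concatenate(aligned, axis=0)
```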

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢📢📢Data release: high-res, multi-view, OLAT face recordings 📢📢📢 We captured individuals in our custom light stage with 16 high-end, global-shutter cameras (72 fps) and 40 LED modules, totaling 2.8M precisely calibrated frames. We use the data for BecomingLit (#NeurIPS2025): intrinsically decomposed Gaussian avatars, enabling photorealistic and real-time relighting via hybrid neural shading. Code & Data: jonathsch.github.io/becominglit/ Great work by @jnthnschmdt, @SGiebenhain
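
A quick sanity check of the capture numbers, assuming the 2.8M figure counts frames summed over all 16 cameras:

```python
# Back-of-the-envelope check of the capture volume (assumption: 2.8M is the total over all cameras).
total_frames = 2.8e6
cameras, fps = 16, 72
seconds_per_camera = total_frames / cameras / fps
print(f"{seconds_per_camera:.0f} s ≈ {seconds_per_camera / 60:.1f} min of synchronized capture")
# -> roughly 2431 s ≈ 40.5 min of recorded sequences per camera viewpoint
```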

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
📢Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image📢
We directly regress neural parametric head models (NPHMs) from a single image — fast, stable, and significantly more expressive than classical 3DMMs such as FLAME.
Face tracking & 3D reconstruction are often limited by the representational capacity of PCA-based face models. By lifting NPHMs to a first-class reconstruction primitive, we enable more accurate geometry, richer expressions, and finer animation control.
Pix2NPHM obtains fast and reliable NPHM reconstructions on real-world data. Inference-time optimization against surface normals and canonical point maps can further increase fidelity.
Key to successful and generalized training of our ViT-based network are: (1) large-scale registration of existing 3D head datasets, and (2) self-supervised training on vast in-the-wild 2D video datasets using pseudo ground-truth surface normals.
Finally, we show that geometry-aware pretraining on pixel-aligned reconstruction tasks significantly outperforms generic visual pretraining (e.g., DINO-style features) in terms of generalization.
🌍simongiebenhain.github.io/Pix2NPHM
🎥youtu.be/MgpEJC5p1Ts
Great work by @SGiebenhain, @TobiasKirschst1, @liamschoneveld, Davide Davoli, Zhe Chen
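
The inference-time optimization mentioned above can be pictured as a short gradient-based refinement of the regressed latent against the predicted surface normals and canonical point maps. The loop below is a generic sketch with placeholder names and a placeholder decode function, not the Pix2NPHM code.

```python
# Generic inference-time refinement loop (illustrative only).
import torch

def refine_latent(z_init, decode, target_normals, target_points, steps=50, lr=1e-2):
    """decode(z) -> (normals, points): a differentiable decoder/renderer supplied by the caller."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        normals, points = decode(z)
        loss = torch.nn.functional.l1_loss(normals, target_normals) \
             + torch.nn.functional.l1_loss(points, target_points)
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()
```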

Antonio Oroz retweeted
Tobias Kirschstein @TobiasKirschst1
Super excited to announce FlexAvatar! 📢📢 With FlexAvatar, you can create a full 360°, high-quality, and expressive 3D head avatar from just a single portrait image. In this real-time demo, we showcase the avatar creation, which takes only 2 minutes. 👉tobias-kirschstein.github.io/flexavatar/
[Quoted tweet from Matthias Niessner (@MattNiessner); full text appears in the retweet below.]

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
Want to create an avatar from a single image? FlexAvatar is a transformer model that creates a full 360°, high-quality, and expressive 3D head avatar from just a single portrait image in minutes.
Real-time Demo: FlexAvatar's lightweight architecture allows both animation and rendering in real time, enabling interactive user experiences. To create a new 3D head avatar, only one image is required, e.g., from a webcam. The final avatar is ready after 2 minutes.
Architecture: Under the hood, FlexAvatar adopts a transformer-based encoder-decoder design. The encoder maps the input image onto a latent avatar space, while the decoder produces 3D Gaussian attribute maps by incorporating the animation signal via cross-attention. The model learns all facial animations directly from the data without relying on pre-built 3D face models. This equips the avatars with realistic facial expressions.
The internal avatar latent space can be conveniently used to integrate additional observations of a person via fitting. This enables use cases where more than one image of a person is available, e.g., from a phone scan of the person.
We train jointly on 2D monocular videos and multi-view data. However, in monocular videos the animation signal leaks the target viewpoint, causing the model to produce incomplete 3D heads. We call this phenomenon entanglement of driving signal and target viewpoint.
To prevent entanglement, we introduce bias sinks. These are learnable tokens that indicate whether a training sample stems from a monocular or a multi-view dataset. During training, the model learns to produce incomplete 3D heads only when the monocular token is present. During inference, FlexAvatar then always uses the multi-view token, for which the model has learned to produce complete 3D heads. This simple design allows us to combine the generalizability of monocular data with the quality of multi-view data.
FlexAvatar summary:
- Input: single image, phone scan, or monocular video
- Output: full 360° head avatar
- Expressive animations
- Real-time rendering and animation
- Generalization to any portrait
- Create a new avatar in 2 minutes
- Bias sinks to combine 2D and 3D data
🏠tobias-kirschstein.github.io/flexavatar/
🌍arxiv.org/pdf/2512.15599
🎥youtu.be/g8wxqYBlRGY
Great work by @TobiasKirschst1 and @SGiebenhain!
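
A minimal sketch of how a "bias sink" token could be wired in, based only on the description above (the dimensions, class names, and where the token is injected are assumptions, not the FlexAvatar code): a learnable per-data-source token is prepended to the decoder input during training, and the multi-view token is always used at inference.

```python
# Illustrative sketch of a per-data-source "bias sink" token.
import torch
import torch.nn as nn

class BiasSinkConditioner(nn.Module):
    MONOCULAR, MULTIVIEW = 0, 1

    def __init__(self, dim: int = 512):
        super().__init__()
        self.sink = nn.Embedding(2, dim)   # one learnable token per data source

    def forward(self, tokens: torch.Tensor, source: int) -> torch.Tensor:
        """tokens: (B, N, dim) decoder input tokens; source: MONOCULAR for
        monocular-video training batches, MULTIVIEW otherwise and at inference."""
        b = tokens.shape[0]
        sink = self.sink(torch.full((b, 1), source, device=tokens.device))
        return torch.cat([sink, tokens], dim=1)
```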

Antonio Oroz retweeted
SpAItial AI @SpAItial_AI
🚀 Announcing Echo — our new frontier model for 3D world generation.
Echo turns a simple text prompt or image into a fully explorable, 3D-consistent world. Instead of disconnected views, the result is a single, coherent spatial representation you can move through freely. This is part of a bigger shift in AI: from generating pixels and tokens to generating spaces.
Echo predicts a geometry-grounded 3D scene at metric scale, meaning every novel view, depth map, and interaction comes from the same underlying world — not independent hallucinations.
Once generated, the world is interactive in real time. You control the camera, explore from any angle, and render instantly — even on low-end hardware, directly in the browser. High-quality 3D world exploration is no longer gated by expensive equipment.
Under the hood, Echo infers a physically grounded 3D representation and converts it into a renderable format. For our web demo, we use 3D Gaussian Splatting (3DGS) for fast, GPU-friendly rendering — but the representation itself is flexible and can be easily adapted.
Why this matters: consistent 3D worlds unlock real workflows — digital twins, 3D design, game environments, robotics simulation, and more. From a single photo or a line of text, Echo builds worlds that are reliable, editable, and spatially faithful.
Echo also enables scene editing and restyling. Change materials, remove or add objects, explore design variations — all while preserving global 3D consistency. Editing no longer breaks the world.
This is only the beginning. Echo is the foundation for future world models with dynamics, physical reasoning, and richer interaction — environments that don’t just look right, but behave right.
Explore the generated worlds on our website and sign up for the closed beta. The era of spatial intelligence starts here. 🌍
#Echo #WorldModels #SpatialAI #3DFoundationModels
Check it out: spaitial.ai

Antonio Oroz retweeted
Matthias Niessner @MattNiessner
Releasing Echo today is incredibly exciting for me — because it is a critical step for generative AI, enabling the creation of virtual worlds.
Echo is our first world model at SpAItial AI. It turns text or images into explorable 3D environments — spaces you can move through, inspect, and build on. Seeing this work in real time still feels a bit surreal.
My fascination with this goes back a long way: video games, virtual environments, and the idea of capturing the real world in 3D. As a researcher, I spent years working on 3D reconstruction, neural rendering, and scene understanding — all driven by the same question: how do we teach machines to understand the world?
One thing became clear over time: the biggest bottleneck isn’t compute or rendering — it’s 3D worlds themselves. High-quality, consistent environments are expensive to create by hand and don’t scale to the experiences we want to build. In particular, I believe that the ability to generate virtual worlds is ultimately key to understanding the real world.
That’s why we founded SpAItial AI. We’re building spatial world models that combine geometric understanding with creative generation — models that can generate, edit, and eventually reason about 3D environments. Echo is just the beginning.
For me, this feels like the moment when decades of research finally meet the imagination that got many of us into graphics, games, and 3D understanding in the first place. 🌍
spaitial.ai
[Quoted tweet from SpAItial AI (@SpAItial_AI); full text appears in the retweet above.]