Marc Benedí

135 posts

@marcbenedi

PhD Candidate @ TU Munich Visual Computing & Artificial Intelligence Group w/ Matthias Niessner. Previously CS @ UPC - FIB

Munich · Joined August 2012
352 Following · 318 Followers
Marc Benedí retweeted
Tobias Kirschstein @TobiasKirschst1:
If you are at #Eurographics tomorrow, don't miss our STAR session on "How to Build Digital Humans?" 🕺
🗓️ Monday, 4th of May
🕐 1:15 pm - 2:45 pm
🏡 Kino 5
We will have experts in the field share their thoughts on 3D avatars. It will be cinematic!
Marc Benedí retweeted
Wojciech Zielonka @w_zielonka:
I am happy to share that our STAR has been accepted to Eurographics 2026: “How to Build Digital Humans?” It introduces a novel taxonomy and a concise overview of the full creation pipeline, from face and body to hands, garments, and hair. tinyurl.com/5f6u7rks
Marc Benedí retweeted
Simon Giebenhain @SGiebenhain:
7/ 🇬🇧 The support system in the iOS app is also a mess. When I try to get help again via the app, a hopeful message turns out to link to my old chat, where I was the last one responding.
Marc Benedí retweeted
Simon Giebenhain @SGiebenhain:
@Uber_Support @Uber_Brasil urgent help needed! I left my luggage in an Uber in Rio with my passport inside. I’ve already filed a report in the app, but the driver isn’t responding. I’m a tourist and need to fly back to Germany. Can you help me reach him? 🙏
Marc Benedí retweeted
Angela Dai @angelaqdai:
Image & video synthesis struggles with the scale of truly large 3D scenes. @mschneider456 presents a geometry-first approach:
- structure first: mesh scaffold defining the scene
- then appearance: mesh-conditioned image synthesis
Check it out: mschneider456.github.io/world-mesh/
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 WorldAgents: 3D worlds only from 2D image models - without any training!

We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt.

What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them!

ziyaerkoc.com/worldagents
youtu.be/Mj2FqqhurdI

Great work by @ErkocZiya @angelaqdai
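The Director/Generator/Verifier loop can be sketched as a simple propose-and-select routine. Everything below is illustrative (function names, the dummy scoring), not the released code: in the paper the Director and Verifier are VLMs and the Generator is a 2D image model such as Flux or NanoBanana.

```python
def director_plan(prompt, n_views=4):
    # Stand-in for the Director: plan which views of the scene to generate.
    return [{"prompt": prompt, "azimuth": 360 * i // n_views} for i in range(n_views)]

def generator_propose(view, n_candidates=3):
    # Stand-in for the Generator: sample several candidate images per view.
    return [{"view": view["azimuth"], "seed": k} for k in range(n_candidates)]

def verifier_score(candidate, accepted_so_far):
    # Stand-in for the Verifier: rate 3D consistency of a candidate against
    # the views accepted so far. Here we just pretend seed 1 is most consistent.
    return -abs(candidate["seed"] - 1)

def build_world(prompt):
    # Agentic loop: plan views, sample candidates, keep the most consistent one.
    accepted = []
    for view in director_plan(prompt):
        candidates = generator_propose(view)
        best = max(candidates, key=lambda c: verifier_score(c, accepted))
        accepted.append(best)
    return accepted

world = build_world("a cozy library at dusk")
```

The point of the sketch: the image model is never fine-tuned; the agents only search its outputs for a mutually consistent set of views.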
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output.

Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally-consistent 3DGS representation. -> Sharp visuals on top of any VDM!

lukashoel.github.io/video_to_world
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image 📢

We directly regress neural parametric head models (NPHMs) from a single image — fast, stable, and significantly more expressive than classical 3DMMs such as FLAME.

Face tracking & 3D reconstruction are often limited by the representational capacity of PCA-based face models. By lifting NPHMs to a first-class reconstruction primitive, we enable more accurate geometry, richer expressions, and finer animation control. Pix2NPHM obtains fast and reliable NPHM reconstructions on real-world data. Inference-time optimization against surface normals and canonical point maps can further increase fidelity.

Key to successful and generalized training of our ViT-based network are: (1) large-scale registration of existing 3D head datasets, and (2) self-supervised training on vast in-the-wild 2D video datasets using pseudo ground-truth surface normals.

Finally, we show that geometry-aware pretraining on pixel-aligned reconstruction tasks significantly outperforms generic visual pretraining (e.g., DINO-style features) in terms of generalization.

🌍 simongiebenhain.github.io/Pix2NPHM
🎥 youtu.be/MgpEJC5p1Ts

Great work by @SGiebenhain, @TobiasKirschst1, @liamschoneveld, Davide Davoli, Zhe Chen
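The "inference-time optimization against surface normals" step can be pictured as a tiny gradient refinement of the regressed latent. This is a toy sketch under strong assumptions: the real pipeline renders normals from an NPHM, while here `predict_normals` is just a fixed linear map and the gradient is numerical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "render surface normals from a head latent":
# a fixed linear map with orthonormal columns (synthetic, not an NPHM).
A, _ = np.linalg.qr(rng.normal(size=(12, 4)))

def predict_normals(z):
    return A @ z

def refine_latent(z0, target_normals, lr=0.1, steps=300, eps=1e-4):
    # Inference-time refinement sketch: nudge the latent z so the predicted
    # normals match the 2D-estimated targets, using a numerical gradient
    # of the L2 normal loss.
    z = z0.copy()
    for _ in range(steps):
        base = ((predict_normals(z) - target_normals) ** 2).sum()
        grad = np.zeros_like(z)
        for i in range(len(z)):
            zp = z.copy()
            zp[i] += eps
            grad[i] = (((predict_normals(zp) - target_normals) ** 2).sum() - base) / eps
        z -= lr * grad
    return z

z_true = rng.normal(size=4)
target = predict_normals(z_true)   # pretend these came from the 2D normal estimator
z_init = rng.normal(size=4)        # the directly regressed latent
z_refined = refine_latent(z_init, target)
```

In practice the refinement would be done with automatic differentiation through the renderer; the numerical gradient here only keeps the sketch dependency-free.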
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
Want to create an avatar from a single image? FlexAvatar is a transformer model that creates a full 360°, high-quality, and expressive 3D head avatar from just a single portrait image in minutes.

Real-time demo: FlexAvatar's lightweight architecture allows both animation and rendering in real time, enabling interactive user experiences. To create a new 3D head avatar, only one image is required, e.g., from a webcam. The final avatar is ready after 2 minutes.

Architecture: Under the hood, FlexAvatar adopts a transformer-based encoder-decoder design. The encoder maps the input image onto a latent avatar space, while the decoder produces 3D Gaussian attribute maps by incorporating the animation signal via cross-attention. The model learns all facial animations directly from the data without relying on pre-built 3D face models. This equips the avatars with realistic facial expressions.

The internal avatar latent space can be conveniently used to integrate additional observations of a person via fitting. This enables use cases where more than one image of a person is available, e.g., from a phone scan of the person.

We train jointly on 2D monocular videos and multi-view data. However, in monocular videos, the animation signal leaks the target viewpoint, causing the model to produce incomplete 3D heads. We call this phenomenon entanglement of driving signal and target viewpoint.

To prevent entanglement, we introduce bias sinks. These are learnable tokens that indicate whether a training sample stems from a monocular or a multi-view dataset. During training, the model learns to produce incomplete 3D heads only when the monocular token is present. During inference, FlexAvatar then always uses the multi-view token, for which the model has learned to produce complete 3D heads. This simple design allows us to combine the generalizability of monocular data with the quality of multi-view data.

FlexAvatar summary:
- Input: Single image, phone scan, or monocular video
- Output: Full 360° head avatar
- Expressive animations
- Real-time rendering and animation
- Generalization to any portrait
- Create a new avatar in 2 minutes
- Bias sinks to combine 2D and 3D data

🏠 tobias-kirschstein.github.io/flexavatar/
🌍 arxiv.org/pdf/2512.15599
🎥 youtu.be/g8wxqYBlRGY

Great work by @TobiasKirschst1 and @SGiebenhain!
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
Congrats to @yawarnihal for winning the @MdsiTum best paper award for his amazing 𝐌𝐞𝐬𝐡𝐆𝐏𝐓 work 🎉

MeshGPT autoregressively generates compact, artist-style triangle meshes by tokenizing faces into a learned discrete vocabulary (VQ-style codebook) and training a decoder-only transformer to predict those face tokens — because discrete tokenization + attention lets GPT-style models learn long-range geometric & topological patterns and produce coherent, high-fidelity 3D assets.

MeshGPT's use cases go far beyond traditional content creation applications in computer graphics. For instance, the method was developed in collaboration with @Audi to help rapid prototyping of car designs, where explicit and precise mesh design is essential.

In the research community, there have already been many follow-ups such as MeshAnything, MeshXL, Meshtron, and many more - finally, we can use AI to generate high-fidelity 3D content :)

Project: nihalsid.github.io/mesh-gpt/
Video: youtu.be/UV90O1_69_o
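The tokenization idea can be sketched in a few lines: map each triangle to its nearest entry in a discrete codebook, yielding the token sequence a GPT-style model would predict. The codebook here is random for illustration; in MeshGPT it is learned with a VQ-style autoencoder, and real faces are quantized per-vertex rather than whole.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 64 entries, each standing in for one triangle's
# 9 coordinates (3 vertices x xyz). Random here, learned in MeshGPT.
CODEBOOK = rng.normal(size=(64, 9))

def tokenize_faces(faces):
    # Nearest-neighbour lookup: each triangle becomes one discrete token id,
    # forming the sequence a decoder-only transformer predicts autoregressively.
    flat = faces.reshape(len(faces), 9)
    dists = ((flat[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def detokenize(tokens):
    # Decode token ids back to triangle coordinates via codebook lookup.
    return CODEBOOK[tokens].reshape(-1, 3, 3)

mesh = rng.normal(size=(5, 3, 3))  # 5 random triangles
tokens = tokenize_faces(mesh)
recon = detokenize(tokens)
```

Once meshes are token sequences, standard language-model machinery (attention over long contexts, next-token prediction) applies unchanged.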
Marc Benedí retweeted
Tobias Kirschstein @TobiasKirschst1:
We will present Avat3r at #ICCV2025! 🥳 Avat3r brings animation to Large Reconstruction Models. One surprising finding was that we can get rid of any template-based deformation modeling and simply use cross-attention to an abstract facial expression code. tobias-kirschstein.github.io/avat3r/
Matthias Niessner @MattNiessner:

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢

Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model.

Video: youtu.be/P3zNVx15gYs
Project: tobias-kirschstein.github.io/avat3r

Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer match corresponding image tokens in the 4 input images.

One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject could not remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on.

Finally, we show that even though the model was trained with 4 input images, one can create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI-generated faces, paintings, or statues.

Great work by @TobiasKirschst1 from his internship at Meta with Javier Romero, @ASevastopolsky, and @psyth91
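The "cross-attention to an expression code sequence" idea is just standard scaled dot-product attention with avatar tokens as queries and expression codes as keys/values. A minimal single-head sketch (projection matrices and learned weights omitted; all shapes are toy):

```python
import numpy as np

def cross_attention(queries, context):
    # Single-head scaled dot-product cross-attention: avatar tokens (queries)
    # attend to the abstract expression code sequence (keys = values = context).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    # numerically stable softmax over the context axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
avatar_tokens = rng.normal(size=(32, 8))    # latent tokens that decode to Gaussians
expression_code = rng.normal(size=(4, 8))   # abstract expression sequence
animated_tokens = cross_attention(avatar_tokens, expression_code)
```

The appeal of this design is that no template mesh or blendshape rig is needed; the expression signal is injected purely through attention.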

Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans 🏠✨

-> Converts RGB-D scans into compact, realistic, and interactive 3D scenes — featuring high-quality meshes, PBR materials, and articulated objects.

📷 youtu.be/ecK9m3LXg2c
🌍 litereality.github.io
Marc Benedí retweeted
Tobias Kirschstein @TobiasKirschst1:
Happening now in room 110A! Shunsuke Saito @psyth91 talking about Codec Avatars!
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 PBR-SR: Mesh PBR Texture Super Resolution from 2D Image Priors 📢

We propose a new optimization to up-sample textures of 3D assets (albedo, roughness, metallic, and normal maps) by leveraging 2D super-resolution models.

📝 arxiv.org/abs/2506.02846
📽️ youtu.be/eaM5S3Mt1RM
Marc Benedí retweeted
Tobias Kirschstein @TobiasKirschst1:
NeRSemble benchmark submission deadline extension for #CVPR2025! Due to lots of submissions in the past days, we have decided to extend the deadline until Wednesday, 28th May. You have 5 more days to submit your SOTA method for dynamic NVS and monocular 3D head avatar creation!
Matthias Niessner @MattNiessner:

📢 Announcing our 3D head avatar benchmark 📢

Two tasks with hidden test sets:
- Dynamic Novel View Synthesis on Heads
- Monocular FLAME-driven Head Avatar Reconstruction

Our goal is to make research on 3D head avatars more comparable and ultimately increase the realism of digital humans. The benchmark studies distinct phenomena of 3D head avatar creation, such as extreme facial expressions, slow-motion captures of shaking long hair, or complicated light reflection and refraction patterns of glasses.

The two benchmark tasks assess two core desiderata of 3D avatars: while the novel view synthesis challenge focuses on the best possible rendering quality of complex moving scenes, the avatar animation challenge is concerned with how well a driving signal is translated into an avatar.

Evaluations are lightweight and consist of diverse video recordings from the popular NeRSemble dataset with a hidden test set. Participation in the benchmark is therefore straightforward and requires only 5 reconstructions per task.

Leaderboard and benchmark submission: kaldir.vc.in.tum.de/nersemble_benc…
Benchmark data access and toolkit: github.com/tobias-kirschs…

Great work by @TobiasKirschst1 @SGiebenhain

Marc Benedí retweeted
Jack Saunders @jack_r_saunders:
📣 Potential Game Changer for Video Game Animation 📣
Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
Marc Benedí retweeted
Angela Dai @angelaqdai:
📢 QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

@liuyuehcheng learns 2DGS initialization, densification, and optimization priors from ScanNet++ => fast & accurate reconstruction!

Project: liu115.github.io/quicksplat
Marc Benedí retweeted
Matthias Niessner @MattNiessner:
📢 Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction 📢

-> Highly accurate face reconstruction by training powerful ViTs for surface normal and UV-coordinate estimation.

The geometric cues from our 2D foundation model backbone constrain the 3DMM parameters, which allows us to achieve remarkable reconstruction accuracy - works for both single images and videos! In addition, we introduce a new 3D face reconstruction benchmark that evaluates both neutral and posed face geometry.

🌍 simongiebenhain.github.io/pixel3dmm
📷 youtu.be/BwxwEXJwUDc

Great work by @SGiebenhain @TobiasKirschst1 @martin_ruenz @LourdesAgapito
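The core fitting step, screen-space predictions constraining a linear 3DMM, can be sketched as a least-squares solve. This is a toy version under stated assumptions: the basis is random rather than a real FLAME-style model, and the targets are synthetic stand-ins for the ViT-predicted normals and UV coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5                        # toy vertex and coefficient counts
mean_shape = rng.normal(size=(N, 3))
basis = rng.normal(size=(N * 3, K))  # stand-in linear 3DMM basis (think blendshapes)

def fit_coeffs(target_points):
    # Fit 3DMM coefficients so the linear model matches per-vertex targets.
    # In Pixel3DMM the constraints come from predicted surface normals and
    # UV coordinates; a linear model admits a closed-form least-squares solve.
    residual = (target_points - mean_shape).reshape(-1)
    coeffs, *_ = np.linalg.lstsq(basis, residual, rcond=None)
    return coeffs

true_coeffs = rng.normal(size=K)
target = mean_shape + (basis @ true_coeffs).reshape(N, 3)
est = fit_coeffs(target)
```

The real system optimizes through a rendering model rather than solving a single linear system, but the sketch shows why dense pixel-aligned cues over-constrain the low-dimensional 3DMM parameters so effectively.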