Florent BARTOCCIONI

384 posts

@fbartoc

Building world models at valeoAI

Joined May 2020
1.1K Following · 87 Followers
Florent BARTOCCIONI reposted
Stan Szymanowicz @StanSzymanowicz
🍺 LagerNVS (CVPR 2026) 🍺 LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which
- performs rendering in real time,
- generalizes to in-the-wild data,
- works with and without known source cameras,
- sets a new state-of-the-art among deterministic methods,
- can be paired with a diffusion decoder for generative extrapolation.

LagerNVS shows that 3D biases are useful for Novel View Synthesis, but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training:

(1) In NVS with explicit 3D representations (3DGS, NeRF), reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this split in the network design: a large (1B param) encoder and a small, lightweight decoder (ViT-B). This allows increasing network capacity while still achieving real-time rendering.

(2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware.

Both substantially improve performance.

Project page: szymanowiczs.github.io/lagernvs
Code: github.com/facebookresear…
Paper: arxiv.org/abs/2603.20176
Models: huggingface.co/collections/fa…

Work done with @jianyuan_wang @MinghaoChen23 Christian Rupprecht and Andrea Vedaldi
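A minimal PyTorch sketch of that asymmetric design, assuming standard ViT building blocks: a heavy encoder runs once over the source images, and a light ray-conditioned decoder is queried per target view. All module names and sizes are illustrative stand-ins, not the released LagerNVS code.

```python
# Sketch: big encoder runs once per scene, tiny decoder renders each view.
import torch
import torch.nn as nn

class BigSourceEncoder(nn.Module):
    """Stand-in for the large (~1B param) encoder, e.g. initialized from VGGT."""
    def __init__(self, dim=1024, depth=24):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, src_images):                      # (B, V, 3, H, W)
        B, V = src_images.shape[:2]
        tok = self.patchify(src_images.flatten(0, 1))   # (B*V, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2)            # (B*V, N, dim)
        tok = self.blocks(tok)
        return tok.reshape(B, -1, tok.shape[-1])        # scene tokens, computed ONCE

class LightViewDecoder(nn.Module):
    """ViT-B-sized decoder: target-ray queries cross-attend to scene tokens."""
    def __init__(self, dim=768, enc_dim=1024, depth=4):
        super().__init__()
        self.ray_embed = nn.Linear(6, dim)              # Plücker-style ray encoding
        self.proj = nn.Linear(enc_dim, dim)
        self.xattn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 12, batch_first=True) for _ in range(depth)])
        self.to_rgb = nn.Linear(dim, 3 * 16 * 16)       # one RGB patch per ray query

    def forward(self, scene_tokens, target_rays):       # rays: (B, Q, 6)
        q = self.ray_embed(target_rays)
        kv = self.proj(scene_tokens)
        for attn in self.xattn:                         # cheap per-view rendering
            q = q + attn(q, kv, kv, need_weights=False)[0]
        return self.to_rgb(q)
```

The point of the split: encoding is amortized, so an interactive session pays the 1B-parameter encoder once and each new camera pose only pays the small decoder, which is what makes real-time rendering plausible.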
Florent BARTOCCIONI reposted
Lucas Maes @lucasmaes_
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
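For readers new to the JEPA objective: the model predicts the next frame's latent rather than its pixels. A minimal sketch of one training step follows, with `encoder` and `predictor` as assumed callables; how LeWorldModel keeps this stable end-to-end without EMA targets or stop-gradients is the paper's contribution, so treat this only as the shape of the loss.

```python
# JEPA world-model step: predict the next latent, never reconstruct pixels.
import torch.nn.functional as F

def jepa_step(encoder, predictor, obs_t, action_t, obs_tp1):
    z_t = encoder(obs_t)                  # latent of current frame
    z_tp1 = encoder(obs_tp1)              # latent of next frame (prediction target)
    z_pred = predictor(z_t, action_t)     # roll the latent forward under the action
    return F.mse_loss(z_pred, z_tp1)      # trained end-to-end through both encodings
```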
Florent BARTOCCIONI reposted
Xichen Pan @xichen_pan
There has been a lot of debate around the choice of denoising space. But it’s hard to get both semantics/diffusability and strong low-level reconstruction at the same time. REPA and VA-VAE are great explorations of adding semantics into the VAE space. After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe. The final results show improvements over vanilla JiT and outperform simply applying REPA. Thanks @hanlin_hl for leading this project!
Han Lin @hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵👇
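As a rough picture of what "co-denoising" means here: noise an image and its frozen DINO features with the same schedule, and ask one dual-stream model to denoise both. The sketch below also shows where point 4's RMS rescaling plugs in; the linear schedule, weighting, and model API are assumptions for illustration, not the V-Co recipe.

```python
# Joint pixel/feature denoising under a shared (linear, flow-style) schedule.
# `model` is an assumed dual-stream network; `dino` is a frozen extractor
# returning (B, N, D) tokens; t is (B,) in [0, 1].
import torch
import torch.nn.functional as F

def codenoising_loss(model, dino, x0, t, feat_weight=1.0):
    f0 = dino(x0)
    # Point 4: RMS-based rescaling so features match the pixel stream's scale.
    f0 = f0 / f0.pow(2).mean(-1, keepdim=True).sqrt().clamp_min(1e-6)
    eps_x, eps_f = torch.randn_like(x0), torch.randn_like(f0)
    x_t = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * eps_x
    f_t = (1 - t).view(-1, 1, 1) * f0 + t.view(-1, 1, 1) * eps_f
    pred_x, pred_f = model(x_t, f_t, t)        # denoise both streams together
    return F.mse_loss(pred_x, eps_x) + feat_weight * F.mse_loss(pred_f, eps_f)
```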

Florent BARTOCCIONI reposted
Kwang Moo Yi @kwangmoo_yi
Yu et al., "MosaicMem: Hybrid Spatial Memory for Controllable Video World Models" A patch-based spatial memory that you rasterize into views, plus glue to make things work.
Florent BARTOCCIONI reposted
Chelsea Finn @chelseabfinn
Usually, we expect more diverse data >> less diverse data. Cross-embodiment transfer seems to benefit from paired data across embodiments, more so than increasing diversity. Webpage & code: data-analogies.github.io Paper: arxiv.org/abs/2603.06450
Florent BARTOCCIONI reposted
Huan Ling @HuanLing6
Can we use a Genie-like world model as a real-world simulator? Today we introduce Nvidia AlpaDream: now you can drive in a video model! (The videos in the attached demo are all generated by a real-time video model!) Come test our interactive real-time demo with a gaming wheel at the GTC booth.
Zan Gojcic @ZGojcic

A new generation in AV simulation is here! We are announcing AlpaDreams, a real-time interactive generative world model for AV simulation! Just a year ago it took minutes to generate a few seconds of video; today it is real time and interactive! research.nvidia.com/labs/sil/proje…

Florent BARTOCCIONI reposted
DailyPapers @HuggingPapers
Seoul World Model Navigate the real streets of Seoul for kilometers without leaving your screen. This city-scale world model uses retrieval-augmented generation to ground every frame in actual street-view data. You can even spawn Godzilla or summon a tsunami via text prompts.
Florent BARTOCCIONI reposted
Zhikai Zhang @Zhikai273
🎾Introducing LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data Dynamic movements, agile whole-body coordination, and rapid reactions. A step toward athletic humanoid sports skills. Project: zzk273.github.io/LATENT/ Code: github.com/GalaxyGeneralR…
Florent BARTOCCIONI reposted
Sophie Wang @SophieLWang
I made an interactive blog post about how JPEG image compression works: sophielwang.com/blog/jpeg
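The post covers the full pipeline; as a pocket reference, here is the lossy core of JPEG on a single 8×8 luminance block: a 2-D DCT, then division by the standard JPEG luminance quantization table (this rounding is where most coefficients become zero and most bits are saved), then the inverse.

```python
# One 8x8 JPEG luminance block, round-tripped through DCT + quantization.
import numpy as np
from scipy.fft import dctn, idctn

Q_LUMA = np.array([                      # standard JPEG luminance table (Annex K)
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def jpeg_block_roundtrip(block):
    """block: 8x8 uint8 luminance values. Returns the lossy reconstruction."""
    coeffs = dctn(block.astype(float) - 128, norm='ortho')  # center, then 2-D DCT
    quantized = np.round(coeffs / Q_LUMA)                   # the lossy step
    dequant = quantized * Q_LUMA
    return np.clip(idctn(dequant, norm='ortho') + 128, 0, 255).astype(np.uint8)
```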
Florent BARTOCCIONI reposted
Ying Wang @yingwww_
What is a good latent space for world modeling and planning? 🤔 Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning. 📑: agenticlearning.ai/temporal-strai…
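A hedged guess at what such a regularizer can look like, following the perceptual-straightening idea the post cites: penalize the curvature of the latent trajectory so consecutive displacements stay aligned. The paper's actual objective may differ.

```python
# Straightening regularizer: consecutive latent displacements should align.
import torch
import torch.nn.functional as F

def straightening_loss(z):                 # z: (B, T, D) latent trajectory
    v = z[:, 1:] - z[:, :-1]               # displacements between adjacent frames
    cos = F.cosine_similarity(v[:, 1:], v[:, :-1], dim=-1)
    return (1 - cos).mean()                # 0 when the trajectory is a straight line
```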
Florent BARTOCCIONI reposted
Arnas Uselis @a_uselis
What do the embedding spaces of models that generalize from limited data look like? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)
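An easy way to probe for that structure yourself: embed a set of class names with CLIP or SigLIP and check how close the off-diagonal cosine similarities are to zero. The embeddings below are a random stand-in so the snippet runs standalone; swap in real features.

```python
# Orthogonality probe: near-zero off-diagonal cosines => near-orthogonal classes.
import numpy as np

emb = np.random.randn(10, 512)                     # placeholder: 10 class embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize each embedding
cos = emb @ emb.T                                  # pairwise cosine similarities
off_diag = cos[~np.eye(len(cos), dtype=bool)]
print(f"mean |cos| off-diagonal: {np.abs(off_diag).mean():.3f}")
```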
Florent BARTOCCIONI reposted
Alec Helbling @alec_helbling
Most of the visualizations have interactive elements that work by running an actual flow model on the front end using TensorFlow.js. It should even work on most mobile devices. Link to blog: alechelbling.com/blog/rectified…
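For context on what those front-end demos are computing: sampling from a rectified-flow model is just Euler-integrating a learned velocity field from noise (t=0) to data (t=1). A minimal sketch, with `velocity_model` as an assumed callable; the blog's demos do the equivalent in the browser.

```python
# Euler sampler for a rectified-flow model.
import torch

@torch.no_grad()
def sample_rectified_flow(velocity_model, shape, steps=50):
    x = torch.randn(shape)                   # start from Gaussian noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_model(x, t) * dt    # Euler step along the learned ODE
    return x                                 # approximate data sample at t=1
```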
Florent BARTOCCIONI reposted
Jan Eric Lenssen @janericlenssen
Can 3D scenes be represented by and rendered from a set of compressed tokens? It turns out they can, and it pairs very well with generative rendering to handle uncertainty! Make sure to check out @Mohamma68780050's recent work Scenetok, accepted at #CVPR2026. Links below.
Florent BARTOCCIONI reposted
Tengfei Wang @DylanTFWang
Autoregressive diffusion models drift for long videos? 📉 We fixed it. 🚀 Speed + Stability = ✅ Meet *Test-Time Correction (TTC)*: we stop error accumulation in its tracks without any retraining. ✅ Training-free ✅ 1 minute+ stable generation ✅ Negligible overhead
Florent BARTOCCIONI reposted
Nan Rosemary Ke @rosemary_ke
We’ll be presenting our paper at the Multi-turn Interactions and Embodied World Models workshops at #NeurIPS2025. Frontier foundation models are powerful—but how well can they explore and learn in interactive environments? Paper 👇 arxiv.org/abs/2412.06438 🧵1/13
Florent BARTOCCIONI reposted
Evan Kim @evnkimm
How do you train compute-optimal novel view synthesis models? In our CVPR ‘26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations--and along the way train a new SoTA with 3x less compute. (1/n)
Florent BARTOCCIONI reposted
Photoroom @photoroom_ML
How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series. 32× H200, 1 day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open sourced 🧵
Florent BARTOCCIONI reposted
George Bredis @BredisGeorge
Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer.
Project page: corl-team.github.io/nedreamer/
Paper: arxiv.org/pdf/2603.02765
Code: github.com/corl-team/nedr…
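A sketch of the objective swap being described, with Dreamer-style reconstruction shown alongside for contrast. Names, shapes, and the stop-gradient on the target are illustrative assumptions, not the NE-Dreamer code.

```python
# Reconstruction vs. next-embedding prediction as world-model objectives.
import torch
import torch.nn.functional as F

def world_model_losses(encoder, dynamics, decoder, obs_t, action_t, obs_tp1):
    h_t = encoder(obs_t)
    h_pred = dynamics(h_t, action_t)                 # one imagined step ahead
    # Dreamer-style: decode the latent back to pixels (the term being dropped).
    recon_loss = F.mse_loss(decoder(h_pred), obs_tp1)
    # NE-Dreamer-style: match the next observation's embedding directly.
    with torch.no_grad():                            # stop-grad target (an assumption)
        h_target = encoder(obs_tp1)
    next_emb_loss = F.mse_loss(h_pred, h_target)
    return recon_loss, next_emb_loss
```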
Florent BARTOCCIONI reposted
Hila Chefer @hila_chefer
New research from @bfl_ml 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 bfl.ai/research/self-… Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵