Florent BARTOCCIONI

384 posts

@fbartoc

Building world models at valeoAI

Joined May 2020
1.1K Following · 87 Followers
Florent BARTOCCIONI reposted
Stan Szymanowicz @StanSzymanowicz
🍺 LagerNVS (CVPR 2026) 🍺 LagerNVS is a generalizable, feed-forward, real-time Novel View Synthesis network which
- performs rendering in real time,
- generalizes to in-the-wild data,
- works with and without known source cameras,
- sets a new state-of-the-art among deterministic methods,
- can be paired with a diffusion decoder for generative extrapolation.

LagerNVS shows that 3D biases are useful for Novel View Synthesis, but explicit 3D representations are not required to achieve them. We use 3D biases in (1) architecture design and (2) pre-training:

(1) In NVS with explicit 3D representations (3DGS, NeRF), reconstruction is typically difficult and slow, but rendering is much faster and simpler. We mimic this split in the network design: a large (1B param) encoder and a small, lightweight decoder (ViT-B). This allows increasing network capacity while still achieving real-time rendering.

(2) The encoder, initialized from VGGT, was pre-trained with 3D reconstruction objectives, making the initial features 3D aware.

Both substantially improve performance.

Project page: szymanowiczs.github.io/lagernvs
Code: github.com/facebookresear…
Paper: arxiv.org/abs/2603.20176
Models: huggingface.co/collections/fa…

Work done with @jianyuan_wang @MinghaoChen23 Christian Rupprecht and Andrea Vedaldi
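A minimal PyTorch sketch of that asymmetric design, assuming standard ViT building blocks: a heavy encoder runs once over the source images, and a light ray-conditioned decoder is queried per target view. All module names and sizes are illustrative stand-ins, not the released LagerNVS code.

```python
# Sketch: big encoder runs once per scene, tiny decoder renders each view.
import torch
import torch.nn as nn

class BigSourceEncoder(nn.Module):
    """Stand-in for the large (~1B param) encoder, e.g. initialized from VGGT."""
    def __init__(self, dim=1024, depth=24):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, src_images):                      # (B, V, 3, H, W)
        B, V = src_images.shape[:2]
        tok = self.patchify(src_images.flatten(0, 1))   # (B*V, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2)            # (B*V, N, dim)
        tok = self.blocks(tok)
        return tok.reshape(B, -1, tok.shape[-1])        # scene tokens, computed ONCE

class LightViewDecoder(nn.Module):
    """ViT-B-sized decoder: target-ray queries cross-attend to scene tokens."""
    def __init__(self, dim=768, enc_dim=1024, depth=4):
        super().__init__()
        self.ray_embed = nn.Linear(6, dim)              # Plücker-style ray encoding
        self.proj = nn.Linear(enc_dim, dim)
        self.xattn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 12, batch_first=True) for _ in range(depth)])
        self.to_rgb = nn.Linear(dim, 3 * 16 * 16)       # one RGB patch per ray query

    def forward(self, scene_tokens, target_rays):       # rays: (B, Q, 6)
        q = self.ray_embed(target_rays)
        kv = self.proj(scene_tokens)
        for attn in self.xattn:                         # cheap per-view rendering
            q = q + attn(q, kv, kv, need_weights=False)[0]
        return self.to_rgb(q)
```

The point of the split: encoding is amortized, so an interactive session pays the 1B-parameter encoder once and each new camera pose only pays the small decoder, which is what makes real-time rendering plausible.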
Florent BARTOCCIONI reposted
Lucas Maes @lucasmaes_
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
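For readers new to the JEPA objective: the model predicts the next frame's latent rather than its pixels. A minimal sketch of one training step follows, with `encoder` and `predictor` as assumed callables; how LeWorldModel keeps this stable end-to-end without EMA targets or stop-gradients is the paper's contribution, so treat this only as the shape of the loss.

```python
# JEPA world-model step: predict the next latent, never reconstruct pixels.
import torch.nn.functional as F

def jepa_step(encoder, predictor, obs_t, action_t, obs_tp1):
    z_t = encoder(obs_t)                  # latent of current frame
    z_tp1 = encoder(obs_tp1)              # latent of next frame (prediction target)
    z_pred = predictor(z_t, action_t)     # roll the latent forward under the action
    return F.mse_loss(z_pred, z_tp1)      # trained end-to-end through both encodings
```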
Florent BARTOCCIONI reposted
Xichen Pan @xichen_pan
There has been a lot of debate around the choice of denoising space. But it’s hard to get both semantics/diffusability and strong low-level reconstruction at the same time. REPA and VA-VAE are great explorations of adding semantics into the VAE space. After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe. The final results show improvements over vanilla JiT and outperform simply applying REPA. Thanks @hanlin_hl for leading this project!
Han Lin @hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵👇
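As a rough picture of what "co-denoising" means here: noise an image and its frozen DINO features with the same schedule, and ask one dual-stream model to denoise both. The sketch below also shows where point 4's RMS rescaling plugs in; the linear schedule, weighting, and model API are assumptions for illustration, not the V-Co recipe.

```python
# Joint pixel/feature denoising under a shared (linear, flow-style) schedule.
# `model` is an assumed dual-stream network; `dino` is a frozen extractor
# returning (B, N, D) tokens; t is (B,) in [0, 1].
import torch
import torch.nn.functional as F

def codenoising_loss(model, dino, x0, t, feat_weight=1.0):
    f0 = dino(x0)
    # Point 4: RMS-based rescaling so features match the pixel stream's scale.
    f0 = f0 / f0.pow(2).mean(-1, keepdim=True).sqrt().clamp_min(1e-6)
    eps_x, eps_f = torch.randn_like(x0), torch.randn_like(f0)
    x_t = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * eps_x
    f_t = (1 - t).view(-1, 1, 1) * f0 + t.view(-1, 1, 1) * eps_f
    pred_x, pred_f = model(x_t, f_t, t)        # denoise both streams together
    return F.mse_loss(pred_x, eps_x) + feat_weight * F.mse_loss(pred_f, eps_f)
```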

Florent BARTOCCIONI reposted
Kwang Moo Yi @kwangmoo_yi
Yu et al., "MosaicMem: Hybrid Spatial Memory for Controllable Video World Models" A patch-based spatial memory that you rasterize into views, plus glue to make things work.
Florent BARTOCCIONI reposted
Chelsea Finn @chelseabfinn
Usually, we expect more diverse data >> less diverse data. Cross-embodiment transfer seems to benefit from paired data across embodiments, more so than increasing diversity. Webpage & code: data-analogies.github.io Paper: arxiv.org/abs/2603.06450
Florent BARTOCCIONI reposted
Huan Ling @HuanLing6
Can we use a Genie-like world model as a real-world simulator? Today we introduce Nvidia AlpaDream: now you can drive in a video model! (The videos in the attached demo are all generated by a real-time video model!) Come test our interactive real-time demo with a gaming wheel at the GTC booth.
Zan Gojcic @ZGojcic

A new generation in AV simulation is here! We are announcing AlpaDreams, a real-time interactive generative world model for AV simulation! Just a year ago it took minutes to generate a few seconds of video; today it is real time and interactive! research.nvidia.com/labs/sil/proje…

Florent BARTOCCIONI reposted
DailyPapers @HuggingPapers
Seoul World Model Navigate the real streets of Seoul for kilometers without leaving your screen. This city-scale world model uses retrieval-augmented generation to ground every frame in actual street-view data. You can even spawn Godzilla or summon a tsunami via text prompts.
Florent BARTOCCIONI reposted
Zhikai Zhang @Zhikai273
🎾Introducing LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data Dynamic movements, agile whole-body coordination, and rapid reactions. A step toward athletic humanoid sports skills. Project: zzk273.github.io/LATENT/ Code: github.com/GalaxyGeneralR…
Florent BARTOCCIONI reposted
Sophie Wang @SophieLWang
I made an interactive blog post about how JPEG image compression works: sophielwang.com/blog/jpeg
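The post covers the full pipeline; as a pocket reference, here is the lossy core of JPEG on a single 8×8 luminance block: a 2-D DCT, then division by the standard JPEG luminance quantization table (this rounding is where most coefficients become zero and most bits are saved), then the inverse.

```python
# One 8x8 JPEG luminance block, round-tripped through DCT + quantization.
import numpy as np
from scipy.fft import dctn, idctn

Q_LUMA = np.array([                      # standard JPEG luminance table (Annex K)
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def jpeg_block_roundtrip(block):
    """block: 8x8 uint8 luminance values. Returns the lossy reconstruction."""
    coeffs = dctn(block.astype(float) - 128, norm='ortho')  # center, then 2-D DCT
    quantized = np.round(coeffs / Q_LUMA)                   # the lossy step
    dequant = quantized * Q_LUMA
    return np.clip(idctn(dequant, norm='ortho') + 128, 0, 255).astype(np.uint8)
```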
Florent BARTOCCIONI reposted
Ying Wang @yingwww_
What is a good latent space for world modeling and planning? 🤔 Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning. 📑: agenticlearning.ai/temporal-strai…
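A hedged guess at what such a regularizer can look like, following the perceptual-straightening idea the post cites: penalize the curvature of the latent trajectory so consecutive displacements stay aligned. The paper's actual objective may differ.

```python
# Straightening regularizer: consecutive latent displacements should align.
import torch
import torch.nn.functional as F

def straightening_loss(z):                 # z: (B, T, D) latent trajectory
    v = z[:, 1:] - z[:, :-1]               # displacements between adjacent frames
    cos = F.cosine_similarity(v[:, 1:], v[:, :-1], dim=-1)
    return (1 - cos).mean()                # 0 when the trajectory is a straight line
```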
Florent BARTOCCIONI reposted
Arnas Uselis @a_uselis
What do the embedding spaces of models that generalize from limited data look like? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)
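An easy way to probe for that structure yourself: embed a set of class names with CLIP or SigLIP and check how close the off-diagonal cosine similarities are to zero. The embeddings below are a random stand-in so the snippet runs standalone; swap in real features.

```python
# Orthogonality probe: near-zero off-diagonal cosines => near-orthogonal classes.
import numpy as np

emb = np.random.randn(10, 512)                     # placeholder: 10 class embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize each embedding
cos = emb @ emb.T                                  # pairwise cosine similarities
off_diag = cos[~np.eye(len(cos), dtype=bool)]
print(f"mean |cos| off-diagonal: {np.abs(off_diag).mean():.3f}")
```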
Florent BARTOCCIONI reposted
Alec Helbling @alec_helbling
Most of the visualizations have interactive elements that work by running an actual flow model on the front end using TensorFlow.js. It should even work on most mobile devices. Link to blog: alechelbling.com/blog/rectified…
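For context on what those front-end demos are computing: sampling from a rectified-flow model is just Euler-integrating a learned velocity field from noise (t=0) to data (t=1). A minimal sketch, with `velocity_model` as an assumed callable; the blog's demos do the equivalent in the browser.

```python
# Euler sampler for a rectified-flow model.
import torch

@torch.no_grad()
def sample_rectified_flow(velocity_model, shape, steps=50):
    x = torch.randn(shape)                   # start from Gaussian noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_model(x, t) * dt    # Euler step along the learned ODE
    return x                                 # approximate data sample at t=1
```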
Florent BARTOCCIONI reposted
Jan Eric Lenssen @janericlenssen
Can 3D scenes be represented by and rendered from a set of compressed tokens? It turns out they can, and it pairs very well with generative rendering to handle uncertainty! Make sure to check out @Mohamma68780050's recent work Scenetok, accepted at #CVPR2026. Links below.
Florent BARTOCCIONI reposted
Tengfei Wang @DylanTFWang
Autoregressive diffusion models drift for long videos? 📉 We fixed it. 🚀 Speed + Stability = ✅ Meet *Test-Time Correction (TTC)*: we stop error accumulation in its tracks without any retraining. ✅ Training-free ✅ 1 minute+ stable generation ✅ Negligible overhead
Florent BARTOCCIONI reposted
Nan Rosemary Ke @rosemary_ke
We’ll be presenting our paper at the Multi-turn Interactions and Embodied World Models workshops at #NeurIPS2025. Frontier foundation models are powerful—but how well can they explore and learn in interactive environments? Paper 👇 arxiv.org/abs/2412.06438 🧵1/13
Florent BARTOCCIONI reposted
Evan Kim @evnkimm
How do you train compute-optimal novel view synthesis models? In our CVPR ‘26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations--and along the way train a new SoTA with 3x less compute. (1/n)
Florent BARTOCCIONI reposted
Photoroom @photoroom_ML
How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series. 32× H200, 1 day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open sourced 🧵
Florent BARTOCCIONI reposted
George Bredis @BredisGeorge
Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer.
Project page: corl-team.github.io/nedreamer/
Paper: arxiv.org/pdf/2603.02765
Code: github.com/corl-team/nedr…
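A sketch of the objective swap being described, with Dreamer-style reconstruction shown alongside for contrast. Names, shapes, and the stop-gradient on the target are illustrative assumptions, not the NE-Dreamer code.

```python
# Reconstruction vs. next-embedding prediction as world-model objectives.
import torch
import torch.nn.functional as F

def world_model_losses(encoder, dynamics, decoder, obs_t, action_t, obs_tp1):
    h_t = encoder(obs_t)
    h_pred = dynamics(h_t, action_t)                 # one imagined step ahead
    # Dreamer-style: decode the latent back to pixels (the term being dropped).
    recon_loss = F.mse_loss(decoder(h_pred), obs_tp1)
    # NE-Dreamer-style: match the next observation's embedding directly.
    with torch.no_grad():                            # stop-grad target (an assumption)
        h_target = encoder(obs_tp1)
    next_emb_loss = F.mse_loss(h_pred, h_target)
    return recon_loss, next_emb_loss
```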
Florent BARTOCCIONI reposted
Hila Chefer @hila_chefer
New research from @bfl_ml 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 bfl.ai/research/self-… Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵