Florent BARTOCCIONI

382 posts

@fbartoc

Building world models at valeoAI

Joined May 2020
1.1K Following · 87 Followers
Florent BARTOCCIONI retweeted
Xichen Pan @xichen_pan
There has been a lot of debate around the choice of denoising space. But it’s hard to get both semantics/diffusability and strong low-level reconstruction at the same time. REPA and VA-VAE are great explorations of adding semantics into the VAE space. After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe. The final results show improvements over vanilla JiT and outperform simply applying REPA. Thanks @hanlin_hl for leading this project!
Han Lin@hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇
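Point 4️⃣ of the recipe, RMS-based feature rescaling, can be sketched in a few lines: rescale the pretrained semantic features so their root-mean-square magnitude matches a target scale before co-denoising. A minimal illustration only; the function names and the choice of calibration target here are assumptions, not the paper's code.

```python
import math

def rms(xs):
    """Root-mean-square magnitude of a flat list of feature values."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def rescale_features(features, target_rms):
    """Rescale semantic features (e.g. DINO tokens) so their RMS matches a
    target scale (e.g. the RMS of the pixel latents), so both streams see
    noise at a comparable signal magnitude during joint denoising.
    Illustrative names, not the paper's implementation."""
    scale = target_rms / (rms(features) + 1e-8)
    return [f * scale for f in features]

# Toy check: features with RMS 4.0 rescaled to RMS 1.0.
feats = [4.0, -4.0, 4.0, -4.0]
print(round(rms(rescale_features(feats, 1.0)), 6))  # 1.0
```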

Florent BARTOCCIONI retweeted
Kwang Moo Yi @kwangmoo_yi
Yu et al., "MosaicMem: Hybrid Spatial Memory for Controllable Video World Models". A patch-based spatial memory that you rasterize into views, plus some glue to make things work.
Florent BARTOCCIONI retweeted
Chelsea Finn @chelseabfinn
Usually, we expect more diverse data to beat less diverse data. Cross-embodiment transfer, however, seems to benefit more from paired data across embodiments than from increased diversity. Webpage & code: data-analogies.github.io Paper: arxiv.org/abs/2603.06450
Florent BARTOCCIONI retweeted
Huan Ling @HuanLing6
Can we use a Genie-like world model as a real-world simulator? Today we introduce Nvidia AlpaDream: now you can drive in a video model! (The videos in the attached demo are all generated by a real-time video model!) Come test our interactive real-time demo with a gaming wheel at the GTC booth.
Zan Gojcic@ZGojcic

A new generation in AV simulation is here! We are announcing AlpaDreams, a real-time interactive generative world model for AV simulation! Just a year ago it took minutes to generate a few seconds of video; today it is real time and interactive! research.nvidia.com/labs/sil/proje…

Florent BARTOCCIONI retweeted
DailyPapers @HuggingPapers
Seoul World Model Navigate the real streets of Seoul for kilometers without leaving your screen. This city-scale world model uses retrieval-augmented generation to ground every frame in actual street-view data. You can even spawn Godzilla or summon a tsunami via text prompts.
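The retrieval step behind that grounding can be illustrated with a toy nearest-neighbour lookup over a street-view index keyed by coordinates. All names and data here are hypothetical; the real system presumably conditions a video model on the retrieved imagery rather than returning it directly.

```python
def retrieve_nearest(query_pos, database):
    """Nearest-neighbour lookup over a toy street-view index keyed by
    (lat, lon) tuples; returns the (position, frame) entry closest to the
    query position. Illustrative of the retrieval step only."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(database, key=lambda item: sq_dist(item[0], query_pos))

# Hypothetical two-entry index of Seoul street-view frames.
db = [((37.55, 126.97), "frame_city_hall"),
      ((37.51, 127.06), "frame_gangnam")]
print(retrieve_nearest((37.54, 126.98), db)[1])  # frame_city_hall
```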
Florent BARTOCCIONI retweeted
Zhikai Zhang @Zhikai273
🎾Introducing LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data Dynamic movements, agile whole-body coordination, and rapid reactions. A step toward athletic humanoid sports skills. Project: zzk273.github.io/LATENT/ Code: github.com/GalaxyGeneralR…
Florent BARTOCCIONI retweeted
Sophie Wang @SophieLWang
I made an interactive blog post about how JPEG image compression works: sophielwang.com/blog/jpeg
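For context, the lossy heart of JPEG (one of the steps such a post covers) is quantization of 8×8 block DCT coefficients: each coefficient is divided by a table entry and rounded, discarding precision mostly in high frequencies. A toy sketch of that single step; the coefficient values are made up, and the table entries are the first four of the standard luminance quantization table.

```python
def quantize(dct_coeffs, qtable):
    """JPEG's lossy step: divide each DCT coefficient by its quantization
    table entry and round to the nearest integer."""
    return [round(c / q) for c, q in zip(dct_coeffs, qtable)]

def dequantize(quantized, qtable):
    """Decoder side: multiply back by the table, recovering coefficients
    only approximately -- the rounding error is the information lost."""
    return [v * q for v, q in zip(quantized, qtable)]

coeffs = [520.0, -30.2, 10.1, 3.4]  # toy DCT coefficients, low -> high freq
qtable = [16, 11, 10, 16]           # first entries of the standard luma table
q = quantize(coeffs, qtable)
print(q)                   # [32, -3, 1, 0]
print(dequantize(q, qtable))  # [512, -33, 10, 0]
```

Note how the smallest high-frequency coefficient is quantized to zero, which is exactly what makes the subsequent run-length and entropy coding effective.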
Florent BARTOCCIONI retweeted
Ying Wang @yingwww_
What is a good latent space for world modeling and planning? 🤔 Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning. 📑: agenticlearning.ai/temporal-strai…
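A straightening objective of this flavour can be sketched as a curvature penalty on the latent trajectory: measure how much consecutive displacement vectors turn, and drive that toward zero. A minimal sketch under that assumption; the function and its exact form are illustrative, not the paper's loss.

```python
import math

def straightness_penalty(traj):
    """Mean (1 - cosine similarity) between consecutive displacement
    vectors of a latent trajectory: 0 for a straight line, larger the
    more the trajectory bends. Illustrative, not the paper's objective."""
    def sub(a, b): return [x - y for x, y in zip(a, b)]
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a): return math.sqrt(dot(a, a))
    disps = [sub(traj[i + 1], traj[i]) for i in range(len(traj) - 1)]
    cost = 0.0
    for d1, d2 in zip(disps, disps[1:]):
        cost += 1.0 - dot(d1, d2) / (norm(d1) * norm(d2) + 1e-8)
    return cost / max(len(disps) - 1, 1)

straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(straightness_penalty(bent) > straightness_penalty(straight))  # True
```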
Florent BARTOCCIONI retweeted
Arnas Uselis @a_uselis
How do embedding spaces of models that generalize from limited data look? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)
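One way to probe the orthogonality claim on a model like CLIP or SigLIP is to check that class-mean embeddings have near-zero pairwise cosine similarity. A toy probe under that reading, not the paper's exact metric:

```python
import math

def mean_pairwise_cosine(class_embeddings):
    """Average absolute cosine similarity between class-mean embedding
    vectors; values near 0 suggest the classes occupy roughly orthogonal
    directions. A toy diagnostic, not the paper's measurement."""
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a): return math.sqrt(dot(a, a))
    sims = []
    n = len(class_embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = class_embeddings[i], class_embeddings[j]
            sims.append(abs(dot(a, b)) / (norm(a) * norm(b)))
    return sum(sims) / len(sims)

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(mean_pairwise_cosine(orthogonal))  # 0.0
```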
Florent BARTOCCIONI retweeted
Alec Helbling @alec_helbling
Most of the visualizations have interactive elements that work by running an actual flow model on the front end using TensorFlow.js. It should even work on most mobile devices. Link to blog: alechelbling.com/blog/rectified…
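Sampling a rectified-flow model of the kind the blog visualizes amounts to Euler integration of a learned velocity field from t = 0 to t = 1. A Python analogue of that loop (the blog's front end runs in TensorFlow.js; here the velocity function is a stand-in for the trained network):

```python
def euler_sample(v, x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps,
    the way a rectified-flow model is sampled. `v` stands in for the
    learned velocity network."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = [xi + dt * vi for xi, vi in zip(x, v(x, t))]
    return x

# Toy velocity field: constant flow from the origin toward (1, 2). The
# true flow map is a straight line, which Euler reproduces exactly
# (up to float rounding) regardless of step count.
v = lambda x, t: [1.0, 2.0]
print([round(c, 6) for c in euler_sample(v, [0.0, 0.0], steps=10)])  # [1.0, 2.0]
```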
Florent BARTOCCIONI retweeted
Jan Eric Lenssen @janericlenssen
Can 3D scenes be represented by and rendered from a set of compressed tokens? It turns out they can, and it pairs very well with generative rendering to handle uncertainty! Make sure to check out @Mohamma68780050's recent work SceneTok, accepted at #CVPR2026. Links below.
Florent BARTOCCIONI retweeted
Tengfei Wang @DylanTFWang
Autoregressive diffusion models drift for long videos? 📉 We fixed it. 🚀 Speed + Stability = ✅ Meet *Test-Time Correction (TTC)*. We stop error accumulation in its tracks without any retraining. ✅ Training-free ✅ 1 minute+ stable generation ✅ Negligible overhead
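For intuition on why uncorrected autoregressive rollouts drift, and how a training-free test-time correction can stop it, here is a deliberately generic toy: each step slightly amplifies its input, so errors compound; rescaling each frame's statistics back to a reference keeps the rollout calibrated. This illustrates the failure mode and the general idea only and is not the TTC algorithm.

```python
import math

def rms(xs):
    """Root-mean-square magnitude of a flat list of values."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def rollout(step, frame0, n, correct=False):
    """Autoregressive rollout of `step`. With correct=True, each new frame
    is rescaled so its RMS matches the first frame -- a generic test-time
    statistic correction, purely illustrative."""
    ref = rms(frame0)
    frames = [frame0]
    for _ in range(n):
        nxt = step(frames[-1])
        if correct:
            s = ref / (rms(nxt) + 1e-8)
            nxt = [x * s for x in nxt]
        frames.append(nxt)
    return frames

# A step that amplifies by 5% per frame: errors compound without correction.
step = lambda f: [1.05 * x for x in f]
print(rms(rollout(step, [1.0, -1.0], 50)[-1]) > 5)                     # True: drift blew up
print(abs(rms(rollout(step, [1.0, -1.0], 50, correct=True)[-1]) - 1.0) < 1e-4)  # True: stays calibrated
```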
Florent BARTOCCIONI retweeted
Chuanxia Zheng @ChuanxiaZ
#ICLR2026 🔥 Excited to share NOVA3R, the scene-level version of our previous Amodal3R. ✨ Key highlights: - Amodal reasoning: reconstructs occluded geometry - Physically plausible 3D with fewer duplicated structures Page: wrchen530.github.io/nova3r/ Paper: arxiv.org/pdf/2603.04179
Florent BARTOCCIONI retweeted
Nan Rosemary Ke @rosemary_ke
We’ll be presenting our paper at the Multi-turn Interactions and Embodied World Models workshops at #NeurIPS2025. Frontier foundation models are powerful—but how well can they explore and learn in interactive environments? Paper 👇 arxiv.org/abs/2412.06438 🧵1/13
Florent BARTOCCIONI retweeted
Evan Kim @evnkimm
How do you train compute-optimal novel view synthesis models? In our CVPR ‘26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations--and along the way train a new SoTA with 3x less compute. (1/n)
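Compute-optimal scaling studies of this kind typically fit loss ≈ a · C^b against compute C by least squares in log-log space. A toy helper showing that fitting step (not the paper's procedure; the data here is synthetic):

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by ordinary least squares in log-log
    space, the standard way scaling curves are estimated. Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data drawn exactly from loss = 10 * C**-0.5.
cs = [1e18, 1e19, 1e20]
a, b = fit_power_law(cs, [10 * c ** -0.5 for c in cs])
print(round(b, 3))  # -0.5
```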
Florent BARTOCCIONI retweeted
Photoroom @photoroom_ML
How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series: 32× H200 GPUs, one day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open-sourced 🧵
Florent BARTOCCIONI retweeted
George Bredis @BredisGeorge
Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer. Project page: corl-team.github.io/nedreamer/ Paper: arxiv.org/pdf/2603.02765 Code: github.com/corl-team/nedr…
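The "predict the next embedding" objective can be sketched as a latent-prediction loss: the world model's predictor is trained to match the encoder's embedding of the next observation, with no pixel decoder in the loop. All names here are illustrative stand-ins, not the paper's components.

```python
def next_embedding_loss(predictor, encoder, obs, next_obs, action):
    """Latent-prediction objective: the predictor maps (embedding, action)
    to a predicted next embedding, compared against the encoder's actual
    embedding of the next observation. In practice the target would come
    from a frozen or EMA encoder and be detached from the gradient."""
    z, z_next = encoder(obs), encoder(next_obs)
    z_pred = predictor(z, action)
    # Mean squared error in embedding space.
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_next)) / len(z_next)

# Toy check with an identity encoder and additive "dynamics": the
# predictor is exact, so the loss is zero.
enc = lambda o: o
pred = lambda z, a: [zi + a for zi in z]
print(next_embedding_loss(pred, enc, [0.0, 0.0], [1.0, 1.0], 1.0))  # 0.0
```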
Florent BARTOCCIONI retweeted
Hila Chefer @hila_chefer
New research from @bfl_ml 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 bfl.ai/research/self-… Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵
Florent BARTOCCIONI retweeted
Mohammad Asim @Mohamma68780050
📢 Super excited to announce that our work "SceneTok: A Compressed, Diffusable Token Space for 3D Scenes" has been accepted at #CVPR2026 📢