Florent BARTOCCIONI

382 posts

@fbartoc

Building world models at valeoAI

Joined May 2020

1.1K Following, 87 Followers

Florent BARTOCCIONI reposted

Xichen Pan@xichen_pan·2d

There has been a lot of debate around the choice of denoising space. But it’s hard to get both semantics/diffusability and strong low-level reconstruction at the same time. REPA and VA-VAE are great explorations of adding semantics into the VAE space. After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe. The final results show improvements over vanilla JiT and outperform simply applying REPA. Thanks @hanlin_hl for leading this project!

Han Lin@hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇

Florent BARTOCCIONI reposted

Kwang Moo Yi@kwangmoo_yi·3d

Yu et al., "MosaicMem: Hybrid Spatial Memory for Controllable Video World Models" A patch-based spatial memory that you raster into views + glues to make things work.

Florent BARTOCCIONI reposted

Chelsea Finn@chelseabfinn·16 Mar

Usually, we expect more diverse data >> less diverse data. Cross-embodiment transfer seems to benefit from paired data across embodiments, more so than increasing diversity. Webpage & code: data-analogies.github.io Paper: arxiv.org/abs/2603.06450

Florent BARTOCCIONI reposted

Huan Ling@HuanLing6·6d

Can we use a genie-like world model as a real-world simulator? Today we introduce Nvidia AlpaDream: now you can drive in a video model! (The videos in the attached demo are all generated by a real-time video model!) Come test our interactive real-time demo with a gaming wheel at the GTC booth.

Zan Gojcic@ZGojcic

A new generation in AV simulation is here! We are announcing AlpaDreams, a real-time interactive generative world model for AV simulation! Just a year ago it took minutes to generate a few seconds of video; today it is real-time and interactive! research.nvidia.com/labs/sil/proje…

Florent BARTOCCIONI reposted

DailyPapers@HuggingPapers·6d

Seoul World Model: Navigate the real streets of Seoul for kilometers without leaving your screen. This city-scale world model uses retrieval-augmented generation to ground every frame in actual street-view data. You can even spawn Godzilla or summon a tsunami via text prompts.

Florent BARTOCCIONI reposted

Zhikai Zhang@Zhikai273·15 Mar

🎾Introducing LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data Dynamic movements, agile whole-body coordination, and rapid reactions. A step toward athletic humanoid sports skills. Project: zzk273.github.io/LATENT/ Code: github.com/GalaxyGeneralR…

Florent BARTOCCIONI reposted

Sophie Wang@SophieLWang·13 Mar

I made an interactive blog post about how JPEG image compression works: sophielwang.com/blog/jpeg

Florent BARTOCCIONI reposted

Ying Wang@yingwww_·13 Mar

What is a good latent space for world modeling and planning? 🤔 Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning. 📑: agenticlearning.ai/temporal-strai…

Florent BARTOCCIONI reposted

Arnas Uselis@a_uselis·11 Mar

How do embedding spaces of models that generalize from limited data look? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)

Florent BARTOCCIONI reposted

Alec Helbling@alec_helbling·16 Jan

Most of the visualizations have interactive elements that work by running an actual flow model on the front end using TensorFlow.js. It should even work on most mobile devices. Link to blog: alechelbling.com/blog/rectified…

Florent BARTOCCIONI reposted

Jan Eric Lenssen@janericlenssen·9 Mar

Can 3D scenes be represented by and rendered from a set of compressed tokens? It turns out they can, and it pairs very well with generative rendering to handle uncertainty! Make sure to check out @Mohamma68780050's recent work SceneTok, accepted at #CVPR2026. Links below.

Florent BARTOCCIONI reposted

Tengfei Wang@DylanTFWang·9 Mar

Autoregressive diffusion models drift for long videos? 📉 We fixed it. 🚀 Speed + Stability = ✅ Meet *Test-Time Correction (TTC)*. We stop error accumulation in its tracks without any retraining. ✅ Training-free ✅ 1 minute+ stable generation ✅ Negligible overhead

Florent BARTOCCIONI reposted

Chuanxia Zheng@ChuanxiaZ·7 Mar

#ICLR2026 🔥 Excited to share NOVA3R, the scene-level version of our previous Amodal3R. ✨ Key highlights:
- Amodal reasoning: reconstructs occluded geometry
- Physically plausible 3D with fewer duplicated structures
Page: wrchen530.github.io/nova3r/
Paper: arxiv.org/pdf/2603.04179

Florent BARTOCCIONI reposted

Nan Rosemary Ke@rosemary_ke·17 Oct

We’ll be presenting our paper at the Multi-turn Interactions and Embodied World Models workshops at #NeurIPS2025. Frontier foundation models are powerful—but how well can they explore and learn in interactive environments? Paper 👇 arxiv.org/abs/2412.06438 🧵1/13

Florent BARTOCCIONI reposted

Evan Kim@evnkimm·6 Mar

How do you train compute-optimal novel view synthesis models? In our CVPR '26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations, and along the way train a new SoTA with 3x less compute. (1/n)

Florent BARTOCCIONI reposted

Photoroom@photoroom_ML·4 Mar

How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series. 32× H200, 1 day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open sourced 🧵

Florent BARTOCCIONI reposted

George Bredis@BredisGeorge·4 Mar

Most imagination-based world models learn representations by reconstructing pixels. But reconstruction may not be the right objective for control. In our new paper we explore a different idea: 👉 predict the next embedding instead of reconstructing observations. Introducing NE-Dreamer. Project page: corl-team.github.io/nedreamer/ Paper: arxiv.org/pdf/2603.02765 Code: github.com/corl-team/nedr…

Florent BARTOCCIONI reposted

Hila Chefer@hila_chefer·4 Mar

New research from @bfl_ml 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 bfl.ai/research/self-… Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵

Florent BARTOCCIONI reposted

Zhenjun Zhao@zhenjun_zhao·24 Feb

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes @Mohamma68780050, Christopher Wewer, @janericlenssen tl;dr: in title arxiv.org/abs/2602.18882