Afshin Dehghan

20 posts

Afshin Dehghan

@afshin_dn

Katılım Aralık 2023

45 Takip Edilen87 Takipçiler

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

Moving forward, we're excited to see what coarse-to-fine representations can enable beyond generation: world models that plan at different levels of abstraction using fewer tokens, efficient test-time scaling, as well as extending the framework to longer temporal signals and other modalities. 8/n

English

461

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

In this example, the compactness of VideoFlexTok tokens enables us to train a T2V model on 10-second, 81-frame videos using only 672 tokens, 8x fewer than a comparable 3D grid tokenizer. Importantly, the shorter sequence allows the model to capture long-range dependencies in context without resorting to a sliding-window approach. 7/n

English

369

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

The ability to express the prompt with only a few tokens enables more efficient downstream generative modeling. Here, we scale both the model and dataset sizes and find that VideoFlexTok yields a significantly better scaling trend in both fidelity (gFVD) and text alignment (ViCLIP), reducing training FLOPs by 5-10x compared to standard 3D-grid tokenization. 6/n

English

383

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

Autoregressive generative modeling over VideoFlexTok tokens allows describing the scene with progressively more details by generating more tokens. For text-to-video generation, we find that just 2-4 tokens per chunk can often capture coarse motion and semantic information. Beyond generating pixels, we believe this capability will be crucial for efficient world modeling and planning, enabling reasoning and search at different levels of abstraction. 5/n

GIF

English

428

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

We find that the first VideoFlexTok tokens emergently capture motion. To probe this, we run the following experiment. For a given source video (e.g., rolling orange), we condition the decoder on the original 1-2 tokens and an edited first frame (e.g., swap the orange with an apple). We find that reconstructions preserve the visual appearance of the edited frame and the motion pattern of the original video, suggesting that the first tokens primarily capture motion. 4/n

English

512

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

The VideoFlexTok encoder uses registers to resample a 3D spatiotemporal video into a 2D token sequence with temporal and coarse-to-fine dimensions. Nested dropout then randomly drops tokens from the end along the coarse-to-fine dimension, promoting early tokens to capture the most important information. The REPA (DINOv2 distillation) loss biases these early tokens toward semantics. Finally, a generative flow decoder enables realistic reconstructions from any token count. 3/n

English

691

Afshin Dehghan retweetledi

Andrei Atanov@andrew_atanov·15 Nis

VideoFlexTok builds on the image FlexTok and handles additional sequential structure in data. It assigns each chunk of video frames a flexible-length token sequence that describes the video from coarse to fine. It can reduce the token count by up to 256x while preserving the full video length and important semantic and motion information. This contrasts with fixed-size 3D tokenizers, which can capture only a fraction of the video with a lower token budget. 2/n

GIF

English

1.2K

Afshin Dehghan@afshin_dn·15 Nis

Great work from our intern. We did FlexTok for images last year. You asked for videos and we delivered. #VideoGeneration #VideoTokenization #ComputerVision #GenerativeAI #MachineLearning #Research #EPFL #Apple #AIInternship #MultimodalIntelligence

Andrei Atanov@andrew_atanov

Are all videos worth the same number of tokens? Whether rich in motion or visually minimal, standard 3D-grid tokenizers treat them equally. We present VideoFlexTok, which represents videos using a flexible-length, coarse-to-fine sequence of tokens. Page: videoflextok.epfl.ch Demo: huggingface.co/spaces/EPFL-VI… Paper: arxiv.org/abs/2604.12887 1/n

English

153

Afshin Dehghan@afshin_dn·16 Ara

📢 We’re hiring! My group has several open positions, with hiring across all levels, offering an exciting opportunity to work on some of the most impactful projects at Apple where you will tackle real-world challenges, build a long-term career with a clear growth path, develop a strong multidisciplinary skill set, conduct world-class research, and contribute back to the broader community through high-impact, visible work. Locations: Seattle & Bay Area If you’re interested, please send me a brief message including your CV and a link to your Google Scholar profile. Here are the links for the roles and different levels. Looking forward to connecting! AI Research Scientist - Multimodal Intelligence lnkd.in/dSGWR6J4 lnkd.in/dgzrcJcv Applied AI Scientist - Multimodal Intelligence lnkd.in/dPvg5fvK lnkd.in/drbK8juD #Hiring #AIResearch #MultimodalAI #AIJobs #ResearchScientist #FoundationModels #GenerativeAI #TechJobs

English

782

Afshin Dehghan@afshin_dn·21 Kas

A great, uplifting event at the SF Al mixer this week with some incredibly sharp thinkers!

English

107

Afshin Dehghan@afshin_dn·6 Ağu

When training LLMs, dataset size & quality matter as much as architecture. Scaling laws show: 📈 More compute → broader, less filtered data 📷 Less compute → tighter more curated datasets Small models need precision. Big models thrive on diversity. Optimize accordingly.

English

379

Afshin Dehghan@afshin_dn·18 Tem

Yesterday we shared our latest work on pretraining data curation. What if we stop guessing which data is “good” and directly match pretraining data to the benchmarks we care about? 📄 arxiv.org/abs/2507.12466 #AIResearch #llm #DataCuration #Pretraining #ScalingLaws

English

1.3K

Afshin Dehghan retweetledi

David Mizrahi@dmizrahi_·18 Tem

Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what happens if we explicitly match pretraining data to benchmarks? The result is a dead simple approach that yields 2x+ compute multipliers over strong baselines and gives us a principled way to study how benchmark choices shape (and constrain!) model capabilities. Bonus: extensive scaling laws from training 500+ models that reveal how optimal data selection evolves as models scale. 🧵 (1/14)

English

402

55.2K

Afshin Dehghan@afshin_dn·10 Haz

Incredibly proud of the work across teams in delivering the latest version of Visual Intelligence. Visual Intelligence makes it faster to do more with what’s right in front of you. #WWDC25 #visualintelligence #AppleIntelligence

English

137

Afshin Dehghan retweetledi

Francis Engelmann@FrancisEngelman·2 Haz

Very excited to announce our final line-up of fantastic speakers at this year's @CVPR workshop on Open-World 3D Scene Understanding with Foundation Models ✨ #OpenSUN3D #cvpr2025 📆 June 12, 2pm-6pm 🏡 opensun3d.github.io

English

6.8K

Afshin Dehghan@afshin_dn·24 Nis

Singapore can get you off a plane, through immigration, and into a cab in under 30 minutes. But at #ICLR25, you’ll need over 2 hours and a 0.5 mile hike just to get your badge. Congrats to #ICLR for breaking the record for most academic patience ever tested. #ICLR25 #ConfLife

English

816

Afshin Dehghan@afshin_dn·4 Nis

Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: lnkd.in/g4iNJFmU. Project Page: flextok.epfl.ch #FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI

English

7.1K

Afshin Dehghan@afshin_dn·3 Nis

🚀 Model and data for our CubifyAnything project are now released! 🔗 github.com/apple/ml-cubif… #SpatialReasoning #3DObjectDetection #transformers #detection #ai #genai

English

276

Afshin Dehghan retweetledi

Amir Zamir@zamir_ar·13 Ara

We'll present at NeurIPS, today at 5pm CST. Spotlight #1022. Effectively bringing sensory modalities to large models is one way to make them more grounded, and ultimately have a more complete World Model. This is a step in that direction hopefully, and more will come.

Amir Zamir@zamir_ar

4M exhibits having learned a solid cross-modal representation. We can use the various modalities to probe how 4M reconciles unusual inputs by manipulating one part of it while keeping the remainder fixed. (8/n)

English

6.3K

Afshin Dehghan retweetledi

Amir Zamir@zamir_ar·13 Ara

We are releasing the 1st version of 4M, a framework for training multimodal foundation models across tens of modalities & tasks, based on scalable masked modeling. Joint effort by @EPFL_en & @Apple. 4M: Massively Multimodal Masked Modeling 🌐4m.epfl.ch 🧵1/n

English

132

602

189.1K

Keşfet

@CVPR @EPFL_en @Apple @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates