Scott Lowe

404 posts

@scottclowe

Postdoctoral Research Fellow in Machine Learning at Vector Institute. PhD in Computational Neuroscience. British.

Joined April 2009
186 Following · 333 Followers
Pinned Tweet
Scott Lowe@scottclowe·
New paper: "Self-Distillation of Hidden Layers for Self-Supervised Representation Learning"

We introduce Bootleg, a simple twist on I-JEPA/MAE that dramatically improves self-supervised representations.

The idea: MAE predicts pixels (stable but low-level). I-JEPA predicts final-layer embeddings (high-level but unstable). Bootleg bridges the two by predicting representations from multiple hidden layers of the teacher network (early, middle, and late) simultaneously.

Why it works: early layers provide stimulus-driven grounding that prevents collapse; deep layers provide semantic targets; and the information bottleneck of compressing all abstraction levels through masked patches forces the encoder to build richer representations.

The method is quite simple on top of I-JEPA: extract targets from evenly-spaced blocks, z-score and concatenate, widen the predictor's final layer. That's it.

Frozen probe results (no fine-tuning):
ImageNet-1K: 76.7% with ViT-B (+10pp over both I-JEPA and MAE)
iNaturalist-21: 58.3% with ViT-B (+17pp over I-JEPA, +15pp over MAE)
ADE20K segmentation: 30.9% mIoU with ViT-B (+11pp over I-JEPA, +6pp over MAE)
Cityscapes segmentation: 35.9% mIoU with ViT-B (+11pp over I-JEPA, +5pp over MAE)

Gains hold across ViT-S, ViT-B, and ViT-L. Single-view, batch-size independent: no augmentation stack, no multi-crop, no contrastive loss, no large compute requirements.

Our study is just on images, but this change can be readily deployed to MAE and JEPA models across all domains.

arxiv.org/abs/2603.15553
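The three-step recipe in the tweet (extract targets from evenly spaced teacher blocks, z-score each level, concatenate) can be sketched in a few lines. This is my own minimal numpy illustration, not the paper's code; the layer indices, epsilon, and shapes are assumptions.

```python
import numpy as np

def build_targets(hidden_states, layer_ids=(3, 7, 11)):
    """Collect teacher activations from evenly spaced blocks,
    z-score each level, and concatenate along the channel axis."""
    levels = []
    for i in layer_ids:
        h = hidden_states[i]                       # (batch, tokens, dim)
        mu = h.mean(axis=(0, 1), keepdims=True)    # per-channel mean
        sd = h.std(axis=(0, 1), keepdims=True)     # per-channel std
        levels.append((h - mu) / (sd + 1e-6))      # z-score this level
    return np.concatenate(levels, axis=-1)

# Toy usage: 12 layers of fake activations for a ViT-B-sized model
states = [np.random.randn(2, 196, 768) for _ in range(12)]
targets = build_targets(states)
print(targets.shape)  # (2, 196, 2304)
```

The predictor's final layer then only needs widening to `dim * len(layer_ids)` channels to regress this concatenated target.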
Scott Lowe@scottclowe·
Yes, exactly! We ask the student to compress the information needed to decode multiple hidden layers within the teacher into the space of just one layer (its output layer). The targets are taken from teacher layers spaced across depth, so they are decorrelated from each other, and they are spatially separated from the seen patches by the block masking, so they are decorrelated from those too.
Pravesh Biyani@pravesh·
@scottclowe Very nice work! Can I invite you to give an online talk at our Institute?
Scott Lowe@scottclowe·
Regular JEPA is with masking, yes. Our innovation is self-distillation of multiple hidden layers: asking the student to predict multiple representations from within the teacher. It had not been done before, but other works have done related things recently. Most notably, V-JEPA 2.1 actually came up with an almost identical methodology at the same time (which they dub "deep self-supervision"), and which coincidentally was released on arXiv on the same day as Bootleg.
Dan Ofer (Was @ICML,@Worldcon )
@scottclowe It wasn't done already? I've been doing masked + JEPA as the baseline, I assumed everyone did? (It works much better than just JEPA.)
Scott Lowe@scottclowe·
@BarneyFlames Yes, I will add this in the future. Thank you for the feedback.
Total NIMBY Death@BarneyFlames·
@scottclowe Great work, but it would have been nice to have at least one eval on a non-ImageNet training set.
Scott Lowe@scottclowe·
(1) Sin-cos position encodings were used for both the encoder and predictor, following the methodologies of MAE and I-JEPA, which we "interpolate between".
(2) Learnable embeddings were no better than frozen sin-cos positions.
(3) Preliminary experiments showed swapping to RoPE increased performance, but it costs 30% more compute.
(4) We found using RoPE was anti-synergistic with using register tokens. Using either one increased performance, but using both at once did not. I am not aware that this phenomenon has been reported before. As register tokens have negligible compute burden, we opted for frozen sin-cos positions + registers instead of RoPE without registers.
See the final paragraph of Appendix B for details: arxiv.org/html/2603.1555…
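For readers unfamiliar with the frozen sin-cos encodings discussed here, a common MAE-style 2D construction looks like the sketch below. The function names and details are mine, not taken from the paper: half the channels encode the row index, half the column index, each with the usual sinusoidal frequencies.

```python
import numpy as np

def sincos_pos_embed_1d(dim, positions):
    """Frozen sinusoidal embedding for one axis (dim must be even)."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    out = positions[:, None] * omega[None, :]            # (n, dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def sincos_pos_embed_2d(dim, grid_size):
    """2D grid embedding: half the channels encode rows, half columns."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                         indexing="ij")
    emb_y = sincos_pos_embed_1d(dim // 2, ys.reshape(-1).astype(float))
    emb_x = sincos_pos_embed_1d(dim // 2, xs.reshape(-1).astype(float))
    return np.concatenate([emb_y, emb_x], axis=1)        # (grid², dim)

pe = sincos_pos_embed_2d(768, 14)  # 14x14 patch grid, ViT-B/16 at 224px
print(pe.shape)  # (196, 768)
```

Because the table is deterministic, it can be computed once at model build time and added to the patch tokens with no learnable parameters.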
Wcabca@WCelhen·
@scottclowe why do you use frozen sin-cos encoding for the predictor only?
Philip Akomolafe@PhilipAkomolaf_·
@scottclowe @ylecun Thanks for the guidance, I will do them, and also reach out on progress. Would that be good, sir?
Scott Lowe@scottclowe·
@PhilipAkomolaf_ @ylecun You can use 1 GPU without issue. I used 8 GPUs to replicate the I-JEPA recipe as best I could, but we have no batch-size-dependent loss terms, so you can reduce it to 1 without issue. If you want to replicate our work exactly on 1 GPU, just use gradient accumulation over 8 batches.
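The equivalence Scott relies on here (8 GPUs versus gradient accumulation over 8 micro-batches) can be sketched with a toy linear model; this example is mine, not from the paper. With a mean loss and equal-sized micro-batches, averaging the per-micro-batch gradients reproduces the full-batch gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 4)), rng.normal(size=256)
w = np.zeros(4)

def grad(w, xb, yb):
    """Mean-squared-error gradient of a linear model on one micro-batch."""
    return 2 * xb.T @ (xb @ w - yb) / len(yb)

# One large batch of 256 ...
g_full = grad(w, X, y)

# ... equals the average over 8 micro-batches of 32 (gradient accumulation)
g_accum = np.mean([grad(w, X[i::8], y[i::8]) for i in range(8)], axis=0)
print(np.allclose(g_full, g_accum))  # True
```

In a real training loop, the same effect is achieved by scaling each micro-batch loss by 1/8 and only stepping the optimizer after 8 backward passes.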
Scott Lowe@scottclowe·
Really cool work - the core idea is strikingly similar to what we found with Bootleg. We also predict multiple teacher layers spread across depth, and find that targeting the hierarchy beats any single layer. For efficiency, our approach concatenates all target levels into a single prediction rather than separate predictor pathways, though I was toying with that during prototyping. We run at full ImageNet-1k scale across ViT-S/B/L if you're curious how it scales: arxiv.org/abs/2603.15553
James Chen@jchencxh·
New blog post (w/ experiments)! I lay out the principles for learning good lower-level abstractions, and further show that, contrary to popular belief, predicting lower-level abstractions is useful for downstream representations.

I lay out this problem statement for deep objectives: deep supervision should aim to balance the composing, retaining, and dispersing of intermediate abstractions in order to learn the ideal set of abstractions in the final representation.

I propose a prediction task on an abstraction hierarchy as a good deep objective, and show that by weighing the prediction of higher- vs lower-level abstractions, we are able to control how many lower-level abstractions we retain vs how many higher-level abstractions we compose in the representations.

Concretely, I design a prototype objective based off of I-JEPA that explicitly learns good lower-level abstractions via this deep supervision objective, and explore the properties of weighing the prediction task over the hierarchy, showing that this deep supervision task helps improve the semantic performance of the final representation (compared to vanilla I-JEPA). Further, I show that shaping the final representation by predicting well-shaped lower-level abstractions boosts semantic performance.

Code + blog link: jchencxh.github.io/blog/construct…

Experiments were done at ViT-B scale on ImageNet-100.
Scott Lowe@scottclowe·
@leothecurious Agreed - MAE and JEPA have been increasingly popular in other modalities, with recent papers in audio and multimodal models. So I think there is a lot of scope for transferring improvements and building this out into a general framework.
davinci@leothecurious·
@scottclowe i remain optimistic there's tons of potential left to be unlocked in this research direction. can't wait to see where this goes next. i can very well see it becoming the default for general-purpose perceptual learning (even for other high-dim modalities such as audio).
davinci@leothecurious·
paper dropped literally within the same week of me posting these. research pace so fast in the singularity u can see ur hypotheses being more or less validated in real time! x.com/i/status/20375…
Scott Lowe@scottclowe·
@leothecurious Thank you! It means a lot to have the work well received after so long working on the project.
davinci@leothecurious·
@scottclowe yes, was very pleased to see the high overlap in terminology from those screenshots from ur paper. i also found the collapse despite pixel grounding surprising tbh. congrats, very interesting work! x.com/i/status/20377…
davinci@leothecurious

note: there's admittedly (and expectedly) more nuance to it than my previous posts implied and it seems like representation collapse can still be possible even with pixel grounding if the masked and seen patches are too visually close and can be trivially predicted

Scott Lowe@scottclowe·
@murloren @ylecun @AdrienBardes Interesting convergence - we independently arrived at the same core idea as your "deep self-supervision" component (multi-layer hidden self-distillation) in our Bootleg paper, released on arXiv on the exact same day: arxiv.org/abs/2603.15553
Loren@murloren·
I am very happy to share the result of my internship at FAIR (Meta): V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning with @ylecun @AdrienBardes Our approach learns dense, spatially coherent features from video while preserving strong global understanding
Scott Lowe@scottclowe·
Interesting timing: Meta released V-JEPA 2.1 on literally the same day as our Bootleg paper, independently arriving at the same core idea: self-distillation of multiple hidden layers as prediction targets, evenly spaced across the encoder.

The details are strikingly similar: ~4 target blocks, per-level normalization, concatenated along the channel dimension, EMA teacher.

Their ablation actually shows that multi-level prediction is what makes their new context loss viable: without it, the context loss destroys classification accuracy (-10pp). Hidden self-distillation is doing the heavy lifting.

Great to see convergent evidence from Meta's JEPA team confirming that this is a fundamental improvement to the framework. Our paper provides detailed ablations and analysis of why it works; V-JEPA 2.1 shows it scales to ViT-G and video.

Bootleg paper: arxiv.org/abs/2603.15553
V-JEPA 2.1 paper: arxiv.org/abs/2603.14482
Scott Lowe@scottclowe·
Thanks for sharing! Yes, I'll cite LayerLock in future as a related method. It would be nice if we could do a like-for-like comparison of the two methods in the future. Incidentally, your LayerLock method reminds me a lot of how deep-RBMs used to be trained (presumably that was part of your inspiration).
Scott Lowe@scottclowe·
Note that even if we controlled for the compute and data used for pretraining, the comparison still would not be like-for-like as our method doesn't use any in-batch contrastive components. This makes our method more portable to other domains (more general) and to low-compute settings (no need for an enormous batch size), but it will underperform SOTA methodologies that were tuned to be optimal for image data.
Scott Lowe@scottclowe·
@omarmoustafa280 We compare against DINOv2 in the full version of the table shown in the paper. DINOv2 outperforms us, but we don't have a fair comparison, as their model was trained with a lot more compute on a larger dataset, for longer, and distilled from a teacher.
Scott Lowe@scottclowe·
@LatosLouis Yes, exactly! Our self-supervision methodology is analogous to a predictive coding model of the brain. The model has to predict what latent representations it would receive from neighbouring locations, with hidden blocks being analogous to different regions of the visual cortex.
Scott Lowe@scottclowe·
A few surprising findings from the paper:

The best single target layer for I-JEPA isn't the final layer: it's block 2 for ViT-S and block 6 for ViT-B. But you don't need to pick: targeting multiple layers simultaneously beats any single layer.

MAE with hidden self-distillation completely collapses at deeper target layers. But just switching from random masking to block-structured masking is enough to stabilize it. Masking strategy matters far more than you'd expect.
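For illustration, the difference between the two masking strategies mentioned above can be sketched as follows. This is a toy numpy sketch with assumed grid and block sizes, not the paper's implementation:

```python
import numpy as np

def random_mask(grid=14, ratio=0.5, rng=None):
    """i.i.d. patch masking (MAE-style): masked patches scattered freely."""
    rng = rng or np.random.default_rng()
    flat = np.zeros(grid * grid, dtype=bool)
    flat[rng.choice(grid * grid, int(grid * grid * ratio), replace=False)] = True
    return flat.reshape(grid, grid)

def block_mask(grid=14, block=6, rng=None):
    """Block-structured masking (I-JEPA-style): one contiguous region."""
    rng = rng or np.random.default_rng()
    top, left = rng.integers(0, grid - block + 1, size=2)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + block, left:left + block] = True
    return mask

m = block_mask(rng=np.random.default_rng(0))
print(m.sum())  # 36 contiguous patches masked
```

The intuition for the stabilization result is that contiguous blocks prevent masked patches from being trivially interpolated from adjacent visible neighbours, which random masking permits.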