Scott Lowe

404 posts

@scottclowe

Postdoctoral Research Fellow in Machine Learning at Vector Institute. PhD in Computational Neuroscience. British.

Joined April 2009
186 Following · 333 Followers
Pinned Tweet
Scott Lowe@scottclowe·
New paper: "Self-Distillation of Hidden Layers for Self-Supervised Representation Learning"

We introduce Bootleg, a simple twist on I-JEPA/MAE that dramatically improves self-supervised representations.

The idea: MAE predicts pixels (stable but low-level). I-JEPA predicts final-layer embeddings (high-level but unstable). Bootleg bridges the two by predicting representations from multiple hidden layers of the teacher network (early, middle, and late) simultaneously.

Why it works: early layers provide stimulus-driven grounding that prevents collapse; deep layers provide semantic targets; and the information bottleneck of compressing all abstraction levels through masked patches forces the encoder to build richer representations.

The method is quite simple on top of I-JEPA: extract targets from evenly-spaced blocks, z-score and concatenate, widen the predictor's final layer. That's it.

Frozen probe results (no fine-tuning):
ImageNet-1K: 76.7% with ViT-B (+10pp over both I-JEPA and MAE)
iNaturalist-21: 58.3% with ViT-B (+17pp over I-JEPA, +15pp over MAE)
ADE20K segmentation: 30.9% mIoU with ViT-B (+11pp over I-JEPA, +6pp over MAE)
Cityscapes segmentation: 35.9% mIoU with ViT-B (+11pp over I-JEPA, +5pp over MAE)

Gains hold across ViT-S, ViT-B, and ViT-L. Single-view, batch-size independent: no augmentation stack, no multi-crop, no contrastive loss, no large compute requirements.

Our study is just on images, but this change can be readily deployed to MAE and JEPA models across all domains.

arxiv.org/abs/2603.15553
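The three-step recipe in the tweet (extract targets from evenly spaced teacher blocks, z-score each level, concatenate) can be sketched in a few lines. This is my own minimal numpy illustration, not the paper's code; the layer indices, epsilon, and shapes are assumptions.

```python
import numpy as np

def build_targets(hidden_states, layer_ids=(3, 7, 11)):
    """Collect teacher activations from evenly spaced blocks,
    z-score each level, and concatenate along the channel axis."""
    levels = []
    for i in layer_ids:
        h = hidden_states[i]                       # (batch, tokens, dim)
        mu = h.mean(axis=(0, 1), keepdims=True)    # per-channel mean
        sd = h.std(axis=(0, 1), keepdims=True)     # per-channel std
        levels.append((h - mu) / (sd + 1e-6))      # z-score this level
    return np.concatenate(levels, axis=-1)

# Toy usage: 12 layers of fake activations for a ViT-B-sized model
states = [np.random.randn(2, 196, 768) for _ in range(12)]
targets = build_targets(states)
print(targets.shape)  # (2, 196, 2304)
```

The predictor's final layer then only needs widening to `dim * len(layer_ids)` channels to regress this concatenated target.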
Scott Lowe@scottclowe·
Yes, exactly! We ask the student to compress the information needed to decode multiple hidden layers within the teacher into the space of just one layer (its output layer). The targets are taken from teacher layers spaced across depth, so they are decorrelated from each other, and they are spatially separated from the seen patches by the block masking, so they are decorrelated from those too.
Pravesh Biyani@pravesh·
@scottclowe Very nice work! Can I invite you to give an online talk at our Institute?
Scott Lowe@scottclowe·
Regular JEPA is with masking, yes. Our innovation is self-distillation of multiple hidden layers: asking the student to predict multiple representations from within the teacher. It had not been done before, but other works have done related things recently. Most notably, V-JEPA 2.1 actually came up with an almost identical methodology at the same time (which they dub "deep self-supervision"), and which coincidentally was released on arXiv on the same day as Bootleg.
Dan Ofer (Was @ICML,@Worldcon )
@scottclowe It wasn't done already? I've been doing masked + JEPA as the baseline, I assumed everyone did? (It works much better than just JEPA.)
Scott Lowe@scottclowe·
@BarneyFlames Yes, I will add this in the future. Thank you for the feedback.
Total NIMBY Death@BarneyFlames·
@scottclowe Great work, but it would have been nice to have at least one eval on a non-ImageNet training set.
Scott Lowe@scottclowe·
(1) Sin-cos position encodings were used for both the encoder and predictor, following the methodologies of MAE and I-JEPA, which we "interpolate between".
(2) Learnable embeddings were no better than frozen sin-cos positions.
(3) Preliminary experiments showed swapping to RoPE increased performance, but it costs 30% more compute.
(4) We found using RoPE was anti-synergistic with using register tokens. Using either one increased performance, but using both at once did not. I am not aware that this phenomenon has been reported before. As register tokens have negligible compute burden, we opted for frozen sin-cos positions + registers instead of RoPE without registers.
See the final paragraph of Appendix B for details: arxiv.org/html/2603.1555…
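For readers unfamiliar with the frozen sin-cos encodings discussed here, a common MAE-style 2D construction looks like the sketch below. The function names and details are mine, not taken from the paper: half the channels encode the row index, half the column index, each with the usual sinusoidal frequencies.

```python
import numpy as np

def sincos_pos_embed_1d(dim, positions):
    """Frozen sinusoidal embedding for one axis (dim must be even)."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    out = positions[:, None] * omega[None, :]            # (n, dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def sincos_pos_embed_2d(dim, grid_size):
    """2D grid embedding: half the channels encode rows, half columns."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                         indexing="ij")
    emb_y = sincos_pos_embed_1d(dim // 2, ys.reshape(-1).astype(float))
    emb_x = sincos_pos_embed_1d(dim // 2, xs.reshape(-1).astype(float))
    return np.concatenate([emb_y, emb_x], axis=1)        # (grid², dim)

pe = sincos_pos_embed_2d(768, 14)  # 14x14 patch grid, ViT-B/16 at 224px
print(pe.shape)  # (196, 768)
```

Because the table is deterministic, it can be computed once at model build time and added to the patch tokens with no learnable parameters.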
Wcabca@WCelhen·
@scottclowe why do you use frozen sin-cos encoding for the predictor only?
Philip Akomolafe@PhilipAkomolaf_·
@scottclowe @ylecun Thanks for the guidance, I will do them, and also reach out on progress. Would that be good, sir?
Scott Lowe@scottclowe·
@PhilipAkomolaf_ @ylecun You can use 1 GPU without issue. I used 8 GPUs to replicate the I-JEPA recipe as best I could, but we have no batch-size-dependent loss terms, so you can reduce it to 1 without issue. If you want to replicate our work exactly on 1 GPU, just use gradient accumulation over 8 batches.
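The equivalence Scott relies on here (8 GPUs versus gradient accumulation over 8 micro-batches) can be sketched with a toy linear model; this example is mine, not from the paper. With a mean loss and equal-sized micro-batches, averaging the per-micro-batch gradients reproduces the full-batch gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 4)), rng.normal(size=256)
w = np.zeros(4)

def grad(w, xb, yb):
    """Mean-squared-error gradient of a linear model on one micro-batch."""
    return 2 * xb.T @ (xb @ w - yb) / len(yb)

# One large batch of 256 ...
g_full = grad(w, X, y)

# ... equals the average over 8 micro-batches of 32 (gradient accumulation)
g_accum = np.mean([grad(w, X[i::8], y[i::8]) for i in range(8)], axis=0)
print(np.allclose(g_full, g_accum))  # True
```

In a real training loop, the same effect is achieved by scaling each micro-batch loss by 1/8 and only stepping the optimizer after 8 backward passes.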
Scott Lowe@scottclowe·
Really cool work - the core idea is strikingly similar to what we found with Bootleg. We also predict multiple teacher layers spread across depth, and find that targeting the hierarchy beats any single layer. For efficiency, our approach concatenates all target levels into a single prediction rather than separate predictor pathways, though I was toying with that during prototyping. We run at full ImageNet-1k scale across ViT-S/B/L if you're curious how it scales: arxiv.org/abs/2603.15553
James Chen@jchencxh·
New blog post (w/ experiments)! I lay out the principles for learning good lower-level abstractions, and further show that, contrary to popular belief, predicting lower-level abstractions is useful for downstream representations.

I lay out this problem statement for deep objectives: deep supervision should aim to balance the composing, retaining, and dispersing of intermediate abstractions in order to learn the ideal set of abstractions in the final representation.

I propose a prediction task on an abstraction hierarchy as a good deep objective, and show that by weighing the prediction of higher- vs lower-level abstractions, we are able to control how many lower-level abstractions we retain vs how many higher-level abstractions we compose in the representations.

Concretely, I design a prototype objective based off of I-JEPA that explicitly learns good lower-level abstractions via this deep supervision objective, and explore the properties of weighing the prediction task over the hierarchy, showing that this deep supervision task helps improve the semantic performance of the final representation (compared to vanilla I-JEPA). Further, I show that shaping the final representation by predicting well-shaped lower-level abstractions boosts semantic performance.

Code + blog link: jchencxh.github.io/blog/construct…

Experiments were done at ViT-B scale on ImageNet-100.
Scott Lowe@scottclowe·
@leothecurious Agreed - MAE and JEPA have been increasingly popular in other modalities, with recent papers in audio and multimodal models. So I think there is a lot of scope for transferring improvements and building this out into a general framework.
davinci@leothecurious·
@scottclowe i remain optimistic there's tons of potential left to be unlocked in this research direction. can't wait to see where this goes next. i can very well see it becoming the default for general-purpose perceptual learning (even for other high-dim modalities such as audio).
davinci@leothecurious·
paper dropped literally within the same week of me posting these. research pace so fast in the singularity u can see ur hypotheses being more or less validated in real time! x.com/i/status/20375…
Scott Lowe@scottclowe·
@leothecurious Thank you! It means a lot to have the work well received after so long working on the project.
davinci@leothecurious·
@scottclowe yes, was very pleased to see the high overlap in terminology from those screenshots from ur paper. i also found the collapse despite pixel grounding surprising tbh. congrats, very interesting work! x.com/i/status/20377…
davinci@leothecurious

note: there's admittedly (and expectedly) more nuance to it than my previous posts implied and it seems like representation collapse can still be possible even with pixel grounding if the masked and seen patches are too visually close and can be trivially predicted

Scott Lowe@scottclowe·
@murloren @ylecun @AdrienBardes Interesting convergence - we independently arrived at the same core idea as your "deep self-supervision" component (multi-layer hidden self-distillation) in our Bootleg paper, released on arXiv on the exact same day: arxiv.org/abs/2603.15553
Loren@murloren·
I am very happy to share the result of my internship at FAIR (Meta): V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning with @ylecun @AdrienBardes Our approach learns dense, spatially coherent features from video while preserving strong global understanding
Scott Lowe@scottclowe·
Interesting timing: Meta released V-JEPA 2.1 on literally the same day as our Bootleg paper, independently arriving at the same core idea: self-distillation of multiple hidden layers as prediction targets, evenly spaced across the encoder.

The details are strikingly similar: ~4 target blocks, per-level normalization, concatenated along the channel dimension, EMA teacher.

Their ablation actually shows that multi-level prediction is what makes their new context loss viable: without it, the context loss destroys classification accuracy (-10pp). Hidden self-distillation is doing the heavy lifting.

Great to see convergent evidence from Meta's JEPA team confirming that this is a fundamental improvement to the framework. Our paper provides detailed ablations and analysis of why it works; V-JEPA 2.1 shows it scales to ViT-G and video.

Bootleg paper: arxiv.org/abs/2603.15553
V-JEPA 2.1 paper: arxiv.org/abs/2603.14482
Scott Lowe@scottclowe·
Thanks for sharing! Yes, I'll cite LayerLock in future as a related method. It would be nice if we could do a like-for-like comparison of the two methods in the future. Incidentally, your LayerLock method reminds me a lot of how deep-RBMs used to be trained (presumably that was part of your inspiration).
Scott Lowe@scottclowe·
Note that even if we controlled for the compute and data used for pretraining, the comparison still would not be like-for-like as our method doesn't use any in-batch contrastive components. This makes our method more portable to other domains (more general) and to low-compute settings (no need for an enormous batch size), but it will underperform SOTA methodologies that were tuned to be optimal for image data.
Scott Lowe@scottclowe·
@omarmoustafa280 We compare against DINOv2 in the full version of the table shown in the paper. DINOv2 outperforms us, but we don't have a fair comparison, as their model was trained with a lot more compute on a larger dataset, for longer, and distilled from a teacher.
Scott Lowe@scottclowe·
@LatosLouis Yes, exactly! Our self-supervision methodology is analogous to a predictive coding model of the brain. The model has to predict what latent representations it would receive from neighbouring locations, with hidden blocks being analogous to different regions of the visual cortex.
Scott Lowe@scottclowe·
A few surprising findings from the paper:

The best single target layer for I-JEPA isn't the final layer: it's block 2 for ViT-S and block 6 for ViT-B. But you don't need to pick: targeting multiple layers simultaneously beats any single layer.

MAE with hidden self-distillation completely collapses at deeper target layers. But just switching from random masking to block-structured masking is enough to stabilize it. Masking strategy matters far more than you'd expect.
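For illustration, the difference between the two masking strategies mentioned above can be sketched as follows. This is a toy numpy sketch with assumed grid and block sizes, not the paper's implementation:

```python
import numpy as np

def random_mask(grid=14, ratio=0.5, rng=None):
    """i.i.d. patch masking (MAE-style): masked patches scattered freely."""
    rng = rng or np.random.default_rng()
    flat = np.zeros(grid * grid, dtype=bool)
    flat[rng.choice(grid * grid, int(grid * grid * ratio), replace=False)] = True
    return flat.reshape(grid, grid)

def block_mask(grid=14, block=6, rng=None):
    """Block-structured masking (I-JEPA-style): one contiguous region."""
    rng = rng or np.random.default_rng()
    top, left = rng.integers(0, grid - block + 1, size=2)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + block, left:left + block] = True
    return mask

m = block_mask(rng=np.random.default_rng(0))
print(m.sum())  # 36 contiguous patches masked
```

The intuition for the stabilization result is that contiguous blocks prevent masked patches from being trivially interpolated from adjacent visible neighbours, which random masking permits.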