Divyanshu Mishra
@Perceptron97
518 posts

Research @AmazonScience. DPhil from @UniOfOxford @NobleLabOxford. Interested in video understanding, world foundation models.

Oxford, England · Joined December 2010
578 Following · 198 Followers
Pinned Tweet
Divyanshu Mishra @Perceptron97
🚀 We’re excited to announce that our paper, “STAN-LOC: Visual Query-based Video Clip Localization for Fetal Ultrasound Sweep Videos,” has been accepted to #MICCAI2024! 🎉
1 reply · 5 retweets · 10 likes · 1.4K views
Divyanshu Mishra retweeted
David Fan @DavidJFan
[1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations, data, world modeling, architecture, and scaling behavior!
Paper: arxiv.org/abs/2603.03276
Website: beyond-llms.github.io
@TongPetersb, @DavidJFan, @__JohnNguyen__, @ellisbrown, @GaoyueZhou, @JasonQSY, @boyangzheng, @webalorn, @han_junlin, @rob_fergus, @NailaMurray, @gh_marjan, @ml_perception, Nicolas Ballas, @_amirbar, Michael Rabbat, Jakob Verbeek, @LukeZettlemoyer, @koustuvsinha, @ylecun, @sainingxie
12 replies · 62 retweets · 303 likes · 49.9K views
Saar Huberman @HubermanSaar
SemanticMoments - semantic motion similarity.
How do you find videos with similar motion? It's harder than it sounds. Models like VideoMAE and V-JEPA encode motion, but their embeddings are dominated by appearance. So how do we build a compact embedding for motion similarity?
Joint work with @kfir99 @OPatashnik @BenaimSagie @MokadyRon
8 replies · 29 retweets · 182 likes · 26.4K views
Divyanshu Mishra @Perceptron97
@alifmunim Amazing work by the team 👏 Really impressive scale and results. Curious whether you considered comparisons with other recent video SSL architectures from around the V-JEPA2 timeframe (~2025), particularly to understand how different SSL methods scale for heart ultrasound.
0 replies · 0 retweets · 0 likes · 26 views
Divyanshu Mishra retweeted
Yash Bhalgat @ysbhalgat
PhD applicants take note. As @j_foerst said, the funding situation this year is not good. My advice is to consider CDTs if you are applying for an AI PhD in the UK. I am with @aims_oxford and highly recommend applying. Deadline: 28 January 2026 (check this).
Jakob Foerster@j_foerst

Hello World: I am reviewing PhD applications and the level of talent is amazing. Sadly, the funding situation is extremely challenging. SO: if you'd like to gift someone brilliant literally the opportunity of their lifetime and sponsor their PhD in my group, please let me know 🙏

0 replies · 1 retweet · 4 likes · 635 views
Divyanshu Mishra retweeted
Martin Ziqiao Ma @ziqiao_ma
NEPA: Next-Embedding Predictive Autoregression
A simple objective for visual SSL and generative pretraining. Instead of reconstructing pixels or predicting discrete tokens, we train an autoregressive model to predict the next embedding given all previous embeddings.
Key ideas:
- One self-supervised signal: cosine-style next-embedding prediction
- Autoregression runs directly on the embeddings from a native encoder (no offline encoder)
- No pixel decoder (and loss), no contrastive pairs, no task-specific heads, no random masks
Scales to modern ViT backbones and stays competitive after supervised fine-tuning:
- ImageNet-1K (Base 83.8%; Large 85.3%)
- ADE20K
Fully open-sourced with reproducibility verified:
- Homepage: sihanxu.me/nepa/
- Paper: arxiv.org/abs/2512.16922
- Code: github.com/SihanXU/nepa
- Weights: huggingface.co/collections/Si…
This work is led by @6SihanXu and advised by @SLED_AI, @sainingxie, and Stella X. Yu. Contributors: me, @wenhaocha1, @ChenXuweiyi, and @JinWeiyang18434.
20 replies · 100 retweets · 732 likes · 141.5K views
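To make the objective above concrete, here is a minimal sketch of a NEPA-style cosine next-embedding loss. This is not the released code (see github.com/SihanXU/nepa for that); the predictor module, the shapes, and the detached targets are illustrative assumptions.

```python
# Hypothetical sketch of a NEPA-style next-embedding loss (illustrative, not
# the official implementation; names, shapes, and detaching are assumptions).
import torch
import torch.nn.functional as F

def next_embedding_loss(embeddings: torch.Tensor, predictor: torch.nn.Module) -> torch.Tensor:
    """embeddings: [B, T, D] sequence of embeddings from the model's own
    (native) encoder. predictor: any causal module mapping [B, T, D] ->
    [B, T, D], e.g. a Transformer with a causal attention mask, so the
    prediction at step t only sees embeddings up to step t."""
    preds = predictor(embeddings)
    pred_next = preds[:, :-1, :]                 # predictions for steps 1..T-1
    target_next = embeddings[:, 1:, :].detach()  # the actual next embeddings
    # Cosine-style objective: maximize similarity between predicted and
    # actual next embedding; no pixel decoder, no contrastive pairs, no masks.
    cos = F.cosine_similarity(pred_next, target_next, dim=-1)  # [B, T-1]
    return (1.0 - cos).mean()
```

The point of the sketch is the "no offline encoder" claim in the tweet: the targets come from the same network that is being trained, not from a frozen teacher.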
Divyanshu Mishra retweeted
Tengda Han @TengdaHan
Humans learn from unique data -- everyone's OWN life -- but our visual representations eventually align. In our recent work "Unique Lives, Shared World" @GoogleDeepMind, we train models with "single-life" videos from distinct sources, and study their alignment and generalisation.
10 replies · 31 retweets · 146 likes · 12.8K views
Divyanshu Mishra retweeted
Sindhu Hegde @SindhuBHegde
🎉Thrilled to be awarded the 2025 Google PhD Fellowship in Machine Perception for my research on human gesture understanding! Huge thanks to my advisor Prof. Andrew Zisserman for his constant guidance & to @GoogleAI, @Googleorg for this incredible honor. @Oxford_VGG @UniofOxford
Google.org@Googleorg

🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: goo.gle/43wJWw8

0 replies · 2 retweets · 13 likes · 2.8K views
Divyanshu Mishra retweeted
Shashank @shawshank_v
An amazingly well-written blog by @ysbhalgat, a must-read for prospective PhD students.
Yash Bhalgat@ysbhalgat

💡 Should you do a PhD in AI (2025–26)? 🎓
🔗: yashbhalgat.github.io/blog/phd-or-no…
Every October, students considering PhD applications ask me: is a PhD still the right path in AI? ⚖️
⚠️ After a few years moving between academia (@UniofOxford's @Oxford_VGG) and industry (@QCOMResearch, @Meta Reality Labs, and a few startups), I've seen both sides of the research world. And the truth is: they've never felt further apart.
🌟 Today, most of the *scale-driven* work -- world models, video generation, large VLMs -- happens in industry. Compute access, data scale, and iteration speed make that inevitable. But academia still matters: it's where new ideas, theory, and deep conceptual work often begin. The difference now is knowing what not to work on.
📢 I've written a longer, no-BS post on this -- what makes a PhD worth it, when it isn't, and how to think about your timing. 🧭 Read it, share it, debate it -- just don't decide by inertia.
Full post here: yashbhalgat.github.io/blog/phd-or-no…
#PhD #AI #ArtificialIntelligence #MachineLearning #PhDLife #Research #AcademicTwitter #GradSchool #CareerAdvice

0 replies · 3 retweets · 10 likes · 2.1K views
Divyanshu Mishra retweeted
Shashank @shawshank_v
Really excited to be giving a talk on “Openness of Vision Foundation Models” at the FOUND workshop tomorrow (19 Oct) at 10:20am, room 316C. Thanks to @HirokatuKataoka and colleagues for the invite. Looking forward to interacting with you all.
Hirokatsu Kataoka | 片岡裕雄@HirokatuKataoka

At ICCV 2025, I am organizing two workshops: the LIMIT Workshop and the FOUND Workshop.
◆ LIMIT Workshop (19 Oct, PM): iccv2025-limit-workshop.limitlab.xyz
◆ FOUND Workshop (19 Oct, AM): iccv2025-found-workshop.limitlab.xyz
We warmly invite you to attend these workshops at ICCV 2025 in Hawaii!

1 reply · 11 retweets · 14 likes · 6.8K views
Divyanshu Mishra retweeted
Yuki @y_m_asano
Our paper 'Self-Labelling via Simultaneous Clustering and Representation Learning' just got its 1000th citation. On that occasion, I want to give my perspective on this question: who or what is Sinkhorn-Knopp?
Short answer: it's the little ~1960s matrix-normalization workhorse that now underpins modern self-supervised vision training -- think DINOv2, DINOv3, and Franca.
Medium-long answer: it's an entropy-regularized clustering routine that can run online. The entropy term spreads mass across prototypes (or equivalently "clusters" or "pseudolabels"), discouraging empty clusters and collapse -- hence its popularity in modern SSL losses (DINO/iBOT-style heads).
Long answer (history, intuition, relation to models like DINOv3) 👇
2 replies · 22 retweets · 125 likes · 14.7K views
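For readers new to the routine, here is a minimal NumPy sketch of the entropy-regularized Sinkhorn-Knopp assignment step as it is typically used in SSL heads. The epsilon value, iteration count, and names are illustrative assumptions, not taken from the paper.

```python
# Minimal Sinkhorn-Knopp sketch for balanced pseudo-label assignment
# (illustrative; epsilon and the iteration count are assumptions).
import numpy as np

def sinkhorn_knopp(scores: np.ndarray, epsilon: float = 0.05, n_iters: int = 3) -> np.ndarray:
    """scores: [N, K] similarities between N samples and K prototypes.
    Returns soft assignments Q where each row is a distribution over
    prototypes and every prototype receives equal total mass overall,
    which is what discourages empty clusters and collapse."""
    Q = np.exp((scores - scores.max()) / epsilon)  # entropy term: smaller epsilon -> sharper assignments
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= K  # columns: equal mass per prototype
        Q /= Q.sum(axis=1, keepdims=True); Q /= N  # rows: one unit of mass per sample
    return Q * N  # rescale so each row sums to 1
```

Because the normalizations only touch the current batch of scores, the routine can run online inside a training loop, which is the property the tweet highlights.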
Divyanshu Mishra retweeted
Hermione Warr @Hermionegrace76
📣 Excited to present our work at the ELAMI workshop #MICCAI_2025!
🗣️ Talk: Sept 27, 9:36am
🔍 Does the way we tokenize language affect the performance of modern LMs in Radiology?
📄 Paper: arxiv.org/abs/2508.09952
🧵 (1/5)
1 reply · 2 retweets · 6 likes · 621 views
Divyanshu Mishra retweeted
Angus Nicolson @angusjnic
We're hiring! New postdoc position in our Digital Cardiology Lab at the Medical University of Innsbruck. Check out the post on LinkedIn and feel free to reach out if you have any questions. linkedin.com/jobs/view/4288…
0 replies · 1 retweet · 5 likes · 258 views
Divyanshu Mishra retweeted
Shashank @shawshank_v
Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with a ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, and DINOv2 on various benchmarks, setting a new standard for open-source research 🧵
13 replies · 55 retweets · 275 likes · 56.5K views
Divyanshu Mishra retweeted
Cohere Labs @Cohere_Labs
Our Computer Vision group is excited to host David Fan and @TongPetersb next week on Tuesday, August 5th for a presentation on "Scaling Language-Free Visual Representation Learning" (arxiv.org/abs/2504.01017).
3 replies · 5 retweets · 21 likes · 4K views
Divyanshu Mishra retweeted
Yuki @y_m_asano
Today we release Franca, a new vision foundation model that matches and sometimes outperforms DINOv2. The data, the training code, and the model weights (with intermediate checkpoints) are open-source, allowing everyone to build on this.
Methodologically, we introduce two new SSL components: a multi-granularity SK clustering loss that utilizes Matryoshka representations, and a quick post-pretraining scheme to remove unwanted spatial biases.
This is the result of a close and fun collaboration with @valeoai (in France) and @FunAILab (in Franconia).
Shashank@shawshank_v

Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with a ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, and DINOv2 on various benchmarks, setting a new standard for open-source research 🧵

3 replies · 25 retweets · 171 likes · 13.6K views
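One plausible reading of the multi-granularity idea, sketched below: score nested Matryoshka prefixes of each embedding against granularity-specific prototype sets, then balance each granularity's assignments with Sinkhorn-Knopp as sketched earlier. The prototype shapes and the prefix slicing here are my assumptions, not Franca's actual code.

```python
# Hypothetical sketch of multi-granularity clustering logits over Matryoshka
# prefixes (an illustrative reading of the tweet, not Franca's implementation).
import torch
import torch.nn.functional as F

def multi_granularity_logits(z: torch.Tensor, prototypes: list) -> list:
    """z: [N, D] embeddings. prototypes[i]: [K_i, D_i] tensor with D_i <= D.
    Each granularity scores the first D_i dimensions of z (a nested
    Matryoshka prefix) against its own prototype set; each [N, K_i] logits
    tensor would then get balanced Sinkhorn-Knopp targets for a clustering loss."""
    logits = []
    for protos in prototypes:
        d = protos.shape[1]
        z_prefix = F.normalize(z[:, :d], dim=-1)  # nested prefix of the embedding
        c = F.normalize(protos, dim=-1)
        logits.append(z_prefix @ c.t())           # [N, K_i] cosine logits
    return logits
```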
Divyanshu Mishra retweeted
Yuki @y_m_asano
New paper accepted at @ICCVConference: MoSiC, our new post-pretraining technique for upgrading vision foundation models like DINOv2R using videos, thanks to strong point trackers and Sinkhorn clustering. Check the thread below :)
Shashank@shawshank_v

New paper out - accepted at @ICCVConference. We introduce MoSiC, a self-supervised learning framework that learns temporally consistent representations from video using motion cues. Key idea: leverage long-range point tracks to enforce dense feature coherence across time. 🧵

0 replies · 9 retweets · 37 likes · 2.7K views
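As a rough illustration of what "long-range point tracks enforcing dense feature coherence" could look like as a loss: sample the dense feature maps along each track and pull the samples toward the track's mean feature. The grid-sampling layout and the loss form below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a point-track coherence loss in the spirit of MoSiC
# (illustrative; the sampling layout and loss form are assumptions).
import torch
import torch.nn.functional as F

def track_coherence_loss(feats: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """feats: [T, C, H, W] dense per-frame feature maps.
    tracks: [N, T, 2] point-track (x, y) coordinates normalized to [-1, 1].
    Pulls the features sampled along each long-range track toward that
    track's mean feature, encouraging temporally consistent representations."""
    grid = tracks.permute(1, 0, 2).unsqueeze(2)                # [T, N, 1, 2]
    sampled = F.grid_sample(feats, grid, align_corners=False)  # [T, C, N, 1]
    sampled = F.normalize(sampled.squeeze(-1).permute(2, 0, 1), dim=-1)  # [N, T, C]
    anchor = F.normalize(sampled.mean(dim=1, keepdim=True), dim=-1)      # [N, 1, C]
    cos = (sampled * anchor).sum(dim=-1)                       # [N, T] cosine similarity
    return (1.0 - cos).mean()
```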