
Our #ECCV2020 paper is out arxiv.org/abs/2008.01392
By learning to predict a masked token in a caption using its associated image, our model learns generic visual representations from scratch using only image-caption pairs.
w/ @mbsariyildiz @perezjln @naverlabseurope
GIF
English

