Brian Gordon
@Brian_Gordon13
Research Intern @ Google | https://t.co/YF6cq9yyny @ Tel-Aviv University

It was a privilege to present our work at the Google booth: RefVNLI: Scalable Evaluation of Subject-driven Text-to-Image Generation (refvnli.github.io), led by @lovodkin93

Google presents Diffusion Models Are Real-Time Game Engines
discuss: huggingface.co/papers/2408.14…

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next-frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

GameNGen is trained in two phases: (1) an RL agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable autoregressive generation over long trajectories.
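As a rough illustration of the phase-2 recipe described above, here is a minimal, hypothetical PyTorch sketch of action-conditioned next-frame diffusion training. The model, layer sizes, noise schedule, and the `training_step` helper are my own simplifications for exposition, not the paper's architecture or code; only the overall idea (denoise the next frame given past frames and actions, with noise-augmented context) comes from the abstract.

```python
# Hypothetical sketch, not GameNGen's actual code: a toy denoiser that predicts
# the noise added to the next frame, conditioned on past frames and actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextFrameDenoiser(nn.Module):
    """Toy stand-in for the diffusion UNet the paper describes."""
    def __init__(self, channels=3, ctx_len=4, n_actions=8, dim=64):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, dim)
        self.time_mlp = nn.Linear(1, dim)  # crude diffusion-time embedding
        # Past frames are concatenated with the noisy target along channels.
        self.conv_in = nn.Conv2d(channels * (ctx_len + 1), dim, 3, padding=1)
        self.conv_out = nn.Conv2d(dim, channels, 3, padding=1)

    def forward(self, noisy_next, ctx_frames, actions, t):
        b, l, c, h, w = ctx_frames.shape
        x = torch.cat([ctx_frames.reshape(b, l * c, h, w), noisy_next], dim=1)
        # Fold action and timestep conditioning in additively, for brevity.
        cond = self.action_emb(actions).mean(dim=1) + self.time_mlp(t[:, None])
        feat = F.silu(self.conv_in(x) + cond[:, :, None, None])
        return self.conv_out(feat)

def training_step(model, opt, ctx_frames, actions, next_frame):
    b = next_frame.shape[0]
    t = torch.rand(b, device=next_frame.device)       # diffusion time in [0, 1]
    noise = torch.randn_like(next_frame)
    noisy_next = next_frame + t[:, None, None, None] * noise
    # Conditioning augmentation: corrupt the context frames during training so
    # the model tolerates its own imperfect outputs in autoregressive rollout.
    ctx_aug = ctx_frames + 0.1 * torch.randn_like(ctx_frames)
    pred = model(noisy_next, ctx_aug, actions, t)
    loss = F.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage on dummy data:
model = NextFrameDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ctx = torch.randn(2, 4, 3, 64, 64)     # batch of two 4-frame contexts
acts = torch.randint(0, 8, (2, 4))     # one discrete action per context frame
nxt = torch.randn(2, 3, 64, 64)
print(training_step(model, opt, ctx, acts, nxt))
```

At inference time, the generated frame would be appended to the context and the loop repeated, which is exactly the regime the conditioning augmentation is meant to stabilize.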

Visual Riddles: A Commonsense and World Knowledge Challenge for Large Vision and Language Models

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios.

To study these skills, we present Visual Riddles, a benchmark aimed at testing vision-and-language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by one of a variety of text-to-image models, a question, a ground-truth answer, a textual hint, and attribution. Human evaluation reveals that existing models lag significantly behind human performance, which stands at 82% accuracy, with Gemini-Pro-1.5 leading among models at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision-and-language models' capabilities in interpreting complex visual scenarios.
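To make the benchmark structure concrete, here is a small, hypothetical sketch of what a riddle record and the accuracy metric from the abstract might look like. The field names, the `model_answer` callable, and the `judge` callable are my assumptions, not the released dataset schema or evaluation API.

```python
# Hypothetical sketch of a Visual Riddles record and a scalable accuracy loop.
from dataclasses import dataclass

@dataclass
class Riddle:
    image_path: str        # unique image produced by a text-to-image model
    question: str
    ground_truth: str
    hint: str              # textual hint a model may optionally receive
    attribution: str       # source grounding the world-knowledge answer

def accuracy(riddles, model_answer, judge, use_hint=False):
    """Score a model with an automatic judge (e.g., an LLM-as-judge)."""
    correct = 0
    for r in riddles:
        prompt = r.question + (f"\nHint: {r.hint}" if use_hint else "")
        answer = model_answer(r.image_path, prompt)
        correct += judge(answer, r.ground_truth)   # judge returns 0 or 1
    return correct / len(riddles)

# Usage with dummy callables:
riddles = [Riddle("mosquito.png", "Why is the person scratching their arm?",
                  "A mosquito bit them.", "Look near the arm.", "common knowledge")]
dummy_model = lambda img, q: "A mosquito bit them."
exact_match = lambda a, gt: int(a.strip().lower() == gt.strip().lower())
print(accuracy(riddles, dummy_model, exact_match))  # 1.0
```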

1/📄 Excited to introduce our paper "Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment"!🖼️👀 arxiv.org/abs/2312.03766 Website: mismatch-quest.github.io w. @YonatanBitton, @shafir_yoni, @roopalgarg, Xi Chen, @DaniLischinski, @DanielCohenOr1, Idan Szpektor 🧵

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
paper page: huggingface.co/papers/2312.03…

While existing image-text alignment models achieve high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual-grounding models to automatically construct a training set that contains plausible misaligned captions for a given image, along with corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations.

Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment-classification and explanation-generation tasks.
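As a rough sketch of the training-set construction described above: an LLM perturbs an aligned caption into a plausible misaligned one plus a textual explanation, and a visual-grounding model localizes the contradicted phrase as a bounding box. The `llm` and `ground` callables, field names, and prompts below are placeholders of my own; the paper's actual prompts and models are not shown here.

```python
# Hypothetical sketch, not the paper's pipeline code.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class FeedbackExample:
    image_path: str
    misaligned_caption: str
    explanation: str                      # textual feedback
    box: Tuple[int, int, int, int]        # (x1, y1, x2, y2) visual feedback

def build_example(image_path: str, aligned_caption: str,
                  llm: Callable[[str], str],
                  ground: Callable[[str, str], Tuple[int, int, int, int]]
                  ) -> FeedbackExample:
    # 1) Perturb one detail of the caption (object, attribute, relation, ...).
    bad_caption = llm(f"Change exactly one detail: {aligned_caption}")
    # 2) Explain what the perturbed caption gets wrong about the image.
    explanation = llm(f"Explain the mismatch between '{bad_caption}' "
                      f"and an image of '{aligned_caption}'.")
    # 3) Localize the contradicted phrase in the image with a grounding model.
    box = ground(image_path, bad_caption)
    return FeedbackExample(image_path, bad_caption, explanation, box)
```

Fine-tuning a vision-language model on examples of this shape is what the abstract credits for the gains on both the classification and explanation tasks.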
