Yuning Chai retweeted

(1/N) Can a large multimodal model understand not only bboxes but also arbitrary visual prompts (scribbles, arrows, etc.) without explicit region embeddings? Yes! Our latest work, ViP-LLaVA (arxiv.org/abs/2312.00784), shows an extremely simple but effective approach.
