
Most RAG systems only see text: OCR the PDF, chunk it, embed it, done. But plenty of queries 𝘳𝘦𝘲𝘶𝘪𝘳𝘦 visual understanding to answer correctly, and text-only pipelines miss 𝗳𝗶𝗴𝘂𝗿𝗲𝘀, 𝘁𝗮𝗯𝗹𝗲𝘀, 𝘀𝗽𝗮𝘁𝗶𝗮𝗹 𝗹𝗮𝘆𝗼𝘂𝘁, 𝗮𝗻𝗱 𝘃𝗶𝘀𝘂𝗮𝗹 𝗿𝗲𝗹𝗮𝘁𝗶𝗼𝗻𝘀𝗵𝗶𝗽𝘀 that are a core part of the data.

My colleagues at @weaviate_io just published IRPAPERS, a benchmark that directly compares text-based vs. image-based retrieval over 3,230 pages from 166 scientific papers.

The setup is straightforward: process the same PDFs two ways.
• Text-based retrieval: OCR with GPT-4.1, then Arctic 2.0 embeddings + BM25 hybrid search
• Image-based retrieval: embed the raw page images with ColModernVBERT multi-vector embeddings
Then test both on 180 needle-in-haystack questions targeting specific methodological details.

Text-based retrieval edges out images at the top rank (46% vs. 43% Recall@1), but images match or exceed text at deeper recall levels (93% vs. 91% Recall@20).

More importantly, the two approaches fail on 𝘥𝘪𝘧𝘧𝘦𝘳𝘦𝘯𝘵 𝘲𝘶𝘦𝘳𝘪𝘦𝘴. At Recall@1:
• 22 queries succeed with text but fail with images
• 18 queries succeed with images but fail with text

This complementarity is what makes 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗛𝘆𝗯𝗿𝗶𝗱 𝗦𝗲𝗮𝗿𝗰𝗵 so effective. By fusing scores from both text and image retrieval, they reached 49% Recall@1 and 95% Recall@20, beating either modality alone.

𝗧𝗵𝗶𝘀 𝗶𝘀𝗻'𝘁 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 "𝗶𝗺𝗮𝗴𝗲𝘀 𝘃𝘀 𝘁𝗲𝘅𝘁" 𝘀𝘁𝗼𝗿𝘆. Text-based retrieval offers stronger precision and works for most content, while image-based retrieval can exploit visual structure and handles abstract visualizations far better. That is why, once again, (multimodal) hybrid search beats either option alone: you get the best of both worlds.

The most promising direction looks like agentic systems that dynamically weight text vs. image signals based on query characteristics: lean on image retrieval when visually grounded information is needed, and on text when you need keyword precision.
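To make the fusion idea concrete, here is a minimal sketch of multimodal hybrid search. It assumes you already have per-page relevance scores from a text retriever and an image retriever, normalized to a shared range. The function names (fuse_scores, choose_alpha) and the keyword-based query router are hypothetical illustrations, not the method from the paper.

```python
def fuse_scores(text_scores, image_scores, alpha=0.5):
    """Convex combination of two {page_id: score} dicts.

    alpha weights the text signal, (1 - alpha) the image signal;
    pages missing from one modality contribute 0.0 there.
    Assumes both score sets are normalized to a shared range.
    """
    pages = set(text_scores) | set(image_scores)
    fused = {p: alpha * text_scores.get(p, 0.0)
                + (1 - alpha) * image_scores.get(p, 0.0)
             for p in pages}
    # Return page ids ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

def choose_alpha(query):
    """Toy query router: lean toward image retrieval when the query
    mentions visual content, toward text for keyword-style queries."""
    visual_cues = ("figure", "plot", "diagram", "table", "chart")
    return 0.3 if any(cue in query.lower() for cue in visual_cues) else 0.7

# Toy example: the modalities disagree on the top page, but fusion
# surfaces the page both retrievers consider relevant.
text_scores = {"p1": 0.9, "p2": 0.6, "p3": 0.2}
image_scores = {"p2": 0.8, "p3": 0.7, "p1": 0.1}
alpha = choose_alpha("Which figure shows the ablation results?")
print(fuse_scores(text_scores, image_scores, alpha=alpha))  # p2 ranks first
```

A real system would replace the keyword router with a learned or LLM-based policy, but even a fixed alpha already captures the "best of both worlds" effect the benchmark measures.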
Paper: 𝗮𝗿𝘅𝗶𝘃.𝗼𝗿𝗴/𝗮𝗯𝘀/𝟮𝟲𝟬𝟮.𝟭𝟳𝟲𝟴𝟳









