Max Conti

27 posts

Max Conti
@mlpc123

Research in NLP @illuintech, CS MSc @epfl

Paris · Joined September 2024
82 Following · 40 Followers
Max Conti retweeted
paul
paul@pteiletche·
Great work from @weaviate_io comparing the performance of text retrievers and multimodal ones. It appears that their errors are complementary, which makes their combination in hybrid search promising. Check out their paper! arxiv.org/pdf/2602.17687
Victoria Slocum@victorialslocum

Most teams build RAG systems that only see text. But a lot of queries require visual understanding to answer correctly.

Most RAG systems treat PDFs the same way: OCR the text, chunk it, embed it, done. But that approach misses figures, tables, spatial layout, and visual relationships that are a core part of the data.

My colleagues at @weaviate_io just published IRPAPERS, a benchmark that directly compares text-based vs. image-based retrieval over 3,230 pages from 166 scientific papers.

The setup is straightforward: take the same PDFs and process them two ways. For text-based retrieval, run OCR with GPT-4.1, then embed with Arctic 2.0 + BM25 hybrid search. For image-based retrieval, embed the raw page images with ColModernVBERT multi-vector embeddings. Then test both on 180 needle-in-haystack questions targeting specific methodological details.

Text-based retrieval edges out images at the top rank (46% vs 43% Recall@1), but images match or exceed text at deeper recall levels (93% vs 91% Recall@20).

But these two approaches fail on different queries. At Recall@1:
• 22 queries succeed with text but fail with images
• 18 queries succeed with images but fail with text

This complementarity is what makes multimodal hybrid search so effective. By fusing scores from both text and image retrieval, they achieved 49% Recall@1 and 95% Recall@20, beating either modality alone.

This isn't a simple "images vs text" story. Text-based retrieval provides stronger precision and works for most content, but image-based retrieval can use visual structure and handles abstract visualizations far better. This is why, once again, (multimodal) hybrid search beats either option alone, getting you the best of both worlds.

The most promising direction seems to be agentic systems that dynamically weight text vs image signals based on query characteristics: emphasizing image retrieval when more visually grounded information is needed, and text when you need keyword precision.
Paper: arxiv.org/abs/2602.17687
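The score-fusion step described above can be sketched roughly as follows. The min-max normalization and the equal 0.5/0.5 weights are illustrative assumptions for this sketch, not the paper's exact recipe.

```python
# Toy sketch of multimodal hybrid search via weighted score fusion.
# Normalization scheme and weights are assumptions, not the paper's setup.

def min_max_normalize(scores):
    """Scale a {page_id: score} map to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 1.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}

def fuse(text_scores, image_scores, w_text=0.5, w_image=0.5):
    """Combine per-page scores from a text retriever and an image retriever."""
    t = min_max_normalize(text_scores)
    i = min_max_normalize(image_scores)
    pages = set(t) | set(i)
    fused = {p: w_text * t.get(p, 0.0) + w_image * i.get(p, 0.0) for p in pages}
    return sorted(pages, key=fused.get, reverse=True)

# Pages the text retriever likes vs pages the image retriever likes:
text_scores = {"A": 12.0, "B": 9.0, "C": 3.0}
image_scores = {"B": 0.9, "A": 0.2, "C": 0.1}
print(fuse(text_scores, image_scores))  # → ['B', 'A', 'C']
```

The agentic variant mentioned in the thread would amount to choosing `w_text` and `w_image` per query instead of fixing them globally.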

Max Conti retweeted
Manuel Faysse
Manuel Faysse@ManuelFaysse·
Most practitioners would agree that text embeddings should be "contextual", i.e. they should encode a passage with respect to the wider scope of the entire document the passage stems from; "They beat the British" could refer to football or French history without further context...

In ConTEB (arxiv.org/abs/2505.24782), we highlight the standard failure modes of embedding models on retrieval tasks that require context to be properly embedded. We also propose a training strategy that extends standard "late chunking" to teach models to infuse embeddings with just the right amount of contextual knowledge to optimize retrieval.

Super happy to see new work by @perplexity_ai on contextual embedding models. They evaluate on ConTEB and use our in-sequence contrastive loss, along with a ton of cool techniques in multiple phases of training. Love the work @bo_wangbo and will read it in detail. Super happy to see one more stone on the path toward contextual embedding models, already traveled by @hxiao and @jxmnop!

Link to the paper: arxiv.org/abs/2602.11151…
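The "late chunking" idea referenced above can be sketched in a few lines: embed the whole document in one pass so every token vector is contextualized by the full document, then pool token vectors per chunk. The toy random encoder below is a stand-in assumption for a real long-context embedding model.

```python
# Illustrative late-chunking sketch. encode_document() is a toy stand-in
# for a real long-context transformer encoder (an assumption of this sketch).

import numpy as np

def encode_document(num_tokens, dim=8, seed=0):
    """Stand-in for a transformer: one contextualized vector per token."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_tokens, dim))

def late_chunk(token_embeddings, chunk_spans):
    """Mean-pool document-contextualized token vectors over each chunk span."""
    return np.stack([
        token_embeddings[start:end].mean(axis=0)
        for start, end in chunk_spans
    ])

token_embs = encode_document(num_tokens=100)
spans = [(0, 40), (40, 80), (80, 100)]   # chunk boundaries in token space
chunk_vecs = late_chunk(token_embs, spans)
print(chunk_vecs.shape)  # → (3, 8): one context-aware vector per chunk
```

The contrast with standard chunking is that each chunk vector here is pooled from tokens that attended to the entire document, so "They beat the British" inherits the surrounding context before pooling.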
Max Conti retweeted
Macé Quentin
Macé Quentin@MaceQuent1·
Very proud of this paper; we ran many more experiments since the first release. I think we made a pretty complete analysis of what is currently possible with the benchmark! Thanks again to everyone involved!
António Loison@antonio_loison

Two months ago, we released the ViDoRe V3 benchmark. Now, the full paper is out! 📄arxiv.org/abs/2601.08620 Here is a recap of the benchmark along with a comprehensive breakdown of our findings on multimodal RAG. 🧵 (1/N)

Max Conti retweeted
Manuel Faysse
Manuel Faysse@ManuelFaysse·
In our EMNLP 2025 Oral paper with @mlpc123, we propose an extension to Late Chunking and demonstrate how we can embed contextual information within passage embeddings... and why it's often very useful to improve document retrieval! (9/15) arxiv.org/abs/2505.24782
Max Conti retweeted
paul
paul@pteiletche·
And happy to see our dear ModernVBERT competing with much larger models on it!
Manuel Faysse
Manuel Faysse@ManuelFaysse·
If you are at EMNLP in China and interested in retrieval, @mlpc123 will be presenting our work (Oral) on how to propagate information from the entire document to a given passage embedding in order to better contextualize it! Don't hesitate to reach out to him!
Max Conti
Max Conti@mlpc123·
@JinaAI_ Why did you organize the BoF at the same time as an Information Extraction and Retrieval Oral session 😢 I'll be presenting our work that builds on top of Late Chunking there instead :)
Jina AI
Jina AI@JinaAI_·
And time flies: this is already our 3rd EMNLP BoF on retrieval models, following Singapore 2023 and Miami 2024! If you've never attended a BoF before, think of it as a mini-workshop where everyone can jump in and share their work. Our BoF is an in-person session that brings together researchers and practitioners working on retrieval models.
Jina AI
Jina AI@JinaAI_·
In 2 weeks, we're presenting at #EMNLP2025 and hosting a BoF on Embeddings, Rerankers, and Small LMs for Better Search, again! Come check out our research on training data for multi-hop reasoning, multimodal embeddings, and where retrieval models are headed in 2025/26. Say hi to our team and learn about the opportunities.
Max Conti
Max Conti@mlpc123·
I'll be in Suzhou next week to present this project as an Oral at #EMNLP2025! 🥳 Let me know if you're there and wanna get in touch, or if you know anyone who'd be interested :) Looking forward! 🙌🇨🇳 x.com/mlpc123/status…
Max Conti@mlpc123

🕺Super happy to release our latest work with @ManuelFaysse: in our paper "Context Is Gold to Find the Gold Passage", we share all our findings on how to train embedding models to meaningfully include doc-wide context into chunks - leading to convincing results! 🧑‍🍳 🧵1/N

Max Conti
Max Conti@mlpc123·
Besides our main results, it was also really interesting to look at some understudied training dynamics in more detail. Our findings suggest that we've been using visual encoders far below their potential, and I think we can expect a lot of improvements building on top of this!
Max Conti retweeted
Manuel Faysse
Manuel Faysse@ManuelFaysse·
🚨 Should We Still Pretrain Encoders with Masked Language Modeling? We have recently seen massively trained causal decoders take the lead in embedding benchmarks, surpassing encoders with bidirectional attention. We revisit whether BERT-style encoders are a thing of the past. (1/N)
Max Conti retweeted
Manuel Faysse
Manuel Faysse@ManuelFaysse·
Dear Reviewer #2, the placeholder URL "hf.co / <anonymous_url>" does lead to a non-existent page, as you so correctly note as the main paper weakness, but don't fret: you have everything zipped in the uploaded materials a few pixels down.
Max Conti retweeted
Manuel Faysse
Manuel Faysse@ManuelFaysse·
🚨 Context matters for effective retrieval—but most embedding models cannot leverage crucial information outside of the passage they embed. Our new paper "Context Is Gold to Find the Gold Passage" explores how context-aware embeddings can be trained to boost performance! 🧵(1/N)
Max Conti
Max Conti@mlpc123·
Huge thanks to @ManuelFaysse for the mentorship on this project, to @antoine_chaffin for his help with Late Interaction models, and to everyone else who helped along the way! Stoked to publish my first paper and looking forward to the next projects 🕺
Max Conti
Max Conti@mlpc123·
🕺Super happy to release our latest work with @ManuelFaysse: in our paper "Context Is Gold to Find the Gold Passage", we share all our findings on how to train embedding models to meaningfully include doc-wide context into chunks - leading to convincing results! 🧑‍🍳 🧵1/N