Pierre Colombo

632 posts

@PierreColombo6

Associate Professor at CentraleSupelec (Paris Saclay) - CSO https://t.co/TxJBsM6y4N - NLP/Law

Joined October 2020
1K Following · 527 Followers
Pinned Tweet
Pierre Colombo @PierreColombo6
🚀 Introducing SaulLM-141B and SaulLM-54B: The First Open Family of Legal Models. After #SaulLM-7B the family is growing! We are proud to unveil the latest innovations from our team: the SaulLM-141B and 54B generative AI models, specifically designed for the legal domain.
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
Most practitioners would agree that text embeddings should be "contextual", i.e., they should encode a passage w.r.t. the wider scope of the entire document the passage stems from; "They beat the British" could refer to football or French history without further context... In ConTEB (arxiv.org/abs/2505.24782), we highlight the standard failure modes of embedding models on retrieval tasks that require context to be properly embedded. We also propose a training strategy that extends standard "late chunking" to teach models to infuse embeddings with just the right amount of contextual knowledge to optimize retrieval. Super happy to see some new work by @perplexity_ai on contextual embedding models. They eval on ConTEB and use our in-sequence contrastive loss, along with a ton of cool techniques in multiple phases of training. Love the work, @bo_wangbo, and will read in detail, but super happy to see one more stone towards contextual embedding models, on the path already traveled by @hxiao and @jxmnop! Link to the paper: arxiv.org/abs/2602.11151…
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In August, I joined FAIR at Meta in @hjegou's group for an end-of-thesis internship. I can't talk much for the moment about what we have been doing (hint: not retrieval), but it's very exciting and I am having lots of fun working with great people! (13/15)
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In a follow-up project, we carefully investigate the differences between Masked Language Modeling (encoder) and Next Token Prediction (decoder) objectives to produce text representations and uncover many nice insights into training efficiency. (11/15) arxiv.org/abs/2507.00994
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
2025 was also the year of encoder models! In the EuroBert project led by @N1colAIs and @gisship, we trained a series of multilingual bidirectional encoders up to 2025 standards. This led to an acceptance at COLM! (10/15) arxiv.org/abs/2503.05500
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
Closing the series on Visual Retrieval, we managed in ModernVBert to match the original ColPali model performance with a model 10x smaller by carefully revisiting all steps of the training process and uncovering cool insights in the process! (8/15) arxiv.org/abs/2510.01149
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In our EMNLP 2025 Oral paper with @mlpc123, we propose an extension to Late Chunking and demonstrate how we can embed contextual information within passage embeddings... and why it's often very useful to improve document retrieval! (9/15) arxiv.org/abs/2505.24782
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
As models became very good at visual retrieval, we needed better benchmarks! We revisited the entire data annotation process in ViDoRe V2, and were further able to scale the quantity and quality of the annotation thanks to a collaboration with Nvidia in ViDoRe V3. The V3 paper will be out very soon! @MaceQuent1 @antonio_loison (7/15) huggingface.co/blog/QuentinJG…
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
We also had fun with @pteiletche experimenting with Agentic Visual Document Retrieval. We documented some initial attempts, but I believe this type of thing will become very powerful once OpenAI etc. make reasoning with images simpler in the API. (6/15) huggingface.co/blog/paultltc/…
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
Notably, we experimented with omni-modal retrieval (retrieving images but also text, audio, videos) and trained ColQwen-Omni. (5/15) huggingface.co/blog/manu/colq…
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
ColPali, and more broadly visual document retrieval, has been one of the main focuses of my PhD, and with the team at @illuintech, we continued to improve the code repository, supporting new models and features! (4/15) github.com/illuin-tech/co…
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
I was also part of another ICLR paper: MMTEB, a community initiative to extend retrieval benchmarking to many languages. We contributed datasets and novel retrieval confidence metrics inspired by our "Trustworthy Reranking" paper with @gisship (3/15) arxiv.org/abs/2502.13595
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In January came some good conference-related news: CroissantLLM was accepted at TMLR, ColPali was accepted at ICLR. Going to ICLR a few months later was amazing and I was able to meet a ton of great people! (2/15) arxiv.org/abs/2407.01449
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
2025 was a year of transition for me, as I wrapped up many PhD projects with great collaborators, then joined @AIatMeta in August to work on unrelated but very exciting things. A thread where I quickly go over some of my work from the year (1/15) 🧵
Pierre Colombo @PierreColombo6
Amazing to see multi-vector image retrieval getting the spotlight! ColPali — developed by @ManuelFaysse — is featured in this new course from @DeepLearningAI and @qdrant_engine @AndrewYNg
DeepLearning.AI @DeepLearningAI

🚀 New short course with @qdrant_engine: Multi-vector Image Retrieval. Taught by @LukawskiKacper, Senior Developer Advocate at Qdrant, the course shows how multi-vector techniques outperform single-vector methods by matching text tokens to image patches directly. You’ll implement ColBERT to understand multi-vector search, apply ColPali for patch-level image retrieval, reduce memory with quantization and pooling, and use MUVERA to enable fast HNSW search. The course concludes with a full multi-modal RAG pipeline built on ColPali and MUVERA. Learn more and enroll now: hubs.la/Q03XCQZ10

Pierre Colombo retweeted
Nicolas Boizard @N1colAIs
What great work by @cmpatino_ and the @huggingface teams! So happy to see that ULD Loss inspired the idea of their GOLD method and that they solved the early sequence and vocabulary alignment issues we faced. This recognition is the greatest I could have hoped for when I first introduced ULD Loss at the start of my thesis 🤗
Carlos Miguel Patiño @cmpatino_

On-policy distillation is a promising way to train small models, but it’s usually limited to teacher–student pairs sharing the same tokenizer. With our GOLD method, you can now distill across different model families and even outperform GRPO! huggingface.co/spaces/Hugging…

Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
🚨 We release ModernVBert - a small bidirectional ModernBert encoder trained to process image inputs alongside text. When finetuned for document retrieval, it matches the original ColPali performance on ViDoRe with almost 10x fewer parameters (250M). (1/N)