Pierre Colombo

647 posts

@PierreColombo6

Omni-modal AI researcher. Creator of ColPali, BidirLM-omi, EuroBERT & EuroLLM. Ex-Prof@ Centrale (Paris-Saclay) · Ex-CSO@ https://t.co/ncQ9gT1TzM (legaltech - SaulLM)

Joined October 2020
1.1K Following · 534 Followers
Pinned Tweet
Pierre Colombo @PierreColombo6
🚀 Introducing SaulLM-141B and SaulLM-54B: The First Open Family of Legal Models. After #SaulLM-7B, the family is growing! We are proud to unveil the latest innovations from our team: the SaulLM-141B and 54B generative AI models, specifically designed for the legal domain.
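A hypothetical loading sketch for the new models; the thread doesn't name the Hugging Face repos, so the model id below is a placeholder, not a released checkpoint name.

```python
# Placeholder sketch: substitute the actual SaulLM repo id from the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/SaulLM-54B-Instruct"  # hypothetical id, not confirmed by the thread
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the doctrine of consideration in contract law."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```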
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
🚨 Do LLMs need to store everything they read in memory? To reduce KV cache size and improve decoding speed, we propose Self-Pruned KV attention, a mechanism where the model learns to decide which KVs to write to the persistent KV cache, discarding all the rest! @AIatMeta🧵
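The tweet describes the mechanism but not its interface, so here is a minimal single-head, single-token sketch of the idea as stated: a learned gate (kv_gate, an illustrative name) decides whether the current token's key/value pair is written to the persistent cache.

```python
# Minimal sketch of self-pruned KV caching for one decoding step.
# q, k, v: (1, d) projections of the current token; cache_k/cache_v: (T, d).
# kv_gate is assumed to be a small learned module, e.g. torch.nn.Linear(d, 1).
import torch
import torch.nn.functional as F

def decode_step(q, k, v, cache_k, cache_v, kv_gate, threshold=0.5):
    # Attend over the kept history plus the current token itself.
    keys = torch.cat([cache_k, k], dim=0)
    values = torch.cat([cache_v, v], dim=0)
    attn = F.softmax(q @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    out = attn @ values

    # Learned write decision: persist this token's KV only if the gate fires;
    # otherwise it is discarded and never revisited.
    if torch.sigmoid(kv_gate(k)).item() > threshold:
        cache_k = torch.cat([cache_k, k], dim=0)
        cache_v = torch.cat([cache_v, v], dim=0)
    return out, cache_k, cache_v
```

How aggressively the cache is pruned would then be something the gate learns during training, not a fixed hyperparameter.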
Pierre Colombo retweeted
Nicolas Boizard @N1colAIs
@JinaAI_ has hopped on the omnimodal train🚂 They just dropped a collection of two Omni embedding models (0.9B & 2B). Similar to BidirLM, they seem to rely on the Qwen modality head for the larger one, while sticking with EuroBERT for the nano version 🥰 huggingface.co/collections/ji…
Pierre Colombo retweeted
Nicolas Boizard @N1colAIs
BidirLM-Omni is on MTEB and Sentence Transformers! huggingface.co/spaces/mteb/le…
🥇 #1 Open-Source Model on MTEB (#15 overall)
🖼️ #1 across all sizes on MIEB (Image)
🎧 #1 sub-7B model on MAEB (Audio, #2 overall)
Small size, massive performance, fully open.
Model: huggingface.co/BidirLM
tomaarsen @tomaarsen

BidirLM-Omni-2.5B-Embedding is live: a single bidirectional encoder that embeds text, images, and audio into the same space! Three modalities, all in one 2048-dim space. 🧵

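If the model loads as a Sentence Transformers model, as the leaderboard listing suggests, usage would look roughly like this; the hub id and PIL-image support are assumptions based on how other multimodal Sentence Transformers models behave.

```python
# Assumed usage: hub id and image support are not confirmed by the thread.
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding")  # assumed id

text_emb = model.encode("a photo of a cat")       # one 2048-dim vector
image_emb = model.encode(Image.open("cat.jpg"))   # same space, per the release
print(model.similarity(text_emb, image_emb))      # cosine similarity by default
```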
Pierre Colombo retweeted
DailyPapers @HuggingPapers
BERT-as-a-Judge: a robust alternative to rigid lexical matching for LLM evaluation. It matches the performance of LLM-as-a-Judge at a fraction of the computational cost.
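A rough sketch of how an encoder-as-judge is typically used: a small cross-encoder classifies whether a candidate answer matches the reference, replacing exact string matching. The model id is a placeholder (the real checkpoints live in the collection linked below, which is truncated here), and the label order is an assumption.

```python
# Placeholder sketch of encoder-based judging; model id and label order assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "path/to/bert-judge"  # placeholder, not the released repo name
tok = AutoTokenizer.from_pretrained(model_id)
judge = AutoModelForSequenceClassification.from_pretrained(model_id)

# Score a candidate answer against the question and gold reference.
inputs = tok(
    "Question: What is the capital of France? Reference: Paris",
    "Candidate: It's Paris, of course.",
    return_tensors="pt",
)
with torch.no_grad():
    probs = judge(**inputs).logits.softmax(dim=-1)
print(float(probs[0, 1]))  # assumed: index 1 = "answer is correct"
```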
Pierre Colombo retweeted
Orion Weller @orionweller
Encoders are so much better for classification, so why not use them for judging? Awesome study from @N1colAIs - cool to see a 210M BERT model beating much larger Qwen and Gemma models.
Nicolas Boizard @N1colAIs

What’s inside the release:
🔌 Plug & play BERT-as-a-judge model: huggingface.co/collections/ar…
🛠️ Support to train your own custom evaluators: github.com/artefactory/BE…
📄 Study on the limits of lexical methods: arxiv.org/pdf/2604.09497

Pierre Colombo @PierreColombo6
Evaluation is underrated. If your eval signal is noisy, you're flying blind. BERT-as-a-Judge gives you a fast, cheap way to improve your signal-to-noise ratio without spinning up a full LLM judge. Exactly the kind of infra work that compounds. @gisship @N1colAIs congrats!
Nicolas Boizard @N1colAIs

🎉 Second paper this month! Introducing BERT-as-a-Judge (x @gisship) ⚖️ Evaluating LLMs with rigid lexical methods often marks right answers as wrong due to bad formatting. While "LLM-as-a-Judge" solves this, it remains costly & slow. Our fix? A lightweight, encoder-driven approach.

Pierre Colombo retweeted
Nicolas Boizard @N1colAIs
🎉 Second paper this month! Introducing BERT-as-a-Judge (x @gisship) ⚖️ Evaluating LLMs with rigid lexical methods often marks right answers as wrong due to bad formatting. While "LLM-as-a-Judge" solves this, it remains costly & slow. Our fix? A lightweight, encoder-driven approach.
Pierre Colombo retweeted
Niklas Muennighoff @Muennighoff
There's a wave of omni embedding models (Gemini, Nemotron, BidirLM). Excited to support this trend with our multimodal MTEB versions (MIEB, MAEB) - video coming soon🎥
Nicolas Boizard @N1colAIs

🚀 New model family release with an OMNIMODAL version! After EuroBERT, I'm excited to introduce BidirLM, a family of 5 frontier bidirectional encoders, including an OMNIMODAL encoder at just 2.5B parameters. 🧵👇 huggingface.co/BidirLM

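Running a model against MTEB with the mteb package looks roughly like the sketch below; the image/audio track names (MIEB, MAEB) come from the tweet and may not match the library's exact identifiers yet, so this sticks to text tasks.

```python
# Sketch of an MTEB evaluation run; the model id is assumed as above.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding")  # assumed id

tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```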
Pierre Colombo retweeted
Antoine Chaffin @antoine_chaffin
The world needs more encoders. Turning decoders into encoders is a very strong path forward considering the edge of public decoders (see Ettin and previous work from Nicolas). Happy to see more work towards this in the omni setup, and public models too!! Can’t wait to try them out
Nicolas Boizard @N1colAIs

🚀 New model family release with an OMNIMODAL version! After EuroBERT, I'm excited to introduce BidirLM, a family of 5 frontier bidirectional encoders, including an OMNIMODAL encoder at just 2.5B parameters. 🧵👇 huggingface.co/BidirLM

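For readers new to the decoder-to-encoder recipe: architecturally the conversion is just the attention mask, as the toy sketch below shows. The actual conversions (Ettin and related work) also continue training with bidirectional objectives, which this omits.

```python
# Toy illustration: the same attention becomes "encoder-like" by dropping
# the causal mask, after which token states can be pooled into an embedding.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal: bool):
    scores = q @ k.T / k.shape[-1] ** 0.5
    if causal:  # decoder: token i attends only to tokens <= i
        T = scores.shape[0]
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(6, 64)                  # 6 tokens, 64-dim toy states
h = attention(x, x, x, causal=False)    # bidirectional: full context
embedding = h.mean(dim=0)               # (64,) mean-pooled representation
```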
Pierre Colombo retweeted
Nicolas Boizard @N1colAIs
🚀 New model family release with an OMNIMODAL version! After EuroBERT, I'm excited to introduce BidirLM, a family of 5 frontier bidirectional encoders, including an OMNIMODAL encoder at just 2.5B parameters. 🧵👇 huggingface.co/BidirLM
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
Most practitioners would agree that text embeddings should be "contextual", i.e. they should encode a passage w.r.t. the wider scope of the entire document it stems from; "They beat the British" could refer to football or French history without further context...

In ConTEB (arxiv.org/abs/2505.24782), we highlight the standard failure modes of embedding models on retrieval tasks that require context to be properly embedded. We also propose a training strategy that extends standard "late chunking" to teach models to infuse embeddings with just the right amount of contextual knowledge to optimize retrieval.

Super happy to see new work by @perplexity_ai on contextual embedding models. They eval on ConTEB and use our in-sequence contrastive loss, along with a ton of cool techniques across multiple phases of training. Love the work @bo_wangbo and will read it in detail, but super happy to see one more stone on the path towards contextual embedding models, already traveled by @hxiao and @jxmnop!

Link to the paper: arxiv.org/abs/2602.11151…
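A minimal sketch of the "late chunking" idea the thread builds on: encode the whole document once so every token sees full context, then pool each chunk's token states, instead of embedding chunks in isolation. The encoder and chunk boundaries below are illustrative.

```python
# Late chunking sketch: contextual chunk embeddings from one full-document pass.
import torch
from transformers import AutoModel, AutoTokenizer

name = "BAAI/bge-small-en-v1.5"  # any encoder works; this one is illustrative
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

doc = "They beat the British. The final whistle blew at the World Cup."
inputs = tok(doc, return_tensors="pt")
with torch.no_grad():
    states = enc(**inputs).last_hidden_state[0]  # (T, d), full-document context

# Naive chunk boundaries in token space; real systems use sentence boundaries.
T = states.shape[0]
chunks = [(0, T // 2), (T // 2, T)]
chunk_embs = torch.stack([states[a:b].mean(dim=0) for a, b in chunks])
print(chunk_embs.shape)  # each chunk vector now carries document-level context
```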
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In August, I joined FAIR at Meta in @hjegou's group for an end-of-thesis internship. I can't talk much for the moment about what we have been doing (hint: not retrieval), but it's very exciting and I am having lots of fun working with great people! (13/15)
Pierre Colombo retweeted
Manuel Faysse @ManuelFaysse
In a follow-up project, we carefully investigate the differences between Masked Language Modeling (encoder) and Next Token Prediction (decoder) objectives to produce text representations and uncover many nice insights into training efficiency. (11/15) arxiv.org/abs/2507.00994
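To make the comparison concrete, here is a toy side-by-side of the two objectives on one sequence; shapes and the fixed mask are illustrative, not the paper's setup.

```python
# MLM vs. NTP on the same sequence, with random logits standing in for a model.
import torch
import torch.nn.functional as F

vocab, T = 100, 8
tokens = torch.randint(0, vocab, (T,))
logits = torch.randn(T, vocab)  # per-position output distribution (pre-softmax)

# Next Token Prediction (decoder): position t predicts token t+1, causal context.
ntp_loss = F.cross_entropy(logits[:-1], tokens[1:])

# Masked Language Modeling (encoder): loss only on masked positions, which see
# bidirectional context; a fixed mask stands in for BERT-style 15% random masking.
mask = torch.zeros(T, dtype=torch.bool)
mask[::4] = True
mlm_loss = F.cross_entropy(logits[mask], tokens[mask])
print(float(ntp_loss), float(mlm_loss))
```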