Łukasz Borchmann

232 posts


@LukaszBorchmann

AI researcher at @Snowflake, coffee addict, @MonsterEnergy connoisseur, passionate about everything but LLMs' prompting. #deeplearning

Poznań, Poland · Joined June 2011

437 Following · 599 Followers

Pinned Tweet
Łukasz Borchmann@LukaszBorchmann·
1/10 Are agents navigating enterprise data strategically, or just stumbling until they get lucky? To answer this, we introduce MADQA, which benchmarks not just final answers but also search trajectories. A collab with @UniofOxford, @UNC, and @huggingface. 🧵
Łukasz Borchmann@LukaszBorchmann·
@yoavgo Since (1) more tokens → more compute → presumably higher accuracy with the same parameter count, and (2) smaller vocab → presumably lower accuracy; then perhaps they have some "functional" tokens, e.g., local context summary, tags, something similar to KV-memory, etc.
(((ل()(ل() 'yoav))))👾
claude 4.7 has a new tokenizer, and it splits words into *more* tokens than before? meaning they probably *decreased* the vocabulary size? this is really counterintuitive for me.
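To make the compute half of that argument concrete, here is a minimal sketch using the standard approximation that a dense decoder's forward pass costs roughly 2 × params FLOPs per token. The model size and token counts are made up for illustration; only the ratio matters.

```python
# Minimal sketch of the "more tokens -> more compute" argument above,
# using the common ~2 * params FLOPs-per-token approximation.
# All concrete numbers below are hypothetical.

def forward_flops(n_params: float, n_tokens: int) -> float:
    """Approximate FLOPs for a dense decoder to process a sequence."""
    return 2.0 * n_params * n_tokens

N_PARAMS = 70e9        # same model, same parameter count
TOKENS_COARSE = 1_000  # old tokenizer: fewer, longer tokens
TOKENS_FINE = 1_300    # new tokenizer: same text splits into ~30% more tokens

extra = forward_flops(N_PARAMS, TOKENS_FINE) / forward_flops(N_PARAMS, TOKENS_COARSE) - 1
print(f"extra compute per forward pass: {extra:.0%}")  # -> 30%
```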
Łukasz Borchmann@LukaszBorchmann·
@arankomatsuzaki The more stock options they hold, the more confident they are that Mythos will be a game-changer.
Aran Komatsuzaki@arankomatsuzaki·
Nearly 1/3 of surveyed people at Anthropic now think entry-level engineers and researchers are likely to be replaced by Mythos within 3 months
Łukasz Borchmann@LukaszBorchmann·
@OlgaraPixels @Igarashi5101 "Wood" is not the best example because it contains the relaxed, short /ʊ/ sound that does not exist in Polish. The Polish ó/u is pronounced like a slightly shorter version of "oo" in "boot".
OlgarasWorkshop@OlgaraPixels·
@Igarashi5101 Ó is like the English double-o sound (like in "wood") - Polish U is the same. Also yeah, Polish J is like the English Y - there's a couple letters like that, e.g., Ł is English W (also in "wood"), V doesn't appear in the Polish alphabet but the Polish W is the same sound, etc.
いがら氏🇵🇱@Igarashi5101·
Names of the letters of the Polish alphabet: A: "a" Ą: "on" B: "be" C: "ce" … J: "jot" ← what's this one doing here … Ó: "o kreskowane" ← ??!?! … Y: "igrek" ← …Grok? Z: "zet" Ź: "ziet" Ż: "żet" ← sounds just like Ź but apparently it's different
Łukasz Borchmann@LukaszBorchmann·
@part_harry_ The ML community actually explored this a lot in 2020. Early approaches like Reformer used LSH, and Routing Transformer used k-means (a classic MIPS/ANN approach).
Harry Partridge@part_harry_·
Sparse attention is just a form of approximate maximum inner product search. This is a widely researched field, with leading techniques like Hierarchical Navigable Small World graphs achieving ~log N time complexity. Meanwhile, DSA and even HISA are both still linear-time algorithms. It is interesting to me that the ML community doesn't draw more on the existing literature in this area.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This paper aside, crazy how much work is being done in sparse attention. Now that we know that DSA works (and well), it's clear that 1M contexts are just the appetizer. HISA, though, looks like DSA + good ideas of NSA. In the limit, cuts indexer costs by B. Plug and play! @_xjdr

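For readers unfamiliar with the routing idea mentioned above, here is a toy numpy sketch (not the Routing Transformer's actual code) of sparse attention as approximate MIPS: queries and keys are bucketed by k-means, and each query attends only to keys in its own cluster. All shapes and hyperparameters are illustrative.

```python
import numpy as np

def kmeans(x, n_clusters, iters=10, seed=0):
    """Plain k-means over the rows of x; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(iters):
        assign = ((x[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centroids[c] = x[assign == c].mean(0)
    return centroids, assign

def routed_attention(q, k, v, n_clusters=4):
    """Each query attends only to keys in its k-means cluster, shrinking
    the score matrix from O(N^2) to roughly O(N^2 / n_clusters)."""
    centroids, _ = kmeans(np.vstack([q, k]), n_clusters)
    q_assign = ((q[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    k_assign = ((k[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    out = np.zeros_like(q)
    for c in range(n_clusters):
        qi = np.where(q_assign == c)[0]
        ki = np.where(k_assign == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue  # empty bucket: those queries keep a zero output here
        scores = q[qi] @ k[ki].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(128, 32)) for _ in range(3))
print(routed_attention(q, k, v).shape)  # -> (128, 32)
```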
(((ل()(ل() 'yoav))))👾
TIL claude-code sometimes calculates the token count for a context by sending the request to haiku model and observing the resulting token count field in the response.
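A hedged sketch of what that trick might look like with the Anthropic Python SDK: send the context to a cheap Haiku model with max_tokens=1 and read the prompt's size off the response's usage field. The model id is an assumption, and newer SDK versions also expose a dedicated token-counting endpoint that avoids generating anything at all.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_tokens_via_haiku(context: str) -> int:
    """Measure prompt size by asking a cheap model to (barely) respond."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model id; any Haiku works
        max_tokens=1,                     # generate next to nothing
        messages=[{"role": "user", "content": context}],
    )
    return response.usage.input_tokens    # token count of the prompt we sent

print(count_tokens_via_haiku("How many tokens is this sentence?"))
```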
Łukasz Borchmann@LukaszBorchmann·
@BoWang87 Cool. I actually suggested this exact "embarrassingly simple" approach as a missing baseline for Anthropic's ICM paper last year: x.com/LukaszBorchman…
Łukasz Borchmann@LukaszBorchmann

@jiaxinwen22 I'm a bit surprised there is no self-training baseline (labels generated 0-shot used to fine-tune the model directly). Even with low 0-shot accuracy, mini-batch training could average out noise if the model's annotations are systematically consistent across similar examples.

Bo Wang@BoWang87·
Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better.

Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it.

Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels.

SSD sidesteps this by reshaping distributions in a context-dependent way: suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it.

The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick; it's recovering latent capacity that greedy decoding leaves on the table.

paper: arxiv.org/abs/2604.01193
code: github.com/apple/ml-ssd
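A minimal sketch of the recipe as the tweet describes it (not the paper's code): sample one unfiltered completion per prompt from the model itself, then fine-tune on those raw outputs. `model` and `tokenizer` are assumed to be any Hugging Face causal LM pair; batching, padding, and LR scheduling are omitted for brevity.

```python
import torch

def simple_self_distill(model, tokenizer, prompts, lr=1e-5):
    """SSD as described above: self-sample, no filtering, fine-tune."""
    # 1) Sample one solution per prompt from the current model.
    model.eval()
    samples = []
    with torch.no_grad():
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            out = model.generate(ids, do_sample=True, max_new_tokens=512)
            samples.append(tokenizer.decode(out[0], skip_special_tokens=True))

    # 2) Fine-tune on the raw outputs; deliberately no correctness filter.
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for text in samples:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```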
Łukasz Borchmann@LukaszBorchmann·
@SosnaArno @LandsknechtPike Except that Warsaw and Kraków have different regional pronunciations. You can distinguish a person from Warsaw and one from Kraków more easily than a person from Kraków and one from a village near Kraków.
Arno Sosna@SosnaArno·
That is not true for Polish. The city accents, like Warsaw and Kraków, are higher-class coded. Regional dialects, called 'gwara', are often lower-class coded. Speak Silesian to them and see how attitudes change. Social media is transforming the youth's language usage, but it is far from uniform.
Aristocratic Fury@LandsknechtPike·
In Eastern Europe communists tried very hard to erase cultural differences between classes and this left some lasting social consequences, which is why today there aren't such strong class distinctions in the way people speak. This is actually a very interesting topic and explains some of the cultural and political differences between Western Europe and Eastern Europe today.

For example, one of the things that communists did in Eastern Europe was that they tried to "mix" social classes together: when they built those huge apartment blocks they deliberately housed people from various social backgrounds in there. Their ideal was to have workers, (former) peasants, bureaucrats, teachers, intellectuals etc. all living together in the same buildings, same neighborhoods. Communists also purged the old elites from power, so members of old elites would end up mixed with people of lower classes in this way.

One specific example of this I can think of: years ago I was listening to an interview with the president of UEFA (governing body of European football) Aleksander Čeferin, who is from Slovenia, and he revealed that his father was a prestigious lawyer but their neighbor was a cleaning lady, and their apartments were exactly the same size. This is what communists in Eastern Europe were pushing for, having people of the (former) elite living next to the working class.

Because of this, certain cultural differences between classes were erased, including the way people speak. People who grew up in the same city or region would develop the same accent regardless of their social background. There would be no accent associated with the elite, because such a cultural elite no longer existed; the communist officials who were the new "elite" generally came from lower classes. Of course educated people would generally have a richer vocabulary, but they would still speak more or less the same as lower classes in terms of accent, pronunciation, intonation of words etc.

It's still like this today: in Eastern European countries (at least the ones I'm familiar with) you don't have anything equivalent to a "posh accent" or Received Pronunciation or whatever it's called in Britain. In Britain this accent is associated with upper classes, especially in some exaggerated version, and few people in the country speak like this in daily life. It's not regional but related to a specific class. In Eastern Europe, accents are regional; people from the same region speak the same regardless of their class. Of course there is a need to learn standard pronunciation of the national language for education purposes and to be better understood by people from other parts of the country, but this is not some snobbish class thing. Some rural dialects might be looked down upon, but those are regional differences, not class differences.

So yes, there are very few distinct class markers in Eastern Europe in terms of accent and the way people speak, especially in terms of economic class background. I think this is largely due to communists aggressively purging the culture of the upper classes.

The interesting thing is that the attempt of communists to erase cultural differences between classes had some completely unintended consequences. One could easily argue that this strengthened the sense of nationalism in the Eastern European countries, because it erased many distinctions between people within the same nation, and basically integrated the nation more strongly.

Before communists took power, Eastern European countries still had many internal divisions: remnants of old ruling classes, different ethnic groups, large rural populations etc. But communists made these societies much more homogeneous in every way. So even though they were trying to build something completely different, they just ended up completing 19th century nationalism, but in an even more radical way than it was done in Western Europe. As a result, Eastern European countries are more nationalistic and socially conservative today; there simply isn't a strong enough upper class that would be associated with cosmopolitan liberalism. Ironically, the communists made Eastern Europe more "reactionary", as they would say, in the long term.
Veronica, Collagen Scientist@celestialbe1ng

Pretty fascinating that in England the moment a person opens their mouth, you can read their entire life story. Where they went to school, what their family is like, their economic background, everything. A friend of mine says he literally can't date in London anymore bc the second a girl speaks, there's zero mystery. He already knows her postcode, her school and her dad's job. It's over before it starts.

In Poland, none of that exists. People speak the same regardless of where they grew up, where they went to school, what they do for a living. A homeless man and a university professor can sound nearly identical. Nobody can tell that I basically grew up in London. Nobody can tell I haven't lived here in years. There are no accent clues and no class markers, no education giveaways, nothing.

So getting to know someone here is a completely different experience. You actually have to be curious. You have to ask and you have to discover a person the old-fashioned way bc nothing about the way they speak is going to hand you the answers. Pretty cool

Remek Kinas@KinasRemek·
Poland 🇵🇱 … where?
Veronica, Collagen Scientist@celestialbe1ng·
My favourite thing about Poland is that you don't address strangers as "you." You say Pan, Pani, Państwo (Mr/Mrs/+this plural I can't translate): formal address is built into the grammar. Even in a shop, you'd say "Czy Państwo mają…" not "do you have…" and it isn't performative politeness but actually structural respect. There is no casual "you" for someone you haven't been invited to be familiar with.

When I do this, people often rush to correct me or rather announce familiarity. "Oh, don't call me Madame, call me Catherine." And I'll still address them formally until they give me clear permission to stop or until I decide I'm familiar and done with the formal.

Pure elegance. The kind that assumes every stranger deserves dignity before they've earned familiarity. The West abolished formality for uhhh friendliness. Poland kept it bc respect.
Łukasz Borchmann retweeted
Niels Rogge@NielsRogge·
Very cool to see MixedBread include our newly introduced MADQA, an agentic RAG benchmark, in its results. Shows that late-interaction models are a lot better than bi-encoders, but it will take time for the industry to adopt these
Mixedbread@mixedbreadai

For agentic tasks, Oracle-level performance is the maximum performance a system can achieve, assuming it is able to retrieve all relevant documents perfectly, every time. We're proud to show that Mixedbread Search approaches the Oracle on multiple knowledge-intensive benchmarks.

Łukasz Borchmann@LukaszBorchmann·
I can see the argument for engineering coverage of more challenging query types that might otherwise be underrepresented, but we opted for a different design choice: preserving the distribution humans naturally produce when looking at real documents, even if it skews toward extraction, etc. The intuition is that the natural skew itself is informative about what people actually need to know and how they might use document understanding systems. We may disagree on what makes a representative distribution, but that's exactly the kind of methodological tension that leads to better benchmarks.

Thanks for flagging the citation. It was previously pointing to the blog post. It has now been updated in the paper. Appreciate the constructive back-and-forth: both benchmarks improve the conversation! Perhaps for the next initiative, let's join forces so that we can agree on and lock in design choices from all perspectives and experience :)
Manuel Faysse@ManuelFaysse·
To be perfectly precise, the query type distribution is actually something that we strive for by design. We constructed a taxonomy of query types (extractive, binary, comparative, open-ended, enumerative) and then use a mix of human questions and synthetic questions to obtain a targeted distribution of all query types. In practice, both humans and LLMs otherwise tend to ask too many extractive questions during data annotation, which is not what is reflected by practical use cases. @antonio_loison can give more details if need be.

Having said that, I really appreciate the effort to modify the preprint and I don't aim to be adversarial here; modifying is more than we could have hoped for. Maybe a last point that was brought up is that the citation to ViDoRe V3 probably refers to the V2? The V3 is Loison et al., so maybe there was a confusion there? Congrats again on the solid work!
Niels Rogge@NielsRogge·
Look mom, I'm on a paper! Introducing an agentic RAG benchmark with 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Gemini-3 Pro with good old BM25 as a tool takes the lead, but large gaps with humans remain. I set up baselines with the Claude Agents SDK + CLIs (semtools by @llama_index) 🤗
DailyPapers@HuggingPapers

Strategic Navigation or Stochastic Search? New MADQA benchmark reveals that agents matching human accuracy on document QA rely on brute-force search to compensate for weak strategic planning. 2,250 questions over 800 PDFs expose a 20% gap to oracle performance.

Łukasz Borchmann@LukaszBorchmann·
@ManuelFaysse Following up as promised. You raised fair points, and we've addressed them in the revision: (1) Renamed the section from "Data Integrity" to "Data Provenance". The original heading read more confrontationally than intended. (2) The intro now distinguishes ViDoRe V1 from V3, explicitly acknowledging the contextually-blind generation and human verification pipeline. (3) Added a dedicated ViDoRe V3 assessment in the appendix, noting the human component is considerably stronger than most benchmarks in the same category. Our remaining distinction, aside from problem framing, is that a generative LLM still shapes the distribution of question types—a concern absent in fully human-authored questions. But that's a methodological preference, not a quality judgment. Updated revision will land on arXiv over the weekend. Hope this addresses your concerns!
Manuel Faysse@ManuelFaysse·
Hey Niels @LukaszBorchmann, awesome to see new benchmarks and cool work! As an author, I however wanted to clarify some framing around ViDoRe V3 "integrity" which is not quite accurate...

A key design principle we used to reduce extractive bias is that synthetic queries are "contextually blind", meaning they are generated by LLMs without access to the document and are simply conditioned on a short document description. We then filter out the majority of queries that are unanswerable given available document information. Other queries are human generated. All answers are then labeled and verified by multiple human annotators. In fact ViDoRe V3 is built with 12k hours of human annotation to guarantee quality, 10x more than here, and we really tried to have something that was better and more challenging than what human annotators would do alone. There is no bias towards MLLMs.

Another point is that ViDoRe V3 also contains multi-hop queries (203), in fact more than in MADQA, and could be used with agentic retrieval frameworks (but agreed this was not our main objective). I believe the efforts we made are reflected in the overall retrieval difficulty: it's quite hard to make query-document pairs realistic and non-trivial with good multimodal retrievers, as the MixedBread submission on the MADQA leaderboard from @aaxsh18 shows.

Again great work, but I got to defend the "integrity" of my work haha 😘
Łukasz Borchmann@LukaszBorchmann·
@avt_im @Deep_Burner In some sense, it is unfair because I need to put more effort into reviews while allowing others to review my work with the help of LLMs.
Łukasz Borchmann@LukaszBorchmann·
@avt_im @Deep_Burner I don't think this is how it works. I declared I am OK with using LLMs for reviews (Policy B), but was assigned to Policy A (LLMs prohibited).
Alexander Terenin@avt_im·
Sad times as an ICML AC: I've just learned that a mid-career faculty whose work I know and respect is getting their papers rejected for agreeing to the no-LLM reviewing policy, and then violating it.
Łukasz Borchmann@LukaszBorchmann·
@davezfr @NielsRogge But still. MMLU is knowledge-based, and MoE increases knowledge capacity compared to dense with the same number of (active) params.
Davide@davezfr·
@NielsRogge For those who don’t know: 119B is total params (MoE), not active. It only uses ~6.5B per token. So this isn’t really 119B vs 4B.
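The arithmetic behind that exchange, using only the numbers quoted in the tweets above:

```python
# Numbers from the tweets above: a 119B-total MoE activating ~6.5B params
# per token, compared against a 4B dense model.
total_params = 119e9   # all experts (knowledge capacity tracks this)
active_params = 6.5e9  # activated per token (per-token compute tracks this)
dense_params = 4e9     # the dense comparison point

print(f"active fraction: {active_params / total_params:.1%}")                # ~5.5%
print(f"per-token compute vs 4B dense: {active_params / dense_params:.1f}x") # ~1.6x
```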
Łukasz Borchmann@LukaszBorchmann·
@aaxsh18 Impressive! Especially when compared to Gemini File Search, which is 10 points weaker.
AK@_akhaliq·
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections paper: huggingface.co/papers/2603.12…
Łukasz Borchmann@LukaszBorchmann·
9/10 Humans score ~14.6 (highly calibrated, investing effort rationally). Best agents score ~22.9 (poorly calibrated, stuck in stochastic search loops). Scaling context gives models a bigger flashlight, but doesn't teach them to read the map.
Łukasz Borchmann@LukaszBorchmann·
2/10 Our setup enforces a strictly "closed-world" environment. Agents must iteratively search, gather, and reason over a large collection of documents. They cannot rely on external training data to guess answers; all evidence must be found strictly within the provided files.
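For illustration, the loop such an agent runs might be sketched like this. `llm`, `search_index`, and `read_page` are hypothetical stand-ins, not the MADQA harness API; the point is only that every action goes through corpus-bound tools and the final answer must come from gathered evidence.

```python
def closed_world_qa(llm, search_index, read_page, question, max_steps=20):
    """Hypothetical closed-world agent loop: every piece of evidence must
    come from the provided corpus, never from parametric memory."""
    evidence = []
    for _ in range(max_steps):
        action = llm.decide(question, evidence)  # search / read / answer
        if action.kind == "search":
            evidence.append(search_index.query(action.query))  # corpus-only
        elif action.kind == "read":
            evidence.append(read_page(action.doc_id, action.page))
        else:  # "answer": must be grounded in the gathered evidence
            return action.text
    return None  # search budget exhausted without an answer
```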