Jimmy Lin

4.5K posts


@lintool

I profess CS-ly at @UWaterloo. Previously, I monkeyed code for @Twitter, slides for @Cloudera, and scienced for @yupp_ai.

Nearby data lake · Joined February 2010
864 Following · 15.2K Followers
Jimmy Lin retweeted
TREC RAG @ 2026 @TREC_RAG
Does retrieval help RAG, or did the LLM already memorize the answer? 🤔 Too often, the overlap between RAG corpora and what LLMs “know” is unclear. Better RAG evaluation needs tighter alignment between NLP and IR. 📚 That's why for RAG 2026 we are using @nvidia's ClimbMix corpus
1 reply · 6 reposts · 8 likes · 1.1K views
Jimmy Lin @lintool
But I think we can do better... what about zero parameters? Let me introduce you to something else that's awesome: It's called grep. arxiv.org/abs/2605.05242
0 replies · 1 repost · 0 likes · 226 views
Jimmy Lin @lintool
Since we're counting model parameters, let me introduce you to a two-parameter model for agentic search that's awesome: It's called BM25. I haven't tried it yet, but I think fp4 will work fine. arxiv.org/abs/2605.10848
1 reply · 3 reposts · 23 likes · 1.7K views
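The two parameters the tweet is counting are BM25's k1 and b. A minimal, self-contained sketch of the BM25 scoring function over a toy corpus (the corpus is made up for illustration; k1=0.9, b=0.4 are commonly used defaults, e.g. in Anserini):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Score one tokenized document against a query with BM25.
    k1 and b are the model's two free parameters."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# toy corpus of pre-tokenized documents
corpus = [["hybrid", "search", "rocks"],
          ["sparse", "retrieval", "with", "bm25"],
          ["dense", "retrieval", "with", "dpr"]]
score = bm25_score(["bm25"], corpus[1], corpus)
```

Only documents containing a query term receive any score, which is exactly the exact-match behavior dense retrievers were meant to fix and hybrid search brings back.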
Jimmy Lin @lintool
Thus, our conclusions: this, I believe, is the first demonstration of the need for hybrid search. Hence the claim that hybrid search is a @UWaterloo innovation. You're welcome! The broader lesson is that old baselines are still surprisingly important. Let's not forget them.
[attached image]
0 replies · 2 reposts · 10 likes · 4.1K views
Jimmy Lin @lintool
But that's not what we found: even with DPR, a dense-sparse hybrid with BM25 is significantly better than DPR alone. arxiv.org/abs/2104.05740
[attached image]
1 reply · 0 reposts · 5 likes · 686 views
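The dense-sparse hybrid described here is commonly implemented as weighted score fusion; a minimal sketch (the alpha weight and toy scores are illustrative, not the paper's setup, and real systems typically normalize scores first since BM25 and dense similarities live on different scales):

```python
def hybrid_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Fuse dense (e.g., DPR) and sparse (e.g., BM25) retrieval scores
    by weighted sum; alpha trades off the two signals."""
    docs = set(dense_scores) | set(sparse_scores)
    return {d: alpha * dense_scores.get(d, 0.0)
               + (1 - alpha) * sparse_scores.get(d, 0.0)
            for d in docs}

dense = {"doc1": 0.9, "doc2": 0.3}    # toy dense-retriever scores
sparse = {"doc2": 8.0, "doc3": 5.0}   # toy BM25 scores
fused = hybrid_fuse(dense, sparse, alpha=0.5)
best = max(fused, key=fused.get)      # the doc scored by both signals wins
```

The union over document IDs matters: a document retrieved by only one method still participates in the fused ranking, which is where hybrid gains over either method alone often come from.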
Jimmy Lin @lintool
I think @xueguang_ma is being too modest, so I'll provide context: he, along with @rpradeep42 and a UWaterloo undergrad (Kai Sun), popularized hybrid search in its current form. So, if you're using hybrid search today, thank them. 🙏 Yes, this is clickbait-y, so I'll support my claims 🧵
Xueguang Ma @xueguang_ma

This plot reminds me of my first IR work reproducing DPR in Pyserini, where we found BM25 is amazingly helpful when hybridized with a dense retriever. BM25 is never just a simple baseline: used the right way, it can easily outperform many fancy methods. BM25 was the most robust method shown in BEIR, the most effective and efficient method for long-context search shown in LongEmbed, and now @mattjustram and @xuzihuan4 show that BM25 can push search agents onto the best efficiency frontier. p.s. Pyserini and pi-serini are two different repos.

1 reply · 6 reposts · 41 likes · 5.1K views
Jimmy Lin retweeted
Jheng-Hong Yang @mattjustram
someone already wrote a love letter to pi, by @badlogicgames. so we wrote a love paper to pi :) with my teammates @xuzihuan4 and @lintool. a few days ago, i promised i’d share some fun plots once Pi-Serini joined the BrowseComp-Plus deep research agent party. now, it’s about time. here weeeee goooooo. bear with the sloppy images first. the serious one is at the end. the question was simple: how far can we push deep research with BM25 + pi? turns out: weirdly far.
5 replies · 11 reposts · 59 likes · 16K views
Jimmy Lin retweeted
TREC RAG @ 2026 @TREC_RAG
TREC RAG is returning for 2026! 🎉 This year’s iteration is special because agents 🤖 can join the fun… but what might agent-first community evaluation look like? 🧵👇
1 reply · 4 reposts · 6 likes · 702 views
Jimmy Lin retweeted
Tz-Huan Hsu @xuzihuan4
Does a lexical retriever suffice for agentic search when agents can keep refining their queries? As LLMs become more capable in agentic loops, agents can continuously refine their actions based on environmental feedback. We couldn’t help but ask the question above.
1 reply · 2 reposts · 19 likes · 1.5K views
Jimmy Lin @lintool
What I'm cooking up... 👨‍🍳
[attached image]
4 replies · 4 reposts · 58 likes · 5.1K views
Jimmy Lin retweeted
Zhuofeng Li @zhuofengli96475
🔥 Introducing Direct Corpus Interaction (DCI)! The best retriever for agentic search is no retriever. 🚀 We replaced the entire agentic search pipeline (embedding model, vector index, top-k retrieval) with only `grep` and `bash`. 🔧

📄 Paper: huggingface.co/papers/2605.05…

DCI unlocks the full agentic potential of Claude Sonnet 4.6: 69.0% → 80.0% on BrowseComp-Plus (+11.0, −$424).

💡 The Magic: The agent searches the raw corpus directly (`grep`, `find`, `bash`, shell pipelines), exactly like a coding agent navigating a codebase. No preprocessing. No embedding model. No vector index. No offline indexing.

📊 The Results: DCI outperforms top baselines across 13 benchmarks, with average gains of:
🔍 Agentic Search: +11.0%
🧠 Multi-hop QA: +30.7%
📈 IR Ranking: +21.5%

💡 Insights: Beyond accuracy, we conduct a series of controlled ablation studies to pinpoint the sources of DCI's gains. Specifically, we examine trajectory-level search, evidence utilization, corpus context management, and tool usage (RQ2-RQ6).

Try it yourself!
🛠️ Code: github.com/DCI-Agent/DCI-…
🤖 Demo: huggingface.co/spaces/DCI-Age…
🔎 Eval logs: huggingface.co/datasets/DCI-A…
[four attached images]
24 replies · 58 reposts · 229 likes · 37.2K views
Jimmy Lin @lintool
@s_gaweda Two criteria come to mind: (1) accuracy - did the agent do what the skill promises? (2) token efficiency - how many tokens did the agent have to burn?
1 reply · 0 reposts · 0 likes · 43 views
Bash @s_gaweda
@lintool How do you measure the success of a skill? Presumably any change should measurably improve the skill if you want to have any sort of confidence.
1 reply · 0 reposts · 0 likes · 30 views
Jimmy Lin @lintool
⁉️ What's the goal of code review for SKILLz? 🙋‍♂️ I'm interested in hearing your opinion on this: What are CR best practices for SKILLz?
1 reply · 0 reposts · 3 likes · 1K views
Jimmy Lin @lintool
Does this change if the SKILL is shared widely within an org? Does this change if an org uses multiple agents? Should I get Codex and Claude to iterate on the PR until they're both happy (and stay out of it)? What are emerging best practices here? ⁉️
1 reply · 0 reposts · 0 likes · 403 views
Jimmy Lin @lintool
Since it is unlikely that the SKILL.md etc. 📜 was actually written by a human, who am I to second-guess what an agent writes, mainly for another agent (or itself at a later point in time)? 🤖 I assume the human has already iterated with the agent to refine the SKILL?
1 reply · 0 reposts · 0 likes · 437 views
Jimmy Lin @lintool
I'm a bit late to the party ✨🎊🥳 but I suppose this is what AI psychosis looks like... All-time Codex token usage: 1,008,533,821 tokens across 563 threads.
[attached image]
0 replies · 0 reposts · 4 likes · 717 views
Jimmy Lin @lintool
As expected, Codex spent far longer cleaning the data 🗑️🔧 than building the initial interface... some things still never change.
0 replies · 0 reposts · 1 like · 349 views