Jimmy Lin

4.5K posts


@lintool

I profess CS-ly at @UWaterloo. Previously, I monkeyed code for @Twitter, slides for @Cloudera, and scienced for @yupp_ai.

Nearby data lake · Joined February 2010
864 Following · 15.2K Followers
Jimmy Lin retweeted
TREC RAG @ 2026 @TREC_RAG
Does retrieval help RAG, or did the LLM already memorize the answer? 🤔 Too often, the overlap between RAG corpora and what LLMs “know” is unclear. Better RAG evaluation needs tighter alignment between NLP and IR. 📚 That's why for RAG 2026 we are using @nvidia's ClimbMix corpus
1 reply · 6 reposts · 8 likes · 1.1K views
Jimmy Lin @lintool
But I think we can do better... what about zero parameters? Let me introduce you to something else that's awesome: It's called grep. arxiv.org/abs/2605.05242
0 replies · 1 repost · 0 likes · 226 views
Jimmy Lin @lintool
Since we're counting model parameters, let me introduce you to a two-parameter model for agentic search that's awesome: It's called BM25. I haven't tried it yet, but I think fp4 will work fine. arxiv.org/abs/2605.10848
1 reply · 3 reposts · 23 likes · 1.7K views
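The two parameters the tweet is counting are BM25's k1 and b. A minimal, self-contained sketch of the BM25 scoring function over a toy corpus (the corpus is made up for illustration; k1=0.9, b=0.4 are commonly used defaults, e.g. in Anserini):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Score one tokenized document against a query with BM25.
    k1 and b are the model's two free parameters."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# toy corpus of pre-tokenized documents
corpus = [["hybrid", "search", "rocks"],
          ["sparse", "retrieval", "with", "bm25"],
          ["dense", "retrieval", "with", "dpr"]]
score = bm25_score(["bm25"], corpus[1], corpus)
```

Only documents containing a query term receive any score, which is exactly the exact-match behavior dense retrievers were meant to fix and hybrid search brings back.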
Jimmy Lin @lintool
Thus, our conclusions: this, I believe, is the first demonstration of the need for hybrid search. Hence the claim that hybrid search is a @UWaterloo innovation. You're welcome! The broader lesson is that old baselines are still surprisingly important. Let's not forget them.
[attached image]
0 replies · 2 reposts · 10 likes · 4.1K views
Jimmy Lin @lintool
But that's not what we found: even with DPR, a dense-sparse hybrid with BM25 is significantly better than DPR alone. arxiv.org/abs/2104.05740
[attached image]
1 reply · 0 reposts · 5 likes · 686 views
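The dense-sparse hybrid described here is commonly implemented as weighted score fusion; a minimal sketch (the alpha weight and toy scores are illustrative, not the paper's setup, and real systems typically normalize scores first since BM25 and dense similarities live on different scales):

```python
def hybrid_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Fuse dense (e.g., DPR) and sparse (e.g., BM25) retrieval scores
    by weighted sum; alpha trades off the two signals."""
    docs = set(dense_scores) | set(sparse_scores)
    return {d: alpha * dense_scores.get(d, 0.0)
               + (1 - alpha) * sparse_scores.get(d, 0.0)
            for d in docs}

dense = {"doc1": 0.9, "doc2": 0.3}    # toy dense-retriever scores
sparse = {"doc2": 8.0, "doc3": 5.0}   # toy BM25 scores
fused = hybrid_fuse(dense, sparse, alpha=0.5)
best = max(fused, key=fused.get)      # the doc scored by both signals wins
```

The union over document IDs matters: a document retrieved by only one method still participates in the fused ranking, which is where hybrid gains over either method alone often come from.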
Jimmy Lin @lintool
I think @xueguang_ma is being too modest, so I'll provide context: he, along with @rpradeep42 and a UWaterloo undergrad (Kai Sun), popularized hybrid search in its current form. So, if you're using hybrid search today, thank them. 🙏 Yes, this is clickbait-y, so I'll support my claims 🧵
Xueguang Ma @xueguang_ma

This plot reminds me of my first IR work reproducing DPR in Pyserini, where we found BM25 is amazingly helpful when hybridized with a dense retriever. BM25 is never just a simple baseline: used the right way, it can easily outperform many fancy methods. BM25 was the most robust method shown in BEIR, the most effective and efficient method for long-context search shown in LongEmbed, and now @mattjustram and @xuzihuan4 show that BM25 can push search agents onto the best efficiency frontier. p.s. Pyserini and pi-serini are two different repos.

1 reply · 6 reposts · 41 likes · 5.1K views
Jimmy Lin retweeted
Jheng-Hong Yang @mattjustram
someone already wrote a love letter to pi, by @badlogicgames. so we wrote a love paper to pi :) with my teammates @xuzihuan4 and @lintool. a few days ago, i promised i’d share some fun plots once Pi-Serini joined the BrowseComp-Plus deep research agent party. now, it’s about time. here weeeee goooooo. bear with the sloppy images first. the serious one is at the end. the question was simple: how far can we push deep research with BM25 + pi? turns out: weirdly far.
5 replies · 11 reposts · 59 likes · 16K views
Jimmy Lin retweeted
TREC RAG @ 2026 @TREC_RAG
TREC RAG is returning for 2026! 🎉 This year’s iteration is special because agents 🤖 can join the fun… but what might agent-first community evaluation look like? 🧵👇
1 reply · 4 reposts · 6 likes · 702 views
Jimmy Lin retweeted
Tz-Huan Hsu @xuzihuan4
Does a lexical retriever suffice for agentic search when agents can keep refining their queries? As LLMs become more capable in agentic loops, agents can continuously refine their actions based on environmental feedback. We couldn’t help but ask the question above.
1 reply · 2 reposts · 19 likes · 1.5K views
Jimmy Lin @lintool
What I'm cooking up... 👨‍🍳
[attached image]
4 replies · 4 reposts · 58 likes · 5.1K views
Jimmy Lin retweeted
Zhuofeng Li @zhuofengli96475
🔥 Introducing Direct Corpus Interaction (DCI)! The best retriever for agentic search is no retriever. 🚀 We replaced the entire agentic search pipeline (embedding model, vector index, top-k retrieval) with only `grep` and `bash`. 🔧

📄 Paper: huggingface.co/papers/2605.05…

DCI unlocks the full agentic potential of Claude Sonnet 4.6: 69.0% → 80.0% on BrowseComp-Plus (+11.0, −$424).

💡 The Magic: The agent searches the raw corpus directly (`grep`, `find`, `bash`, shell pipelines), exactly like a coding agent navigating a codebase. No preprocessing. No embedding model. No vector index. No offline indexing.

📊 The Results: DCI outperforms top baselines across 13 benchmarks, with average gains of:
🔍 Agentic Search: +11.0%
🧠 Multi-hop QA: +30.7%
📈 IR Ranking: +21.5%

💡 Insights: Beyond accuracy, we conduct a series of controlled ablation studies to pinpoint the sources of DCI's gains. Specifically, we examine trajectory-level search, evidence utilization, corpus context management, and tool usage (RQ2-RQ6).

Try it yourself!
🛠️ Code: github.com/DCI-Agent/DCI-…
🤖 Demo: huggingface.co/spaces/DCI-Age…
🔎 Eval logs: huggingface.co/datasets/DCI-A…
[four attached images]
24 replies · 58 reposts · 229 likes · 37.2K views
Jimmy Lin @lintool
@s_gaweda Two criteria come to mind: (1) accuracy - did the agent do what the skill promises? (2) token efficiency - how many tokens did the agent have to burn?
1 reply · 0 reposts · 0 likes · 43 views
Bash @s_gaweda
@lintool How do you measure the success of a skill? Presumably any change should measurably improve the skill if you want to have any sort of confidence.
1 reply · 0 reposts · 0 likes · 30 views
Jimmy Lin @lintool
⁉️ What's the goal of code review for SKILLz? 🙋‍♂️ I'm interested in hearing your opinion on this: What are CR best practices for SKILLz?
1 reply · 0 reposts · 3 likes · 1K views
Jimmy Lin @lintool
Does this change if the SKILL is shared widely within an org? Does this change if an org uses multiple agents? Should I get Codex and Claude to iterate on the PR until they're both happy (and stay out of it)? What are emerging best practices here? ⁉️
1 reply · 0 reposts · 0 likes · 403 views
Jimmy Lin @lintool
Since it is unlikely that the SKILL.md etc. 📜 was actually written by a human, who am I to second-guess what an agent writes, mainly for another agent (or itself at a later point in time)? 🤖 I assume the human has already iterated with the agent to refine the SKILL?
1 reply · 0 reposts · 0 likes · 437 views
Jimmy Lin @lintool
I'm a bit late to the party ✨🎊🥳 but I suppose this is what AI psychosis looks like... All-time Codex token usage: 1,008,533,821 tokens across 563 threads.
[attached image]
0 replies · 0 reposts · 4 likes · 717 views
Jimmy Lin @lintool
As expected, Codex spent far longer cleaning the data 🗑️🔧 than building the initial interface... some things still never change.
0 replies · 0 reposts · 1 like · 349 views