SID

85 posts

@SID_AI

solving retrieval one model at a time | @ycombinator

San Francisco, CA · Joined December 2022
0 Following · 1.1K Followers
Pinned Tweet
SID
SID@SID_AI·
we just released our first model: SID-1

it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart).

we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet).

our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, reads the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro.

the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none.

we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet.)
18
41
394
150.8K
Max Rumpf
Max Rumpf@maxrumpf·
turbopuffer x SID

An easy way to tell a good AI researcher from a great one: how much they think about infrastructure.

Infra extends beyond what's running on the GPUs: slow environments will bottleneck your training steps. More parallel and powerful models make this problem worse.

RL environment specifics are usually secret, but we shared some details in a recent post with our friends at @turbopuffer.

Training great models requires great infrastructure, and we're excited to be working with the best.
Quoting turbopuffer (@turbopuffer):

SID-1 is an agentic search model by @SID_AI
→ 1.9x recall over RAG + rerank
→ 24x faster, 99% cheaper than GPT-5.1

trained using large-scale RL on turbopuffer at 1k+ QPS bursts over 10M+ document corpora across thousands of steps

tpuf.link/sid-1
14
18
212
36.2K
SID retweeted turbopuffer (@turbopuffer): the SID-1 post quoted above.
6
26
212
108.3K
SID
SID@SID_AI·
@maxrumpf all the best ones are.
0
0
2
48
SID
SID@SID_AI·
We now have the tools to solve search.
2
0
5
351
Max Rumpf
Max Rumpf@maxrumpf·
@trychroma We published our research in December and told Chroma's CEO Jeff. 4 months later, Chroma republished it without citing it. We think this sets a pretty bad precedent: x.com/maxrumpf/statu…
Max Rumpf@maxrumpf

Chroma's "new" model sure seems familiar. A story.

Imitation is the sincerest form of flattery. But there is a point where it goes from "inspiration" to whatever Context-1 is:

6 months ago, Chroma's CEO @jeffreyhuber asked us about our research. 4 months ago, we proudly shared SID-1's tech report with him. An exchange I now understand very differently (see the emails).

Today, they released a report heavily "inspired" by ours. Charts, datasets, methods, and the whole model itself. Down to the toggle for Figure 1 and our 4x RRF rollouts.

They never reached out to benchmark our model. Their claims of "pareto-optimality" ring hollow. They provably knew there was another model.

Unfortunately, we can't benchmark their model: while their weights are open, the harness they say one needs isn't yet.

I know Jeff well and our offices are neighbors. We shared a lot of insights in our tech report. Maybe more than prudent. But we believe in advancing human knowledge. (Making search better is our way of doing so.) We applaud companies like @thinkymachines that are brave enough to share the ideas that make the work possible.

But where do we go as a research community when we stop respecting each other's work? When we don't give credit where it's due? And trick "friends" into sharing more, just to steal it? While claiming moral high ground by calling this "open-source"?

This completely destroys any incentive for us (and others) to go into as much depth as we did in our tech report. It's sad to see the poor research practices that are common in academia making their way into startups.

Context-1 has some interesting ideas: pruning is clever. I wish I were writing about them.

Followers and copycats, even if they're bigger, don't scare us. I'm very proud of what we've built. And even more proud of who I'm building this with.

We're also hiring original thinkers.

1
8
66
8.2K
Chroma
Chroma@trychroma·
Introducing Chroma Context-1, a 20B parameter search agent.
> pushes the pareto frontier of agentic search
> order of magnitude faster
> order of magnitude cheaper
> Apache 2.0, open-source
141
404
4.2K
1.1M
SID retweeted Max Rumpf (@maxrumpf): the Context-1 post quoted in full above.
Quoting Chroma (@trychroma): the Context-1 announcement above.
27
34
517
72K
SID
SID@SID_AI·
@maxrumpf CUDA out of memory.
2
1
9
941
Max Rumpf
Max Rumpf@maxrumpf·
@SID_AI Infrastructure is research. Research is infrastructure.
1
0
8
457
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
We improve both pass@1 AND pass@n during training.

The issue is that lots of claimants:
1) train on domains with heavy mid/post-training in the base models (math)
2) don't train for very long

In many of these small-scale experiments, gains come from re-learning the format (paper's format vs model maker's). Most real RL benefits come quite late and much more slowly than is practical for academic researchers.

Also: we had to learn the hard way that insights from small models don't generalize well to larger ones. Especially when the smaller ones weren't natively RL-trained (all small Qwen3 models, for example).
Sasha Rush@srush_nlp

There is significant discussion in the academic literature about RL making models better at pass@1 and *worse* at pass@N (or related claims). We run a lot of RL runs at Cursor and don't see this issue systematically. Not doubting it occurs, but something else might be going on.

7
9
99
21.8K
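
Editor's sketch: for readers unfamiliar with the pass@1 versus pass@n debate in the two posts above, here is the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The "before" and "after" counts are invented to show what the literature's claim would look like (pass@1 up, pass@n down). They are not SID's or Cursor's results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples (out of 8) on two problems.
N_SAMPLES = 8
before_rl = [2, 2]  # occasionally right on both problems
after_rl = [8, 0]   # always right on one, never on the other

before_p1 = mean_pass_at_k(before_rl, N_SAMPLES, 1)
before_p8 = mean_pass_at_k(before_rl, N_SAMPLES, 8)
after_p1 = mean_pass_at_k(after_rl, N_SAMPLES, 1)
after_p8 = mean_pass_at_k(after_rl, N_SAMPLES, 8)
```

In this made-up case pass@1 rises while pass@8 falls, which is the pattern the academic literature describes. The posts above report not seeing it in long, large-scale runs.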
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
Label noise really matters in RL

SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground truth data contains errors, the model will start overreporting in hopes of catching spurious targets. For one public dataset where the average number of ground truth docs is 2, the model starts reporting up to 8 -- most of which are bad.

We created a custom dataset heavily controlled for noise. We use a mix of techniques, but this was quite expensive and cumbersome. The result: the model reports the correct number of docs.

I assume this phenomenon generalizes to other tasks (math, code), but manifests differently: try/catch, broad types, etc.
Quoting SID (@SID_AI): the SID-1 release announcement (pinned tweet above).
0
1
23
3.1K
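
Editor's sketch: a toy calculation of why noisy labels can reward overreporting. Everything here is an assumption for illustration. The reward (labeled hits minus a fixed penalty per reported doc), the noise model (each true label is swapped for one of several plausible distractors with probability `noise`), and all numbers are hypothetical. The post does not describe SID-1's actual reward.

```python
def label_probs(n_true: int, n_distractors: int, noise: float) -> list[float]:
    """Chance that each candidate doc appears in the (possibly noisy) labels.

    Toy noise model: each of the n_true labels is replaced, with probability
    `noise`, by a uniformly chosen distractor.
    """
    p_true = 1.0 - noise
    p_distractor = n_true * noise / n_distractors if n_distractors else 0.0
    return [p_true] * n_true + [p_distractor] * n_distractors


def best_report_size(probs: list[float], penalty: float) -> int:
    """Report size maximizing expected (labeled hits - penalty * reported docs).

    The reward is additive per doc, so the optimum is to report every doc
    whose chance of being labeled exceeds the per-doc penalty.
    """
    return sum(1 for p in probs if p > penalty)


PENALTY = 0.1  # hypothetical cost per reported document

clean_size = best_report_size(label_probs(2, 6, noise=0.0), PENALTY)
noisy_size = best_report_size(label_probs(2, 6, noise=0.4), PENALTY)
```

With clean labels the reward-maximizing policy reports only the 2 true documents. With 40% label noise, each of the 6 distractors has a large enough chance of being labeled that reporting all 8 becomes optimal under this toy reward, even though most of those are bad.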
Max Rumpf
Max Rumpf@maxrumpf·
Most RL frameworks are fundamentally unstable.

We wasted more H100 hours on debugging this than any other issue for our multi-turn, multi-env RL run (below).

When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This creates extremely unlikely tokens, which dominate the gradient and over time lead to collapse. The screenshots describe the mechanism in more detail.

We tried a lot of interventions, but ended up reimplementing our environments to use token lists directly (Tokens-in/Tokens-out). This fixed it immediately.

Always inspect logprobs!
Quoting SID (@SID_AI): the SID-1 release announcement (pinned tweet above).
18
57
548
121.9K
SID
SID@SID_AI·
@maxrumpf our training repo isn't called TITO by chance
0
0
6
1.4K
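
Editor's sketch: a minimal illustration of the retokenization mismatch described in the posts above. The four-entry vocabulary and the greedy longest-match `encode` are toy assumptions. Real BPE tokenizers and SID's training code work differently. The only point is that decoding sampled token IDs to text and re-encoding that text need not return the IDs the policy actually sampled.

```python
# Toy vocabulary; a token's ID is its index in this list (hypothetical).
VOCAB = ["un", "believ", "able", "unbeliev"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def decode(ids: list[int]) -> str:
    """Concatenate token strings, as a message-based environment would."""
    return "".join(VOCAB[i] for i in ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenizer (a stand-in for a real tokenizer)."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = [t for t in VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"cannot tokenize at position {pos}")
        match = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


# The policy sampled "un" + "believ" + "able" during the rollout.
sampled_ids = [0, 1, 2]

# Message-style round trip: token IDs -> text -> token IDs.
roundtrip_ids = encode(decode(sampled_ids))

# Same text, different tokens: the trainer would score a sequence the
# policy never sampled.
same_text = decode(roundtrip_ids) == decode(sampled_ids)
same_tokens = roundtrip_ids == sampled_ids
```

A tokens-in/tokens-out environment avoids the round trip by passing `sampled_ids` to the trainer unchanged, so the log-probabilities are computed on the same sequence that was sampled.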
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
Good RL environments are much richer than you think.

We evaluate training for 100 epochs and see eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, largely mitigating memorization (based on inspecting train rollouts).

Obfuscation and similar techniques could extend to other domains. We posit that domains with a high share of environment tokens in the rollout are especially attractive candidates.
Quoting SID (@SID_AI): the SID-1 release announcement (pinned tweet above).
6
18
145
19.8K
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
We believe retrieval is the ideal playground for self-play RL on LLMs.

SID-1 was trained with "pseudo self-play": new questions were generated to cover gaps in model behavior as training progressed.

We think we're not far away from closing that loop: generating hard, verifiable questions and solving them within the same batch. This won't be easy, but given how self-play RL outperforms in chess and Go, getting it right for LLMs will be nothing short of revolutionary.

SID is hiring research and infrastructure engineers to work on this (and many more things).
Quoting SID (@SID_AI): the SID-1 release announcement (pinned tweet above).
6
9
69
9K