SID

72 posts


@SID_AI

solving retrieval one model at a time | @ycombinator

San Francisco, CA · Joined December 2022
0 Following · 845 Followers
Pinned Tweet
SID
SID@SID_AI·
we just released our first model: SID-1.

it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart).

we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (much of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, reads the results, and adapts its strategy.

and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none.

we're partnering with a small number of companies today and have a waitlist for everyone else (we don't have enough inference compute for everyone yet).
[4 image attachments]
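The search → read → adapt loop described above can be sketched roughly as follows. This is a toy illustration, not SID's implementation: `search` stands in for any lexical/embedding search backend, and `reformulate` stands in for the model rewriting its query after reading the results.

```python
# Toy sketch of iterative retrieval: search, read the results, adapt the
# query, repeat. All names are illustrative stand-ins.

def search(query, corpus):
    """Toy lexical search: rank docs by query-term overlap."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def iterative_retrieve(question, corpus, reformulate, max_turns=4):
    """Search, read the results, and adapt the strategy until something hits."""
    query = question
    for _ in range(max_turns):
        hits = search(query, corpus)
        if hits:                      # "reads the results"
            return hits
        query = reformulate(query)    # "adapts its strategy"
    return []
```

The point of the loop, as in the tweet, is that sufficient iteration recovers information a single embedding lookup misses.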
18 replies · 39 reposts · 375 likes · 136.9K views
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
We improve both pass@1 AND pass@n during training. The issue is that lots of claimants: 1) train on domains with heavy mid-/post-training in the base models (math), and 2) don't train for very long. In many of these small-scale experiments, gains come from re-learning the format (the paper's format vs. the model maker's). Most real RL benefits come quite late and much more slowly than is practical for academic researchers.

Also: we had to learn the hard way that insights from small models don't generalize well to larger ones -- especially when the smaller ones weren't natively RL-trained (all small Qwen3 models, for example).
[image attachment]
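For context, the pass@k being debated here is usually computed with the standard unbiased estimator from the code-generation evaluation literature: draw n samples per problem, count c correct, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples, c of them correct."""
    if n - c < k:
        # every size-k subset of the n samples contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The claimed failure mode is then concrete: RL can raise pass@1 (average c/n) while diversity collapse leaves hard problems with c = 0, lowering pass@n.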
Sasha Rush@srush_nlp

There is significant discussion in the academic literature about RL making models better at pass@1 and *worse* at pass@N (or related claims). We do a lot of RL runs at Cursor and don't see this issue systematically. Not doubting it occurs, but something else might be going on.

7 replies · 9 reposts · 99 likes · 21.4K views
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
Label noise really matters in RL.

SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground-truth data contains errors, the model starts overreporting in hopes of catching spurious targets. For one public dataset where the average number of ground-truth docs is 2, the model starts reporting up to 8 -- most of which are bad.

We created a custom dataset heavily controlled for noise, using a mix of techniques; this was quite expensive and cumbersome. The result: the model reports the correct number of docs.

I assume this phenomenon generalizes to other tasks (math, code) but manifests differently: try/catch, broad types, etc.
[2 image attachments]
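One way to see the overreporting incentive (a sketch, not SID's actual reward): under a recall-only reward with noisy targets, padding the report with extra docs never costs anything, so the model inflates it; a set-F1 style reward restores pressure toward the right count.

```python
def recall_reward(reported, target):
    """Recall-only reward: overreporting is never penalized."""
    return len(set(reported) & set(target)) / max(len(target), 1)

def f1_reward(reported, target):
    """Set-F1 reward: padding the report lowers precision, so it costs."""
    reported, target = set(reported), set(target)
    tp = len(reported & target)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(reported), tp / len(target)
    return 2 * precision * recall / (precision + recall)
```

Under `recall_reward`, reporting 8 docs to catch 2 noisy targets scores exactly as well as reporting the right 2; under `f1_reward` it scores 0.4 instead of 1.0.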
SID@SID_AI

[quoted tweet: the SID-1 launch announcement, shown in full in the pinned tweet above]

0 replies · 1 repost · 23 likes · 2.9K views
Max Rumpf
Max Rumpf@maxrumpf·
Most RL frameworks are fundamentally unstable. We wasted more H100 hours debugging this than any other issue for our multi-turn, multi-env RL run (below).

When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This creates extremely unlikely tokens, which dominate the gradient and over time lead to collapse. The screenshots describe the mechanism in more detail.

We tried a lot of interventions, but ended up reimplementing our environments to use token lists directly (tokens-in/tokens-out). This fixed it immediately. Always inspect logprobs!
[4 image attachments]
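The mechanism can be demonstrated with a toy greedy tokenizer: decoding a token sequence to text and re-encoding that text need not round-trip to the same ids, so the policy gets scored on tokens it never sampled. (Toy vocabulary, not any real tokenizer.)

```python
# Toy greedy BPE-style tokenizer showing that encode(decode(ids)) != ids.
VOCAB = {"ab": 0, "a": 1, "b": 2}

def encode(text):
    """Greedy longest-match tokenization."""
    ids, i = [], 0
    while i < len(text):
        for tok, tid in sorted(VOCAB.items(), key=lambda kv: -len(kv[0])):
            if text.startswith(tok, i):
                ids.append(tid)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable text at position {i}")
    return ids

def decode(ids):
    inv = {tid: tok for tok, tid in VOCAB.items()}
    return "".join(inv[t] for t in ids)

# The model may sample "a", "b" as two tokens (ids [1, 2]); a messages-based
# env returns the *text* "ab", which re-encodes to the single merged id [0].
assert encode(decode([1, 2])) == [0]
```

Under the sampled ids [1, 2], the retokenized [0] is an extremely unlikely continuation, which is exactly the kind of token the tweet says dominates the gradient; passing token lists end to end (tokens-in/tokens-out) sidesteps the lossy text round-trip.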
SID@SID_AI

[quoted tweet: the SID-1 launch announcement, shown in full in the pinned tweet above]

18 replies · 56 reposts · 544 likes · 114.9K views
SID
SID@SID_AI·
@maxrumpf our training repo isn't called TITO by chance
[image attachment]
0 replies · 0 reposts · 6 likes · 1.4K views
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
Good RL environments are much richer than you think. We evaluated training for 100 epochs and saw eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, which largely mitigates memorization (confirmed by inspecting train rollouts). Obfuscation and related techniques should extend to other domains; we posit that domains with a high share of environment tokens in the rollout are especially attractive candidates.
[image attachment]
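A toy sketch of the obfuscation idea (hypothetical; a real pipeline would presumably rename entities with a generator rather than a counter): rename answer entities consistently across docs, question, and answer each epoch, so a memorized answer string stops paying off while the underlying retrieval task is unchanged.

```python
def obfuscate(docs, question, answer, entities, epoch):
    """Consistently rename answer entities per epoch, so a memorized answer
    no longer scores while the retrieval task itself stays the same."""
    mapping = {e: f"ENTITY_{epoch}_{i}" for i, e in enumerate(entities)}
    def sub(text):
        for old, new in mapping.items():
            text = text.replace(old, new)
        return text
    return [sub(d) for d in docs], sub(question), sub(answer)
```

Because the renaming is consistent within an epoch, the answer is still findable in the documents; because it changes between epochs, verbatim recall of last epoch's answer earns nothing.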
SID@SID_AI

[quoted tweet: the SID-1 launch announcement, shown in full in the pinned tweet above]

6 replies · 17 reposts · 147 likes · 19.7K views
Max Rumpf
Max Rumpf@maxrumpf·
We believe retrieval is the ideal playground for self-play RL on LLMs. SID-1 was trained with "pseudo self-play": new questions were generated to cover gaps in model behavior as training progressed. We think we're not far from closing that loop: generating hard, verifiable questions and solving them within the same batch. This won't be easy, but given how self-play RL outperforms in chess and Go, getting it right for LLMs will be nothing short of revolutionary. SID is hiring research and infrastructure engineers to work on this (and much more).
[image attachment]
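A minimal sketch of why retrieval suits this loop (all names hypothetical; a real proposer would be an LLM, not a string slice): questions generated from a known source document carry their ground truth by construction, so the solver's reward is verifiable for free within the same batch.

```python
# Hypothetical proposer/solver loop for retrieval self-play: the target
# doc set is known by construction, so no external labels are needed.

def make_question(doc_id, corpus):
    """Propose a question whose verifiable ground truth is {doc_id}."""
    opening = " ".join(corpus[doc_id].split()[:4])
    return f"which document begins: '{opening}'?", {doc_id}

def solver_reward(reported_ids, target_ids):
    """Verifiable reward: exact match on the target doc set."""
    return 1.0 if set(reported_ids) == target_ids else 0.0
```

The hard part the tweet alludes to is making the proposer generate questions that are hard for the current solver, not trivially extractive like this stand-in.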
SID@SID_AI

[quoted tweet: the SID-1 launch announcement, shown in full in the pinned tweet above]

6 replies · 10 reposts · 70 likes · 8.9K views
SID
SID@SID_AI·
@samsja19 thank you! SID-1 is 14B. but we'll have a few more sizes soon (training rn).
1 reply · 0 reposts · 6 likes · 297 views
SID
SID@SID_AI·
@JoschkaBraun thinking about images or thinking in images?
0 replies · 0 reposts · 4 likes · 2.2K views
Joschka Braun
Joschka Braun@JoschkaBraun·
@SID_AI I’m curious what the impact of going multimodal will be
1 reply · 0 reposts · 5 likes · 2.5K views
SID
SID@SID_AI·
@binvnese2303 only took 80 years since vannevar bush first wrote about it. it won't take another 80.
0 replies · 0 reposts · 6 likes · 1.7K views
Ba Finn
Ba Finn@binvnese2303·
@SID_AI speedrun retrieval arc unlocked
1 reply · 0 reposts · 4 likes · 2K views
SID
SID@SID_AI·
@SJDauncey cheaper than a reranker, too!
0 replies · 0 reposts · 1 like · 93 views
Sam Dauncey
Sam Dauncey@SJDauncey·
@SID_AI The reranker she tells you not to worry about
1 reply · 0 reposts · 3 likes · 131 views
SID retweeted
Max Rumpf
Max Rumpf@maxrumpf·
computer use and code gen progress is outpacing general intelligence improvements. why? you can easily create synthetic data for both. let me explain:

if you have *more* high-quality data on a task, a model trained on that data will be better at it. currently, that data is mostly created by humans (for free on the internet, or for money at scale ai). synthetic data asks the question: current models are smart, so why don't we use their outputs as a source of high-quality data to train the next model generation?

but if you train on all model-generated data, you will most likely run into model collapse or slop. this is bad. what you need is a way to determine which of the data is "high-quality": what you want is a verifier.

for general email-writing skills, there really isn't a verifier (llm-as-a-judge has its own problems). and not coincidentally, we haven't seen much improvement in email-writing skill.

for code, we have a decent verifier: the compiler (plus some static analysis tools). if the program compiles, and maybe even runs inside a sandbox, it's probably not awful -- so we can include it as high-quality. this is probably enough verification.

for math, we have proof languages like lean. we can tell if a generated proof is "okay" by seeing if the lean checker accepts it. DeepSeek-Prover uses this technique effectively: they let the model iteratively train on the outputs of its last generation, overseen by a verifier that discards bad data.

for computer use, you can verify that the outputs are correct. if i tell the model to update a record in salesforce, i can then use the api to check whether the record was updated correctly and discard all the runs in which it wasn't.

importantly, there is no ceiling to how much synthetic data you can generate! if scaling laws hold, this implies there is no ceiling to total performance on the task -- it could be 1000x better than any human!

another thing that will prove important: synthetic data doesn't just allow self-play, it is also WAY cheaper than human-labelled data. at $50/h, a good human labeler costs ~$8000 per 1M output tokens. llm-generated data costs $0.1-10 per 1M tokens. only ai labs can afford humans, but any well-capitalized small startup can afford synthetic data! i wouldn't be surprised to see a $10M-raised startup deliver the best React-code-generation model.

my assumption is that by 2025, 99% of all code ever written (measured in tokens) will have been written by AI for synthetic-data training runs. more generally: model improvements will continue to accelerate on tasks where a verifier exists, while progress will likely stagnate on tasks without one. the "verifier gap" will become glaringly obvious. but the verifier gap also gives us a recipe: if you want better model performance on your task, just invent a verifier!
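The salesforce example reduces to a simple pattern (a sketch with made-up rollout records, not any real API): run many rollouts, check the resulting state with a verifier, and keep only the rollouts that pass as training data.

```python
# Sketch of verifier-filtered synthetic data: keep only rollouts whose
# end state a verifier accepts. The rollout dicts below are illustrative.

def filter_verified(rollouts, verify):
    """Keep rollouts the verifier accepts as 'high-quality'."""
    return [r for r in rollouts if verify(r)]

# Toy stand-in for "check via the api that the record was updated":
rollouts = [
    {"trace": "...", "record": {"stage": "Closed Won"}},   # did the task
    {"trace": "...", "record": {"stage": "Prospecting"}},  # failed silently
]
verified = filter_verified(
    rollouts, lambda r: r["record"]["stage"] == "Closed Won"
)
```

The filter is the whole trick: generation is unbounded and cheap, so the quality of the kept data is set by the verifier, not by the generator.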
0 replies · 4 reposts · 16 likes · 1.6K views
Max Rumpf
Max Rumpf@maxrumpf·
three letter AI companies that start with S are really having their day
1 reply · 0 reposts · 4 likes · 960 views
SID
SID@SID_AI·
@maxrumpf Not too long of a lunch break though…
1 reply · 0 reposts · 3 likes · 3.9K views
Max Rumpf
Max Rumpf@maxrumpf·
SF life hack: if chatgpt is down, there's no line at tartine's
[2 image attachments]
11 replies · 20 reposts · 732 likes · 132.2K views