SID (@SID_AI) - Twitter Profili | Zamantika Mersobahis Locabet

Sabitlenmiş Tweet

SID@SID_AI·5 Ara

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

18

39

375

136.9K

SID retweetledi

Max Rumpf@maxrumpf·20 Ara

We improve both pass@1 AND pass@n during training. The issue is that lots of claimants: 1) train on domains with heavy mid/posttraining in the base models (math) 2) don't train for very long In many of these small-scale experiments, gains come from re-learning the format (paper's format vs model maker's). Most real RL benefits come quite late and much more slowly than is practical for academic researchers. Also: We had to learn the hard way that insights from small models don't generalize well to larger ones. Especially when the smaller ones weren't natively RL-trained (all small Qwen3 models for example).

Sasha Rush@srush_nlp

There is significant discussion in the academic literature about RL making models better at pass @1 and *worse* at pass@N (or related claims). We run a lot of RL runs at Cursor and don't see this issue systematically. Not doubting it occurs, but something else might be going on.

English

7

9

99

21.4K

SID retweetledi

Max Rumpf@maxrumpf·20 Ara

Label noise really matters in RL SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground truth data contains errors, the model will start overreporting in hopes of catching spurious targets. For one public dataset where the average number of ground truth docs is 2, the model starts reporting up to 8 -- most of which are bad. We created a custom dataset heavily controlled for noise. We use a mix of techniques, but this was quite expensive and cumbersome. The tesult: the model reports the correct number of docs. I assume this phenomenon generalizes to other tasks (math, code), but manifests differently: try/catch, broad types, etc.

SID@SID_AI

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

0

1

23

2.9K

Max Rumpf@maxrumpf·17 Ara

Most RL frameworks are fundamentally unstable. We wasted more H100 hours on debugging this than any other issue fornour multi-turn, multi-env RL run (below). When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This creates extremely unlikely tokens, which dominate the gradient and over time lead to collapse. The screenshots describe the mechanism in more detail. We tried a lot of interventions, but ended up reimplementing our environments to use token lists directly (Tokens-in/Tokens-out). This fixed it immediately. Always inspect logprobs!

SID@SID_AI

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

18

56

544

114.9K

SID@SID_AI·17 Ara

@maxrumpf our training repo isn't called TITO by chance

English

0

6

1.4K

SID@SID_AI·17 Ara

@yash_347 @ambitioninc there can only be one.

English

0

373

Yash More@onemoreyash·14 Ara

@SID_AI @ambitioninc u gotta be talking to them right?

English

1

0

2

417

SID@SID_AI·5 Ara

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

18

39

375

136.9K

SID retweetledi

Max Rumpf@maxrumpf·14 Ara

Good RL environments are much richer than you think. We evaluate training for 100 epochs and see eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, largely mitigating memorization (when inspecting train rollouts). Obfuscation et al. has the ability to extend to other domains. We posit that domains with a high share of environment tokens in the rollout are especially attractive candidates.

SID@SID_AI

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

6

17

147

19.7K

Max Rumpf@maxrumpf·13 Ara

We believe retrieval is the ideal playground for self-play RL on LLMs. SID-1 was trained with "pseudo self-play:" New questions were generated to cover gaps in model behavior as training progressed. We think we're not far away from closing that loop: Generating hard, verifiable questions and solving them within the same batch. This won't be easy, but given how self-play RL outperforms in chess and go, getting it right for LLMs will be nothing short of revolutionary. SID is hiring research and infrastructure engineers to work on this (and many more things).

SID@SID_AI

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

6

10

70

8.9K

SID@SID_AI·13 Ara

@samsja19 thank you! SID-1 is 14B. but we'll have a few more sizes soon (training rn).

English

1

0

6

297

samsja@samsja19·13 Ara

this is great work, they RL-ed qwen 4b into a strong and cheap retriever, their paper has a lot of great detail around multi turn rl, worth a read

SID@SID_AI

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and expensive ones (see chart). we trained SID-1 using multi-environment, multi-turn RL on Qwen. it was a lot of work (a lot of which is documented in our tech report -- see pinned tweet). our RL environments build on the idea that humans with search tools can find almost any information given sufficient iteration. like humans, SID-1 makes a first search, read the results, and adapts its strategy. and it can do this much faster *and* better than frontier LLMs: 24x faster than GPT-5.1, 27x faster than Gemini 3 Pro. the better part is critical! if a model is fast and wrong, it's just wrong. that's why we trained SID-1 until it was the most likely to deliver the correct results. bar none. we're partnering with a small number of companies today and have a waitlist for everyone else. (we don't have enough inference compute for everyone yet).

English

8

17

321

32.1K

SID@SID_AI·13 Ara

@maxrumpf Please see SID.ai for more details.

English

1

0

3

217

SID@SID_AI·6 Ara

@JoschkaBraun thinking about images or thinking in images?

English

0

4

2.2K

Joschka Braun@JoschkaBraun·6 Ara

@SID_AI I’m curious what the impact of going multimodal will be

English

1

0

5

2.5K

SID@SID_AI·6 Ara

@SidUnnithan @churroschia @ycombinator Granted.

English

0

39

Sid Unnithan@SidUnnithan·6 Ara

@churroschia @SID_AI @ycombinator Petition to rebrand from Evergrove to Jacob-AI

English

1

0

2

42

SID@SID_AI·5 Ara

@CarloWillem first of many!

English

0

1

1.3K

Carlo Kobe@CarloWillem·5 Ara

@SID_AI so cool

English

1

0

3

1.5K

SID@SID_AI·5 Ara

@binvnese2303 only took 80 years since vannevar bush first wrote about it. it won't take another 80.

English

0

6

1.7K

Ba Finn@binvnese2303·5 Ara

@SID_AI speedrun retrieval arc unlocked

English

1

0

4

2K

SID@SID_AI·5 Ara

SID-1: Tech Report It has way more detail than is prudent. sid.ai/research/sid-1…

English

1

2

11

865

SID@SID_AI·5 Ara

@SJDauncey cheaper than a reranker, too!

English

0

1

93

Sam Dauncey@SJDauncey·5 Ara

@SID_AI The reranker she tells you not to worry about

English

1

0

3

131

SID@SID_AI·5 Ara

@lotteseifert only the finest

English

0

5

2.8K

Lotte Seifert@lotteseifert·5 Ara

@SID_AI what a luxurious beige!

English

1

0

9

3K

SID retweetledi

Max Rumpf@maxrumpf·22 Eki

computer use and code gen progress is outpacing general intelligence improvements. why? you can easily create synthetic data for both. let me explain: if you have *more* high-quality data on a task, a model trained on that data will be better at it. currently, that data is mostly created by humans (for free on the internet or for money at scale ai). synthetic data asks the question: current models are smart, why don't we use their outputs as a source of high-quality data to train the next model generation? but if you train on all model-generated data, you will most likely run into model collapse* or slop. this is bad. what you need is a way to determine which of the data is "high-quality": what you want is a verifier. for general email writing skills, there really isn't a verifier (llm as a judge has its own problems). and not coincidentally we haven't seen much improvement in email writing skill. for code, we have a decent verifier: the compiler (+some static analysis tools). if the program compiles and maybe even runs inside of a sandbox, it's probably not awful – so we can include it as high-quality. this is probably enough verification. for math we have proof solving languages like lean. we can tell if a generated proof is "okay" by seeing if the lean solver accepts it. DeepSeek-Prover uses this technique effectively. they let the model iteratively train on the outputs of it's last generation – overseen by a verifier to discard bad data. for computer use, you can verify that the outputs are correct. if i tell the model to update a record in salesforce, i can then use the api to check if the record was updated correctly and discard all the runs in which the record wasn't updated correctly. importantly, there is no ceiling to how much synthetic data you can generate! if scaling laws hold, this implies there is no ceiling to the total performance on the task – it could be 1000x better than any human! another thing that will prove important: synthetic data doesn't just allow self-play it is also WAY cheaper than human-labelled data. at $50/h, a good human labeler costs ~$8000 per 1M output tokens. LLM generated data costs $0.1-10 per 1M tokens. only ai labs can afford humans, but any well-capitalized small startup can afford synthetic data! i wouldn't be surprized to see a $10M-raised startup deliver the best React-code generation model. my assumption is that by 2025, 99% of all code that was ever written (measured in tokens) will have been written by AI for synthetic data training runs. more generally: we will see model improvements continue to accelerate in tasks where a verifier exists, while progress will likely stagnate in tasks where no verifier exists. the "verifier gap" will become glaringly obvious. but luckily verifier gap also gives us a recipe: if you want better model performance for your task, just invent a verifier!

English

0

4

16

1.6K