Grad
@Grad62304977
3.8K posts
Joined October 2020
2.7K Following · 9.1K Followers

Grad @Grad62304977
@tszzl Ya, ppl seem to forget that the only difference between CoT and something like looped transformers is that CoT doesn't pass the final hidden state forward (it passes the token it produced instead). All other info in all other layers is passed along as usual, and in neuralese.

Grad retweeted
Prime Intellect @PrimeIntellect
The next step toward automating AI is automating RL environments.
Introducing General-Agent: a fully synthetic environment whose task corpus self-evolves and grows harder over time.
4,504 tool-use tasks · 1,040 domains · 8,159 unique tools

Grad @Grad62304977
Always painful to see wasted data in the form of evals that no one uses. Every couple of months, the open source community should take all the evals that no one uses and allow them to be trained on without anyone being shamed.

Grad @Grad62304977
Ya, but this is holistic. Like, to say specifically that this chunk of tokens or turn was bad or not is hard. If ur not careful u could kill exploration. But also in general, like, what if the model called a tool to look at an “unrelated file”, but then after reading this file it realised which file it needs to look at? Was it wrong for the model to make the “wrong” tool call?

Tim Kostolansky @thkostolansky
@Grad62304977 @stochasticchasm
> But very little on the actual solving the task side is intermediate and verifiable
could just have a model reflect on why it was bad? or an ensemble?

Grad @Grad62304977
Ya, my main issue with it is that it’s hard to handle this PI information generally and at scale, and to balance it. Bcs if it’s too good then ur sort of cooked. The way Cursor did it was nice, but I fear it’s very hard to provide PI info reliably to solve the task correctly without running into issues.

Chinmay @ChinmayKak
@Grad62304977 im assuming you also don't like OPSD for the PI formulation is quite tricky/not clear? i agree with your post

Grad @Grad62304977
Cool to see work on this, although I think many misunderstood the main work here.

Making credit assignment work is mainly held back by the fact that, when the objective is to solve a task, it’s rare to have intermediate parts of a rollout that are directly verifiable. There is no correct intermediate tool call to make, for example.

Now for something like model communication and code style it’s different. Like if we want no failed tool calls (more code style than something to solve the task), then this is a verifiable intermediate task. Same with model communication, like if we want the model to communicate its progress over time outside thinking.

Main question is if u can use credit assignment to actually improve the performance of the model on solving tasks correctly. We’re currently quite bullish on some directions here that are elegant and general, and hopefully can push some stuff out on this soon.
Cursor @cursor_ai

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. For example, we use text feedback during RL to learn faster by assigning credit in rollouts spanning hundreds of thousands of tokens.
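
The "no failed tool calls" case from Grad's post above is one of the few intermediate signals that can be checked per turn. A minimal sketch of what that could look like, assuming a hypothetical `Turn` record with a `failed` flag and an illustrative penalty of 0.1 (neither comes from Cursor's or anyone else's actual training setup): the style penalty is assigned to the exact turn that failed, while the task outcome is only known at the end of the rollout.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    tool_name: str
    failed: bool  # hypothetical flag: the call errored (bad arguments, nonzero exit, ...)

def rollout_rewards(turns, outcome_reward, penalty=0.1):
    """Per-turn rewards: a style penalty on each failed call,
    plus the task outcome reward credited to the final turn."""
    rewards = [-penalty if t.failed else 0.0 for t in turns]
    if rewards:
        rewards[-1] += outcome_reward
    return rewards

turns = [Turn("read_file", False), Turn("run_tests", True), Turn("edit_file", False)]
rewards = rollout_rewards(turns, outcome_reward=1.0)
```

Note that this only localizes the style signal. Whether the `read_file` call helped solve the task is still not verifiable from the turn alone, which is the hard part the post describes.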


Grad retweeted
Prime Intellect @PrimeIntellect
Automating AI research is the next major step in AI.
We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute.
~10k runs, ~14k H200 hours.
Opus now holds the record at 2930 steps vs the 2990 human baseline.

Grad retweeted
Prime Intellect @PrimeIntellect
Introducing Renderers.
RL trainers work in tokens. Environments work in messages. Going back and forth corrupts sampled tokens, wasting compute on every agentic turn. With Renderers, we fix this mismatch. This unlocks >3x throughput on popular open models.
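
To make the token/message mismatch concrete, here is a toy illustration. It assumes a made-up four-entry vocabulary and a greedy longest-match tokenizer; it is not Prime Intellect's Renderers code and does not use any real tokenizer. The point is only that decoding sampled tokens to text and re-encoding the text can yield a different token sequence than the one the policy actually sampled.

```python
# Toy vocabulary (an assumption for illustration only).
VOCAB = ["a", "b", "ab", "c"]

def decode(tokens):
    """Join token strings into text, as a message-based environment would see it."""
    return "".join(tokens)

def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens

sampled = ["a", "b", "c"]             # tokens the policy actually sampled
round_trip = encode(decode(sampled))  # tokens recovered from the message text
mismatch = round_trip != sampled      # True: "a" + "b" was merged into "ab"
```

The text is identical in both cases, but a trainer that computes log-probabilities on `round_trip` would be scoring tokens the policy never sampled.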

Grad retweeted
Prime Intellect @PrimeIntellect
The next wave of AI will not be won by better prompts. It will be won by systems that learn from experience. Today, Prime Intellect Lab is out of beta, open for you to start training your own models. The era of self-improving agents is here.

Grad @Grad62304977
@teortaxesTex @rationaleist @rasbt Not that cheap if u target super long context tbf
Yushi Bai @realYushiBai

🧵 1/4 Still waiting for DeepSeek-V4? We (@Zai_org) made DSA 1.8× faster with minimal code change — and it's ready to deliver real inference gains on GLM-5. IndexCache removes 50% of indexer computations in DeepSeek Sparse Attention with virtually zero quality loss. On GLM-5 (744B), we get ~1.2× E2E speedup while matching the original across both long-context and reasoning tasks. On our experimental-sized 30B model, removing 75% of indexers gives 1.82× prefill and 1.48× decode speedup at 200K context. How? 🧵👇 #DeepSeek #GLM5 #Deepseekv4 #LLM #Inference #Efficiency #LongContext #MLSys #SparseAttention

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
pfft yeah, you get to choose bad baselines for a promo.
We need one good unified visualization of memory & compute cost per i-th token.
@rasbt you probably have all formulas for different architectures ready, what do you think?
Oliver Sieberling @osieberling

Would not be too surprised if this was just sth. like:
60 layer hybrid
- 256k "sliding window" attention every 4 blocks ("linear")
- GDN in the remaining blocks
compared to full attn: (60 * (1M)^2) / (15 * (256K)^2 + 45 * little) ≈ 52x speedup
This is Qwen3.5-397B-A17B btw
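
The back-of-envelope ratio in the post above can be reproduced directly. This sketch takes the formula exactly as written and makes two assumptions of its own: "1M" and "256K" are read in binary units, and `linear_layer_cost` stands in for the unspecified "little" term (the post gives no value for it). Function names are illustrative.

```python
SEQ_LEN = 2 ** 20  # "1M" tokens, assuming binary units
WINDOW = 2 ** 18   # "256K" window, assuming binary units

def full_attention_cost(num_layers=60, seq_len=SEQ_LEN):
    """Quadratic cost of full attention in every layer."""
    return num_layers * seq_len ** 2

def hybrid_cost(linear_layer_cost, window_layers=15, linear_layers=45, window=WINDOW):
    """Cost of the hybrid as written in the post: 15 windowed layers + 45 GDN layers."""
    return window_layers * window ** 2 + linear_layers * linear_layer_cost

def speedup(linear_layer_cost):
    return full_attention_cost() / hybrid_cost(linear_layer_cost)

def implied_linear_cost(target_speedup):
    """Solve the post's formula for the per-layer 'little' term."""
    return (full_attention_cost() / target_speedup - 15 * WINDOW ** 2) / 45

upper_bound = speedup(0)           # 64.0: the ratio if the GDN layers were free
little = implied_linear_cost(52)   # per-layer GDN cost that yields the quoted 52x
```

With free GDN layers the formula gives 64x, so the quoted ≈52x corresponds to a nonzero "little" on the order of 5e9 per layer in the same units.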


Casper Hansen @casper_hansen_
@hallerite quality has gone down since ~2024. most of the conferences had work from 2022/2023 published, before agent slop. nowadays, there is so much noise that it’s hard to find signal. I wonder if @Grad62304977 still is able to find some nice papers for you internally tho :)

hallerite @hallerite
It has been quite some time since I was deeply impressed by an AI paper from academia..

sankalp @dejavucoder
i was surprised to see that qwen 3.5 4b frequently fumbled at reversing text sequences. but it was not a shocker when i recalled models do not see text as text, they see tokens... anyways you can RLmaxx the model to make it better at reversing strings. a worthy quest if you ask.
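
String reversal is easy to RL on because the answer is exactly checkable. A minimal sketch of a verifiable reward, assuming a hypothetical `reversal_reward` function and an arbitrary partial-credit scheme (neither is from the post): full reward for an exact match, otherwise up to 0.5 for the fraction of positions that agree.

```python
def reversal_reward(source: str, completion: str) -> float:
    """Reward for reversing `source`. Assumes `source` has no leading or
    trailing whitespace, since the completion is stripped before comparison."""
    target = source[::-1]
    answer = completion.strip()
    if answer == target:
        return 1.0
    # Partial credit: share of aligned positions that match, scaled to at most 0.5.
    matches = sum(a == b for a, b in zip(answer, target))
    return 0.5 * matches / max(len(answer), len(target), 1)

exact = reversal_reward("hello", "olleh\n")
partial = reversal_reward("abc", "cbx")
empty = reversal_reward("abc", "")
```

Dividing by the longer of the two lengths means padding the answer with extra characters lowers the score rather than raising it.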

Grad @Grad62304977
@teortaxesTex @zephyr_z9 Effort control is just a different system prompt with a different length penalty in training
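
A rough sketch of what "different system prompt with different length penalty" could mean in a training reward. Everything here is an assumption for illustration: the prompt strings, the penalty coefficients, and the linear form of the penalty are made up and do not describe any lab's actual recipe.

```python
# Hypothetical effort levels: each pairs a system prompt with a per-token penalty.
EFFORT_PROMPTS = {
    "low": "Reasoning effort: low",
    "medium": "Reasoning effort: medium",
    "high": "Reasoning effort: high",
}
LENGTH_PENALTY = {"low": 1e-3, "medium": 2e-4, "high": 0.0}

def shaped_reward(task_reward: float, num_tokens: int, effort: str) -> float:
    """Task reward minus a length penalty that depends on the effort level."""
    return task_reward - LENGTH_PENALTY[effort] * num_tokens

# The same 1000-token correct rollout is worth less under a lower effort prompt.
rewards = {e: shaped_reward(1.0, 1000, e) for e in EFFORT_PROMPTS}
```

During training the model would see the matching entry of `EFFORT_PROMPTS` as its system prompt, so it can learn to associate each prompt with how much length is tolerated.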
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Juice is entropix. Where did you think frog disappeared to, after saying that LLMs can't become butterflies? Like a few months before they started proving their mathematical chops?
Ankith 🐋/acc @dhtikna

It would be funny if juice in gpt was simply extracted out by the scaffolding and they do some sort of steering vector to nudge to close thinking based on some pre decided values of number of thinking tokens used


Grad @Grad62304977
@_arohan_ Kimi k1.5 was also out the same day as the r1 paper (although DeepSeek r1 preview was a while before iirc)

rohan anil @_arohan_
If O1 had not mentioned thinking traces, thinking etc., I think the rest of the companies would have taken longer to get there. In some sense, the world should be thankful to the O1 team for creating the intelligence explosion through decisions they implicitly made. As well as DeepSeek for reproducing an open recipe that allowed more folks to catch up (iiuc, DeepSeek was the second thinking release after o1?)

Grad @Grad62304977
@HeMuyu0327 U should make the 0.5 learnable. Can do lambda * v1 + (1 - lambda) * v, with lambda being a learnable scalar initialised as 0.5. Also using 0.01 lr and no weight decay for it (works better at smaller scale, but at larger scale a smaller lr works better).
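
The learnable mix suggested above can be sketched without any framework. This is an illustration only: the `ValueMix` class and its method names are hypothetical, the gradient is written out by hand, and the update is plain SGD with the suggested lr of 0.01 and no weight decay. In practice the scalar would be a learnable parameter in an autograd framework.

```python
class ValueMix:
    """out = lam * v1 + (1 - lam) * v, with lam a learnable scalar (init 0.5)."""

    def __init__(self, init: float = 0.5):
        self.lam = init

    def forward(self, v1, v):
        return [self.lam * a + (1 - self.lam) * b for a, b in zip(v1, v)]

    def lam_grad(self, v1, v, grad_out):
        # d(out_i)/d(lam) = v1_i - v_i, so chain rule gives sum_i grad_out_i * (v1_i - v_i).
        return sum(g * (a - b) for g, a, b in zip(grad_out, v1, v))

    def sgd_step(self, v1, v, grad_out, lr: float = 0.01):
        # Plain SGD on lam, no weight decay.
        self.lam -= lr * self.lam_grad(v1, v, grad_out)

mix = ValueMix()
v1, v = [1.0, 0.0], [0.0, 1.0]
out = mix.forward(v1, v)                        # [0.5, 0.5] at initialisation
target = v1                                     # toy target: pretend v1 is what we want
grad_out = [o - t for o, t in zip(out, target)] # gradient of 0.5 * squared error
mix.sgd_step(v1, v, grad_out)                   # lam moves from 0.5 toward v1
```

At initialisation this reproduces the fixed 0.5 * v1 + 0.5 * v mix, so the learnable version starts from the same behaviour.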

Muyu He @HeMuyu0327
It always baffles me how adding the first value vector to later attention layers' value vector improves the model. Ran a few ablations on nanochat and found the reasons might be more elusive than the paper suggested.

The paper made at least two claims based on the observation that adding v1 to subsequent v (0.5 * v1 + 0.5 * v) helps: (1) the improvement comes from the information in the first value vector; (2) adding information to the key/query vector "confuses the model" and hurts performance.

Regarding (1), some earlier studies (+ nanochat's native implementation) have found that v1 is not special, and what is special is some sort of linear transformation of the initial token embedding, h0. To test this, we pass h0 to the same value matrix in subsequent attention layers, and add the resulting v1' to the current layer's v (0.5 * v1' + 0.5 * v). We find comparable performance to adding v1 directly (p1).

Regarding (2), the paper's experiment is unprincipled: they add h0 directly to the current residual h (0.5 * h0 + 0.5 * h) and find the performance drops relative to just adding v. However, this technique directly modifies the residual stream, so the comparison is not apples-to-apples. To test whether adding h0 information to QK hurts, we bypass the residual stream and directly pass h0 to the same QK matrices in subsequent attention layers, and find comparable performance to adding v1 (p2). This shows that feeding h0 to QK neither hurts nor improves the model.

There are a few more hypotheses to test regarding adding v1 to subsequent attention layers, and they together form this great puzzle of "why value vector" and "why from the first embedding". We will share more results that try to understand what's behind this phenomenon better. At least two ablations are already lined up!

Grad @Grad62304977
@saurabh_shah2 @agarwl_ Ya, we were the first to use the same setup shown in these slides, in intellect-3 (can see the paper). Was also done after in glm-5.

Rishabh Agarwal @agarwl_
I gave a talk at ICLR 2026 about how we are scaling RL on frontier LLMs with 1T+ parameters, on experimental data from our physical lab at Periodic! Here's a rough recording of the talk:

Grad @Grad62304977
Ya true, SWA is defo not enough. I still think DSA is up there with the biggest arch changes. I think lots believed something like it should work, but making it work was the hardest part. I do think tho that a DSA SWA hybrid efficiency wise is the same or better than the DeepSeek arch (and then have to model performance).