Grad
@Grad62304977
3.7K posts
Joined October 2020
2.6K Following · 9K Followers
Grad@Grad62304977·
@code_star DeepSeek did a version of this too; we also used it during the training of Trinity Large
Grad retweeted
will brown@willccbb·
you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide: docs.primeintellect.ai/guides/rl-trai…
François Fleuret@francoisfleuret·
BTW are hyper-networks a thing of the past?
Grad@Grad62304977·
ByteDance has been known to stick with PPO (see the VAPO and Seed 1.5 Thinking papers), and StepFun also uses PPO, so there's a bit of disagreement here I guess. Personally I think it's mostly that the infra work isn't worth it, and that rollouts per example (group size) are a nice, simple way to scale compute for better performance/credit assignment (a better advantage baseline). PPO doesn't have the same property: you can do more rollouts per example, like Seed 1.5 Thinking does, but the advantage is still always computed individually, so it's not clear how these compare.
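The group-size point can be made concrete. Below is a minimal sketch of a GRPO-style group-relative advantage (function name and the epsilon value are my own, not from any specific codebase): per prompt you sample a group of rollouts and normalize each reward against the group's mean and standard deviation, so a larger group gives a lower-variance baseline, whereas PPO scores each rollout individually against a learned value function.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward against its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# one prompt, a group of 4 rollouts with binary rewards
adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # approximately [1, -1, -1, 1]
```

Note that a zero-variance group (all rollouts succeed or all fail) collapses to near-zero advantages, which is exactly why group size matters for credit assignment.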
Kirill Pavlenko@short_cast·
What puzzles me about the CUDA agent paper is that they use token-level PPO and don't even mention GRPO in the text. Like, what?
Grad retweeted
Justus Mattern@MatternJustus·
Incredibly excited to introduce Proximal with @navidkpr and @calvinchen! Proximal is a new data company: we believe that training data is a problem solved through creative technical ideas rather than hiring thousands of contractors x.com/ProximalHQ/sta…
Proximal@ProximalHQ

Today, we are announcing Proximal. Proximal is a research lab for data. Our core belief is that data which is complex enough to teach today's frontier models is not bottlenecked by domain experts, but by great ideas and excellent software.

We are excited about a world in which coding agents can autonomously run for multiple weeks, solve the hardest technical problems and discover novel ideas that advance progress in various domains of science and engineering. We believe that we are not far from this future, but that the biggest bottleneck preventing us from achieving it is training data.

Many companies work on data, but most of them are approaching it the wrong way. Historical capability breakthroughs are the result of creative engineers discovering scalable data collection methods, not thousands of contractors manually writing task demonstrations. Inevitably, the potential impact of human data will become smaller and smaller as model capabilities increase: agents are already outperforming most humans in many domains, and the number of experts capable of judging model outputs shrinks with every new model release.

Proximal is a new data company. We are not a recruiting firm or a talent marketplace, but a research and engineering organization that treats data as a problem which deserves the same level of rigor as work on training algorithms and model architectures. We think that this is the most impactful work towards agents that can autonomously solve complex technical problems, and intend to share our research and progress in the open.

Zhenghao Xu@ZhenghaoXu0·
@Grad62304977 @Zai_org For the optimizer, Kimi k1.5 mentioned resetting optimizer states per global step. Here, with fully async training, global step = mini step, so Adam becomes SGD if eps is large and SignGD if eps is tiny.
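The reduction is easy to verify with a single Adam update from freshly reset state: bias correction makes the corrected moments exactly g and g², so the step is -lr * g / (|g| + eps). The sketch below (function name and test values are illustrative) shows both limits:

```python
import numpy as np

def adam_step_from_reset(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update starting from zeroed optimizer state."""
    m = (1 - beta1) * g            # first moment from zero state
    v = (1 - beta2) * g ** 2       # second moment from zero state
    m_hat = m / (1 - beta1)        # bias correction -> exactly g
    v_hat = v / (1 - beta2)        # bias correction -> exactly g**2
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

g = np.array([0.3, -2.0, 0.01])
signgd = adam_step_from_reset(g, eps=1e-12)  # eps tiny  -> -lr * sign(g)
sgd    = adam_step_from_reset(g, eps=1e3)    # eps large -> roughly -(lr/eps) * g
```

With eps tiny the per-coordinate magnitude is constant (SignGD); with eps huge the denominator is dominated by eps, recovering a rescaled SGD step.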
Grad@Grad62304977·
Really nice tech report, huge props to @Zai_org for still releasing these, as they are very valuable for the open-source community. Nice to see many similarities with our recipe for INTELLECT-3; excited for the further work on the RL recipe, we already have some stuff cooking up here. A really nice part was this, which means you can basically get away with RL without the need for optimizer states, so Adam -> SGD and Muon -> SGD+NS (I remember seeing this elsewhere but can't recall where right now lol). Also nice details on agentic data curation, especially for things that usually aren't mentioned, like terminal envs and slide generation.
samsja@samsja19

Amazing tech report for an amazing model, probably the most precise open-source recipe towards a SOTA model. I was positively surprised to see many similarities between their recipe and what we did during INTELLECT-3 and have implemented in prime-rl.
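The Muon -> SGD+NS reduction mentioned above means a state-free step simply orthogonalizes the raw gradient instead of a momentum buffer. A sketch of the Newton-Schulz iteration (quintic coefficients are those published with the open-source Muon implementation; applying it directly to the gradient is my reading of the state-free variant):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G: push all singular values towards 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from Muon
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm 1 => spectral norm <= 1
    if transposed := X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the short-fat orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

A state-free "Muon" update would then be `w -= lr * newton_schulz(grad)`, with no optimizer state carried between steps.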

Grad@Grad62304977·
Also, the super cheap DSA conversion can be a huge unlock, especially for scaling RL. Excited to see what dsk4 does too with regard to sparse attention.
Grad@Grad62304977·
Nice to also see a bit more detail on where the data is actually sourced from, and cool to see synthetic-2 is part of it
Gabriele Berton@gabriberton·
2) RMSNorm everywhere, without the affine transform. An affine transform is common in EVERY normalization layer, as it is directly embedded into batch norm and layer norm, so many people don't even know it exists. 3) ReLU²: I've never seen this before in LLMs; most LLMs use SwiGLU. [3/N]
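For reference, a minimal sketch of the two components described above, with no learnable parameters as per the post (function names are my own):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm without the affine transform: rescale to unit root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def relu_squared(x):
    """ReLU² activation: square of the positive part."""
    return np.maximum(x, 0.0) ** 2
```

Dropping the affine transform removes the per-channel gain that LayerNorm and BatchNorm normally learn; the output simply has unit RMS along the last axis.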
Gabriele Berton@gabriberton·
The most interesting thing I've seen in a while: the recipe by @karpathy to reduce GPT-2 1.5B training cost from $43,000 to $73! 7 years of improvements over vanilla GPT in 10 points. Let's start with the uncommon ones: 1) Value Embeddings: I've never seen this in any LLM, [1/N]
Grad@Grad62304977·
@stochasticchasm @seygalare Well, at least the way MiMo v2 Flash and Tinker do it, your advantage is just the IS ratio
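For context, a sketch of the truncated-importance-sampling correction under discussion (function name and the cap value are illustrative, not from any of the named codebases): each token's loss is weighted by the ratio of the trainer's probability to the inference engine's, clipped from above for stability and treated as a constant in the backward pass.

```python
import numpy as np

def tis_weights(logp_train, logp_infer, cap=2.0):
    """Per-token truncated importance weights for trainer/inference mismatch."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_infer))
    return np.minimum(ratio, cap)  # detached (no gradient) when used in the loss

w = tis_weights([-1.0, -0.5], [-1.0, -2.0])  # ratios [1.0, e^1.5]; second capped at 2
```

The cap is what prevents this from being a full on-policy conversion: wherever the ratio is truncated, the correction is deliberately biased.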
stochasm@stochasticchasm·
Why doesn't the importance sampling trick for reducing trainer-inference mismatch let you convert any off-policy rollout into an on-policy one?
Grad retweeted
Prime Intellect@PrimeIntellect·
Introducing Lab: A full-stack platform for training your own agentic models Build, evaluate and train on your own environments at scale without managing the underlying infrastructure. Giving everyone their own frontier AI lab.
OpenBMB@OpenBMB·
RLVR (Reinforcement Learning with Verifiable Rewards) boosts reasoning in Math & Code, but applying it to general topics is hard because building rule-based verifiers for everything is impossible. 🤯

Today, we present RLPR: new research from THUNLP (OpenBMB member), NUS, and collaborators. A novel, verifier-free framework that extrapolates RLVR to general domains by using the LLM's own probability as the reward.

🤗 Paper: huggingface.co/papers/2506.18…
📄 arXiv: arxiv.org/abs/2506.18254
💻 Code: github.com/openbmb/RLPR

Why it matters:
1️⃣ No External Verifiers Needed: Traditional RLVR relies on complex, domain-specific verifiers. RLPR removes this bottleneck entirely. It allows models to learn from general-domain data without expensive human engineering or separate reward models. 🔓
2️⃣ Intrinsic Probability Reward: Instead of a binary "Pass/Fail", RLPR uses the LLM's intrinsic token probability of the reference answer as a fine-grained reward signal. We introduce Reward Debiasing and Adaptive Std-Filtering to stabilize this noisy signal, turning raw probability into a robust training guide. 📈
3️⃣ SOTA Performance & Efficiency: RLPR isn't just simpler; it's better. It outperforms strong verifier-based methods (like General Reasoner) and beats concurrent verifier-free methods (VeriFree) by 7.6 points on TheoremQA and 7.5 points on Minerva. It achieves consistent gains across Gemma, Llama, and Qwen models. 🚀

RLPR offers a scalable path to evolve LLM reasoning beyond just Math and Code, unlocking the potential of RL on general data.

Dataset: huggingface.co/datasets/openb…
Models: huggingface.co/collections/op…

#AI #THUNLP #OpenBMB #LLM #ReinforcementLearning #Reasoning
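A minimal sketch of the probability-as-reward idea (my own simplification: mean token probability of the reference answer, without the debiasing and std-filtering the announcement describes):

```python
import numpy as np

def probability_reward(ref_answer_logprobs):
    """Verifier-free reward: mean probability the policy assigns to the
    reference answer's tokens, instead of a binary pass/fail signal."""
    return float(np.mean(np.exp(ref_answer_logprobs)))

r = probability_reward(np.log([0.5, 0.25, 0.25]))  # mean of token probabilities
```

Unlike a rule-based verifier, this yields a graded signal: the more probability mass the model puts on the reference answer, the higher the reward.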
Grad@Grad62304977·
For Kimi K2 it can be thought of as: if the advantage is positive, push the model's logprobs for this sequence higher than before (increase the KL term); if the advantage is negative, push them down (decrease the KL term). For Kimi K2.5 it's IcePop (not CISPO; MiniMax M2.1 seems to use IcePop/MIS too), but with an extra KL term which is now always pushed to 0 (ratio pushed to 1, so it's directly always minimizing the mismatch)
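A sketch of the IcePop-style masking mentioned here (the band bounds are illustrative): tokens whose trainer/inference probability ratio drifts outside a band are dropped from the loss entirely, rather than having their ratio clipped as in CISPO-like schemes.

```python
import numpy as np

def icepop_mask(logp_train, logp_infer, low=0.5, high=2.0):
    """Keep only tokens whose trainer/inference ratio stays inside [low, high]."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_infer))
    return (ratio >= low) & (ratio <= high)

mask = icepop_mask([-1.0, -0.2, -3.0], [-1.1, -2.0, -1.0])
```

Masked-out tokens contribute no gradient at all, which is the key difference from clipping: a clipped token still pushes with a truncated weight, a masked token is silent.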
Kimbo@kimbochen·
@Kimi_Moonshot uses Online Policy Mirror Descent (OPMD) for policy optimization rather than the PPO lineage (TRPO, GRPO, etc).

OPMD formulates the problem as a constrained optimization problem, where you maximize expected rewards subject to a KL-divergence constraint against a reference policy. The formulation has a closed-form solution with an intractable term, and that intractable term can be approximated by the mean reward of a sampled rollout group. With the closed-form solution, you then design a surrogate loss function that minimizes the formulation with a mean squared error, and after some calculus gymnastics you get the RL objective.

For Kimi K2.5, the team uses CISPO/IcePop-style importance-sampling-ratio masking at the token level.

The objective looks so different from GRPO to my noob eyes, but apparently it works well for Kimi. Philosophically and empirically, what do people see? Tagging the goats @Grad62304977 @stochasticchasm @gm8xx8 @snowclipsed @eliebakouch Idk who at Kimi to tag lol
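In symbols, the derivation sketched above goes roughly as follows (a paraphrase of the k1.5-style objective, with $\tau$ as the regularization temperature; not an exact reproduction of the paper):

```latex
% KL-regularized objective against a reference policy
\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[ r(x, y) \right]
  - \tau \,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)

% closed-form optimum, with an intractable normalizer Z(x)
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\, r(x,y)/\tau}}{Z(x)},
\qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, e^{\, r(x,y)/\tau}

% taking logs, the optimum satisfies, for every y:
r(x, y) - \tau \log Z(x)
  - \tau \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} = 0

% squared-error surrogate: approximate \tau \log Z(x) by the
% mean reward \bar{r} of the sampled rollout group
L(\theta) = \mathbb{E}\!\left[
  \Big( r(x, y) - \bar{r}
    - \tau \log \tfrac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \Big)^{2} \right]
```

Minimizing the squared residual drives the log-ratio towards the (centered) reward, which is the "MSE surrogate" described in the post.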