relu @iamrussianagent
1.1K posts
you'll never know
Joined May 2017
81 Following · 36 Followers
relu@iamrussianagent·
@allgarbled Next. Token. Prediction. Yes, typos make the response worse.
0 replies · 0 reposts · 0 likes · 89 views
gabe@allgarbled·
I think you get worse results from LLMs if your messages are full of typos and grammatical mistakes. I’m sure this would not show up in benchmarks, but I still believe it. The models respect you less and become lazier as a result.
75 replies · 10 reposts · 605 likes · 37.8K views
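The claim is testable in miniature: conditioning on noisy text shifts a causal LM's next-token distribution. A minimal sketch below measures that shift as a KL divergence, using Hugging Face transformers with gpt2 as a stand-in model; the prompts and model choice are illustrative assumptions, not anything from the thread.

```python
# Minimal sketch (assumptions: transformers installed, "gpt2" as a stand-in;
# any causal LM would do). Quantifies how much the next-token distribution
# moves when the same prompt is rewritten with typos.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_logprobs(prompt: str) -> torch.Tensor:
    """Log-distribution over the next token given `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits at the last position
    return torch.log_softmax(logits, dim=-1)

clean = "Please explain how gradient descent works."
typod = "plese explan how gradint descent wroks."

p, q = next_token_logprobs(clean), next_token_logprobs(typod)
# KL(clean || typo'd): how far conditioning on typos moves the prediction.
kl = torch.sum(p.exp() * (p - q)).item()
print(f"KL(clean || typos) = {kl:.3f} nats")
```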
relu@iamrussianagent·
@trymirai This statement is independent of quantization. All it's saying is that Muon approximates the Hessian
0 replies · 0 reposts · 0 likes · 36 views
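For context on the Muon claim: Muon's core operation is a Newton–Schulz iteration that approximately orthogonalizes the momentum matrix, which is the sense in which it gets described as an approximate second-order (Hessian-like) preconditioner. A sketch following the coefficients in the publicly released Muon implementation; this is context, not relu's code.

```python
# Core of Muon: approximately compute msign(G) = U V^T from the SVD
# G = U S V^T, i.e. whiten the update's spectrum, via a quintic
# Newton-Schulz iteration. Coefficients follow the public Muon release.
import torch

def orthogonalize_via_newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315   # tuned quintic coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the spectrum is in range
    if X.size(0) > X.size(1):           # work with the short side on the left
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```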
relu@iamrussianagent·
@phoebeyao right, base models tend to be calibrated at a sequence level but RL/SFT models are not. I'm doubtful this would work
0 replies · 0 reposts · 0 likes · 33 views
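What "calibrated at a sequence level" would mean operationally: treat a (length-normalized) sequence probability as the model's confidence in its answer and check it against accuracy. A minimal expected-calibration-error sketch; the (confidence, correct) pairs are placeholder inputs, assumed for illustration rather than taken from any run.

```python
# ECE sketch: bin predictions by confidence and compare mean confidence to
# accuracy in each bin. Near-zero ECE = well calibrated; relu's point is that
# this tends to hold for base models but drifts after SFT/RL.
from typing import List, Tuple

def expected_calibration_error(
    preds: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(preds)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Placeholder inputs; in practice these come from scored completions.
print(expected_calibration_error([(0.9, True), (0.8, True), (0.3, False)]))
```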
relu@iamrussianagent·
@courtlandleer such a grifter. using 12 rollouts and a judge WAOW it's never been done before
0 replies · 0 reposts · 0 likes · 73 views
Courtland Leer@courtlandleer·
it’s hard for me to express sufficiently just how dishonest it is (if you really work on memory) to present a LongMem score as some kind of breakthrough. It’s a 3-year-old benchmark, the results here are dishonest, it’s a marginal amount of tokens in contemporary AI, and everyone aces it already
Dhravya Shah@DhravyaShah

x.com/i/article/2035…

39 replies · 17 reposts · 464 likes · 70K views
relu retweeted
Funes@Bulkington___·
Always fascinated by how people in the Middle Ages, for hundreds of years, just lived amongst the ever-crumbling Roman ruins. It was just a part of daily life for them.
224 replies · 1.5K reposts · 33.5K likes · 4.3M views
Dan McAteer@daniel_mac8·
No prompt is safe. This is a real problem if your prompts are highly optimized and you invested a lot of effort into them. What can you do?
13 replies · 1 repost · 37 likes · 7.9K views
relu@iamrussianagent·
@shauryr A second for loop is very powerful. Once we add a third it might be AGI
0 replies · 0 reposts · 1 like · 41 views
Shaurya Rohatgi@shauryr·
Spent the last week experimenting with auto-research and PostTrainBench (github.com/aisa-group/Pos…); here's what I learned.

The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments:
- Rejection sampling training with a lower lr improves performance
- Using Qwen3's dedicated token for structured chain-of-thought
- Filtering a numerical subset of MetaMathQA to remove noisy samples

The results were real too. PostTrainBench reports 41% for Opus 4.6 on GSM8K with 1 GPU. In my experiment, by v4 (~10 hours in) the automated loop had pushed that to 69% on 8xH200.

But it's not all smooth sailing. The LLM-harness combo has a tendency to get stuck in loops, making incremental tweaks instead of fundamentally rethinking its approach. At one point it got stuck doing endless model souping (averaging weights between checkpoints) rather than exploring new directions. It's frustrating to watch the system confidently run an experiment you already know won't yield real gains. And in the end it gave up, LOL, and said that nothing beats straightforward SFT.

My next direction is to add an observer LLM with a limited "nudge" budget to the main loop for when the trainer agent is going in circles. That way the observer really has to think about when to nudge, since it only gets k turns.

The direction is extremely promising. The volume of experiments, iterations, and findings an LLM can produce autonomously is hard to match manually. It just needs better steering. Initially I thought it would cost a lot in API tokens, but the agent is mostly waiting for the model to train and doesn't use many tokens.

Here is the final report -
5 replies · 1 repost · 16 likes · 3.6K views
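The observer-with-a-nudge-budget idea in the last paragraph reduces to a small control loop. A hypothetical sketch: trainer_step, observer_llm, and looks_stuck are stand-in callables, not part of PostTrainBench.

```python
# Sketch of an observer LLM with a limited "nudge" budget wrapped around an
# auto-research trainer loop. With only k nudges available, the observer has
# to spend interventions where they matter.
def run_with_observer(trainer_step, observer_llm, looks_stuck, k: int = 3,
                      max_iters: int = 50):
    nudges_left = k
    history = []
    for _ in range(max_iters):
        result = trainer_step(history)          # one auto-research iteration
        history.append(result)
        if nudges_left > 0 and looks_stuck(history):
            # Observer reads the trajectory and suggests a change of direction
            # (e.g. "stop model souping, try a different data filter").
            nudge = observer_llm(history)
            history.append({"role": "observer", "nudge": nudge})
            nudges_left -= 1
    return history
```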
relu retweeted
ʏᴏᴜʀ ᴘᴀʟ, ᴅᴀᴋᴏᴛᴀ
At least 80% of the AI slop I have encountered in real life has been independent beer logos and cans
6 replies · 7 reposts · 1.1K likes · 16.7K views
relu retweeted
Dwarkesh Patel@dwarkesh_sp·
When Copernicus proposed heliocentrism in 1543, it was actually less accurate than Ptolemy's geocentric model - a system refined over 1,400 years with epicycles precisely tuned to match observed planetary positions. It took another 70 years before Kepler, working from Tycho Brahe's unprecedentedly precise observations, replaced Copernicus’s circles with ellipses - finally making heliocentrism empirically superior. Terence Tao's point is that science needs a high temperature setting. If we only fund and follow what's most state of the art today, we kill the ideas that might need decades of work to surpass some overall plateau.
119 replies · 585 reposts · 4.7K likes · 508.1K views
relu@iamrussianagent·
@NoahZiems GEPA should be required when making claims about capability
1 reply · 0 reposts · 2 likes · 145 views
Noah Ziems@NoahZiems·
I think many research ideas never work out because they simply aren't tuned properly. Having a reflective, iterative loop like GEPA or autoresearch available is going to save so many good ideas that were never given a fair shot
4 replies · 8 reposts · 107 likes · 6.8K views
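The kind of reflective, iterative loop Noah (and GEPA) point at can be sketched generically. This is not GEPA's actual API; evaluate, reflect_llm, and mutate are hypothetical stand-ins, just to separate "the idea didn't work" from "the prompt was never tuned".

```python
# Generic reflective prompt-search loop: score, reflect on failures,
# mutate, and keep the candidate only if it improves.
def reflective_prompt_search(seed_prompt, evaluate, reflect_llm, mutate,
                             rounds: int = 10):
    best_prompt, best_score = seed_prompt, evaluate(seed_prompt)
    for _ in range(rounds):
        # Reflection: an LLM reads the failures and proposes a concrete fix.
        feedback = reflect_llm(best_prompt, best_score)
        candidate = mutate(best_prompt, feedback)
        score = evaluate(candidate)
        if score > best_score:                  # keep only improvements
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```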
relu retweeted
Dwarkesh Patel@dwarkesh_sp·
Terence Tao spent a year at the Institute for Advanced Study - no teaching, no random events or committees, just unlimited time to think. But after a few months, he ran out of ideas. Terence thinks that mathematicians and scientists need a certain level of randomness and inefficiency to come up with new ideas.
108 replies · 514 reposts · 5.1K likes · 659.3K views
relu@iamrussianagent·
@Shiwei_Liu66 Yeah, interesting. Honestly, I've never found weight decay to be critical, even when training smaller models.
0 replies · 0 reposts · 0 likes · 12 views
Shiwei Liu@Shiwei_Liu66·
Some of the most surprising things in our paper, at least to me, are the following:
1. For a long time, I wondered why we still need weight decay in LLMs, since they rarely seem to overfit. Only recently did I realize that weight decay may actually help us train deeper layers more effectively.
2. Mixture-of-experts is usually viewed as an effective way to scale up a model's width, but it turns out that this sparse connectivity also helps signals propagate more effectively through depth.
Shiwei Liu@Shiwei_Liu66

Residual connections and pre-norm are not the whole story behind depth utilization. Our new paper shows that many seemingly different design choices — MoE, grouped-query attention, weight decay, and longer sequence length — can be understood through one unifying lens: sparsity. These components induce different forms of sparsity, which reduce output variance and in turn preserve healthier gradient flow across depth. Strikingly, these techniques also complement each other remarkably well: when combined, they lead to substantial improvements in depth utilization and notable gains in downstream accuracy.
Paper page: pumpkin-co.github.io/SparsityAndCoD/
Arxiv: arxiv.org/pdf/2603.15389
Led by @pumpkinnnnne

3 replies · 2 reposts · 57 likes · 7.4K views
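A back-of-envelope version of the variance claim, with notation of my own and an independence assumption the paper may not use: if a residual branch sums n roughly independent zero-mean contributions of variance sigma^2, keeping only k of them active (e.g. k-of-n expert routing) shrinks the branch variance, so the residual stream grows more slowly across depth.

```latex
% m_i are 0/1 activation masks (sparsity), x_i independent, zero-mean,
% Var[x_i] = \sigma^2. Dense branch: variance n\sigma^2; sparse: k\sigma^2.
\[
\operatorname{Var}\!\Big[\sum_{i=1}^{n} m_i x_i\Big] = k\,\sigma^2,
\qquad m_i \in \{0,1\},\quad \sum_{i=1}^{n} m_i = k < n .
\]
```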
relu@iamrussianagent·
@leerob I don't see why people care. The point of the open model ecosystem is to use it
0 replies · 0 reposts · 0 likes · 44 views
Lee Robinson@leerob·
Yep, Composer 2 started from an open-source base! We will do full pretraining in the future. Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training. This is why evals are very different. And yes, we are following the license through our inference partner terms.
Fynn@fynnso

was messing with the OpenAI base URL in Cursor and caught this: accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast. So Composer 2 is just Kimi K2.5 with RL. At least rename the model ID

358 replies · 199 reposts · 2.8K likes · 1.4M views
relu@iamrussianagent·
@paraschopra OOD data is OOD. It honestly is hard to remember that in the world of coding agents! Same reason toon didn’t last
0 replies · 0 reposts · 0 likes · 198 views
Paras Chopra@paraschopra·
We found a task where LLMs struggle massively! Give them a coding problem in Python and they'd work great. Give the same problem in brainfuck and zero-shot their performance is ~0% +[--------->+<]>+.++[--->++<]>+.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

91 replies · 32 reposts · 1K likes · 175K views
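The snippet in Paras's tweet is ordinary brainfuck, and its semantics are mechanically trivial; a minimal interpreter sketch (8-bit wrapping cells, no input handling) runs it, and tracing it by hand it prints ":)". Which is the point: the collapse is about unfamiliar surface form, not hard semantics.

```python
# Minimal brainfuck interpreter: a 30,000-cell tape of 8-bit wrapping cells,
# pre-matched brackets, and the six operators the snippet uses.
def brainfuck(code: str) -> str:
    tape, ptr, out, pc = [0] * 30000, 0, [], 0
    jumps, stack = {}, []
    for i, ch in enumerate(code):               # pre-match [ ] pairs
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    while pc < len(code):
        ch = code[pc]
        if ch == ">": ptr += 1
        elif ch == "<": ptr -= 1
        elif ch == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".": out.append(chr(tape[ptr]))
        elif ch == "[" and tape[ptr] == 0: pc = jumps[pc]   # skip loop
        elif ch == "]" and tape[ptr] != 0: pc = jumps[pc]   # repeat loop
        pc += 1
    return "".join(out)

print(brainfuck("+[--------->+<]>+.++[--->++<]>+."))  # prints ":)"
```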
relu retweeted
Shivers@thinkingshivers·
Not enough people talk about how unpleasant vibecoding is.

The best analogy I can think of is driving. It's cool that we can just hop in a car and drive to the store. It's a lot faster than walking. And yet, it's so stressful and infuriating, we had to invent a new word just to describe its effect on people: "road rage."

AI-assisted coding is the same. It's so much faster--there's no going back to coding everything by hand, the equivalent of walking everywhere. And yet it's incredibly annoying and stressful. It's characterized by annoying delays between requests, time-wasting misunderstandings, blatant lying, and absurd overconfidence.

Hopefully this gets better as models improve.
75 replies · 27 reposts · 675 likes · 37.1K views
relu retweeted
Stefan Schubert@StefanFSchubert·
Chess is a really atypical profession. When we watch chess, we don't just care about the objective quality of the moves: we care about them having been decided by a human. By contrast, when we hire a doctor or an accountant we typically just care about the output.
Dr. Dominic Ng@DrDominicNg

Chess is 30 years ahead of every other profession in dealing with AI. The best case study we have for what's coming. 4 lessons: 1. Human-AI collaboration had a 15-year shelf life in chess. "Human in the loop" is a phase.

46 replies · 34 reposts · 958 likes · 69K views
relu@iamrussianagent·
@DrDominicNg The world is not bounded like chess
0 replies · 0 reposts · 0 likes · 13 views
Dr. Dominic Ng@DrDominicNg·
Chess is 30 years ahead of every other profession in dealing with AI. The best case study we have for what's coming. 4 lessons: 1. Human-AI collaboration had a 15-year shelf life in chess. "Human in the loop" is a phase.
156 replies · 246 reposts · 5.4K likes · 1.9M views
relu@iamrussianagent·
@Sauers_ Don't we already do that? Maximizing the likelihood and minimizing the negative log-likelihood optimize the same thing
1 reply · 0 reposts · 2 likes · 107 views
Sauers@Sauers_·
You can just train a neural network with maximum likelihood estimation. Lol. But almost no one does it
7 replies · 2 reposts · 99 likes · 7.3K views
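The identity behind relu's reply, spelled out: maximizing likelihood and minimizing negative log-likelihood (the usual cross-entropy loss) pick the same parameters, because log is monotone and flipping the sign turns a maximization into a minimization.

```latex
% Training with cross-entropy loss is maximum likelihood estimation.
\[
\hat\theta
= \arg\max_\theta \prod_{i=1}^{N} p_\theta(x_i)
= \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x_i)
= \arg\min_\theta \underbrace{-\sum_{i=1}^{N} \log p_\theta(x_i)}_{\text{NLL / cross-entropy loss}} .
\]
```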