relu

1.1K posts

relu

@iamrussianagent

youll never know

Katılım Mayıs 2017

81 Takip Edilen36 Takipçiler

relu@iamrussianagent·3h

@allgarbled Next. Token. Prediction. Yes typos make the response worse

English

gabe@allgarbled·3h

I think you get worse results from LLMs if your messages are full of typos and grammatical mistakes. I’m sure this would not show up in benchmarks, but I still believe it. The models respect you less and become lazier as a result.

English

605

37.8K

relu@iamrussianagent·4h

@trymirai This statement is independent of quantization. All its saying is Muon approximates the Hessian

English

Mirai@trymirai·8h

Why Muon performs exceptionally well on quantized models

ryan mathieu@gapDEEPry

Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map — and measures change by how much the input-output mapping moved. gradient G has SVD G = U Sigma V^T. Muon's update is just U V^T. keep the directions, throw away the magnitudes

English

8.3K

relu@iamrussianagent·4h

@phoebeyao right, base models tend to be calibrated at a sequence level but RL/SFT models are not. I'm doubtful this would work

English

Phoebe Yao@phoebeyao·10h

our team recently completed a study on model metacognition and found that confidence and capability were almost completely uncorrelated across 19 frontier models from different labs. self-certainty as a reward signal assumes confidence correlates with correctness. i wonder whether this holds outside math and code. more to share soon.

Alex Weers@a_weers

Next read is on how to generalize GRPO to domains without verifiers

English

6.6K

relu@iamrussianagent·4h

@courtlandleer such a grifter. using 12 rollouts and a judge WAOW it's never been done before

English

Courtland Leer@courtlandleer·10h

it’s hard for me to express sufficiently just how dishonest it is (if you really work on memory) to present a LongMem score as some kind of breakthrough it’s a 3 year old benchmark, the results here are dishonest, it’s a marginal amount of tokens in contemporary ai, an everyone aces it already

Dhravya Shah@DhravyaShah

x.com/i/article/2035…

English

464

70K

relu retweetledi

Funes@Bulkington___·1d

Always fascinated how people in the middle ages for hundreds of years just lived amongst the ever decrepitating Roman ruins. It was just a part of daily life for them.

English

224

1.5K

33.5K

4.3M

relu@iamrussianagent·22h

@Adrian_Dev_ @MaxRovensky @daniel_mac8 if prompt in output: refuse

English

Adrián@Adrian_Dev_·1d

@MaxRovensky @daniel_mac8 name it

English

146

Dan McAteer@daniel_mac8·1d

No prompt is safe. This is a real problem if your prompts are highly optimized and you invested a lot of effort into them. What can you do?

English

7.9K

relu@iamrussianagent·23h

@shauryr A second for loop is very powerful. Once we add a third it might be AGI

English

Shaurya Rohatgi@shauryr·1d

Spent the last week experimenting with auto-research and post train bench github.com/aisa-group/Pos… ; here's what I learned. The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments: - Rejection sampling training with lower lr improves performance - Using Qwen3's dedicated token for structured chain-of-thought - Filtering a numerical subset of MetaMathQA to remove noisy samples The results were real too. PostTrainBench reports 41% for Opus 4.6 on GSM8K with 1 GPU. In my experiment by v4 (~10 hours in), the automated loop had pushed that to 69% on 8xH200. But it's not all smooth sailing. The LLM-harness combo has a tendency to get stuck in loops making incremental tweaks instead of fundamentally rethinking its approach. At one point it got stuck doing endless model souping (averaging weights between checkpoints) rather than exploring new directions. It's frustrating to watch the system confidently run an experiment you already know won't yield real gains. Also, finally it did give up LOL says that nothing beats straightforward SFT. My next direction is add an observer LLM with a limited "nudge" budget to the main loop when the trainer agent is going in circles. This way the observer is really thinking about when to nudge, as it only get k turns. The direction is extremely promising. The volume of experiments, iterations, and findings an LLM can produce autonomously is hard to match manually. It just needs better steering. Initially I thought it would be a lot in API token costs, but the agent is mostly waiting for the model to train and is not using a lot of tokens. Here is the final report -

English

3.6K

relu retweetledi

ʏᴏᴜʀ ᴘᴀʟ, ᴅᴀᴋᴏᴛᴀ@DEEP_RED_BELLS·1d

At least 80% of the AI slop I have encountered in real life has been independent beer logos and cans

English

1.1K

16.7K

relu retweetledi

Dwarkesh Patel@dwarkesh_sp·2d

When Copernicus proposed heliocentrism in 1543, it was actually less accurate than Ptolemy's geocentric model - a system refined over 1,400 years with epicycles precisely tuned to match observed planetary positions. It took another 70 years before Kepler, working from Tycho Brahe's unprecedentedly precise observations, replaced Copernicus’s circles with ellipses - finally making heliocentrism empirically superior. Terence Tao's point is that science needs a high temperature setting. If we only fund and follow what's most state of the art today, we kill the ideas that might need decades of work to surpass some overall plateau.

English

119

585

4.7K

508.1K

relu@iamrussianagent·1d

@NoahZiems GEPA should be required when making claims about capability

English

145

Noah Ziems@NoahZiems·1d

I think many research ideas never work out because they simply aren't tuned properly. Having a reflective, iterative loop like GEPA or autoresearch available is going to save so many good ideas that were never given a fair shot

English

107

6.8K

relu retweetledi

Dwarkesh Patel@dwarkesh_sp·1d

Terence Tao spent a year at the Institute for Advanced Study - no teaching, no random events of committees, just unlimited time to think. But after a few months, he ran out of ideas. Terence thinks that mathematicians and scientists need a certain level of randomness and inefficiency to come up with new ideas.

English

108

514

5.1K

659.3K

relu@iamrussianagent·2d

@Shiwei_Liu66 yeah interesting, honestly ive never found weight decay to be critical even when training smaller models.

English

Shiwei Liu@Shiwei_Liu66·3d

Some of the most surprising things in our paper, at least to me, are the following: 1. For a long time, I wondered why we still need weight decay in LLMs, since they rarely seem to overfit. Only recently did I realize that weight decay may actually help us train deeper layers more effectively. 2. Mixture-of-experts is usually viewed as an effective way to scale up a model’s width, but it turns out that this sparse connectivity also helps signals propagate more effectively through depth.

Shiwei Liu@Shiwei_Liu66

Residual connections and pre-norm are not the whole story behind depth utilization. Our new paper shows that many seemingly different design choices — MoE, grouped-query attention, weight decay, and longer sequence length — can be understood through one unifying lens: sparsity. These components induce different forms of sparsity, which reduce output variance and in turn preserve healthier gradient flow across depth. Strikingly, these techniques also complement each other remarkably well: when combined, they lead to substantial improvements in depth utilization and notable gains in downstream accuracy. Paper page: pumpkin-co.github.io/SparsityAndCoD/ Arxiv: arxiv.org/pdf/2603.15389 Leading by @pumpkinnnnne

English

7.4K

relu@iamrussianagent·2d

ok it might actually be over

Adam Kobeissi@TKL_Adam

In a sudden turn of events, US 12-month inflation expectations have surged to 5.2%, the highest level since March 2023. In just 3 weeks, markets have gone from pricing-in rate cuts to rate hikes.

English

relu@iamrussianagent·2d

@leerob I don't see why people care. The point of the open model ecosystem is to use it

English

Lee Robinson@leerob·2d

Yep, Composer 2 started from an open-source base! We will do full pretraining in the future. Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training. This is why evals are very different. And yes, we are following the license through our inference partner terms.

Fynn@fynnso

was messing with the OpenAI base URL in Cursor and caught this accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast so composer 2 is just Kimi K2.5 with RL at least rename the model ID

English

358

199

2.8K

1.4M

relu retweetledi

Reverend Scott@Reverend_Scott·3d

I wanna see the CEO of Pfizer taking benzos

Trung Phan@TrungTPhan

Costco CEO Ron Vachris did the “CEO eats his own product” challenge by destroying a hot dog (and confirms the Costco hot dog combo is staying at $1.50 forever). Legend.

English

117

55.6K

1.3M

relu@iamrussianagent·3d

@paraschopra OOD data is OOD. It honestly is hard to remember that in the world of coding agents! Same reason toon didn’t last

English

198

Paras Chopra@paraschopra·3d

We found a task where LLMs struggle massively! Give them a coding problem in Python and they'd work great. Give the same problem in brainfuck and zero-shot their performance is ~0% +[--------->+<]>+.++[--->++<]>+.

Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

English

175K

relu retweetledi

Shivers@thinkingshivers·4d

Not enough people talk about how unpleasant vibecoding is. The best analogy I can think of is driving. It's cool that we can just hop in a car and drive to the store. It's a lot faster than walking. And yet, it's so stressful and infuriating, we had to invent a new word just to describe its effect on people: "road rage." AI-assisted coding is the same. It's so much faster--there's no going back to coding everything by hand, the equivalent of walking everywhere. And yet it's incredibly annoying and stressful. It's characterized by annoying delays between requests, time-wasting misunderstandings, blatant lying, and absurd overconfidence. Hopefully this gets better as models improve.

English

675

37.1K

relu retweetledi

Stefan Schubert@StefanFSchubert·4d

Chess is a really atypical profession. When we watch chess, we don't just care about the objective quality of the moves: we care about them having been decided by a human. By contrast, when we hire a doctor or an accountant we typically just care about the output.

Dr. Dominic Ng@DrDominicNg

Chess is 30 years ahead of every other profession in dealing with AI. The best case study we have for what's coming. 4 lessons: 1. Human-AI collaboration had a 15-year shelf life in chess. "Human in the loop" is a phase.

English

958

69K

relu@iamrussianagent·4d

@DrDominicNg The world is not bounded like chess

English