Andrew Brož

@AndrewBroz

2.6K posts

AI & automation researcher & cellist. Based in Puerto Rico. Reposts & observations about music, math, CS, ML, tech policy, nature, travel, & so on.

Pe Erre 🇵🇷 · Joined April 2009
1.5K Following · 407 Followers
Pinned Tweet
Andrew Brož @AndrewBroz
The popular view that scientists proceed inexorably from well-established fact to well-established fact, never being influenced by any unproved conjecture, is quite mistaken. – Alan Turing (1950)
0 replies · 0 reposts · 3 likes · 946 views
Andrew Brož @AndrewBroz
In British Columbia you can see Another Lake, And Another Lake, and That Other Small Body of Water.
[image]
1 reply · 3 reposts · 28 likes · 928 views
Andrew Brož retweeted
Jack Lindsey @Jack_W_Lindsey
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
[image]
154 replies · 777 reposts · 6.8K likes · 970.8K views
Andrew Brož retweeted
NASA @NASA
Hello, Moon. It’s great to be back. Here’s a taste of what the Artemis II astronauts photographed during their flight around the Moon. Check out more photos from the mission: nasa.gov/artemis-ii-mul…
[4 images]
10K replies · 174.9K reposts · 812.6K likes · 29.4M views
Andrew Brož retweeted
Alex Cohen @alexwcohen
It’s weird how much Claude Code acts like a human researcher. When we’ve been using it at GiveWell, it often makes the same mistakes RAs do - running the right regression on the wrong data, dropping an important control, just scanning for “significant” results instead of taking time to produce a really legible table/chart. It’s still been super helpful so far, but you only get value if you use a “zero error” approach and supervise it pretty closely.
1 reply · 4 reposts · 79 likes · 26.4K views
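The "dropping an important control" failure above is easy to reproduce. A minimal synthetic sketch (toy data, not GiveWell's; assumes numpy and statsmodels are installed): omitting a confounder shifts its effect onto the treatment coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000
z = rng.normal(size=n)                        # confounder: drives both t and y
t = 0.8 * z + rng.normal(size=n)              # "treatment"
y = 1.0 * z + rng.normal(size=n)              # true treatment effect is zero

full = sm.OLS(y, sm.add_constant(np.column_stack([t, z]))).fit()
naive = sm.OLS(y, sm.add_constant(t)).fit()   # important control dropped

print(full.params[1])    # ~0: correct once z is controlled for
print(naive.params[1])   # ~0.5: spurious "significant" effect
```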
Andrew Brož @AndrewBroz
@nikitabier Please for the love of all that is good do NOT target, and certainly do not feel good about, optimizing this metric
0 replies · 0 reposts · 1 like · 51 views
Nikita Bier @nikitabier
Higher.
[image]
60 replies · 50 reposts · 1K likes · 352.5K views
Nikita Bier @nikitabier
The full power of Grok on the algorithm launches next week. It will be the most important change we've made to X.
648 replies · 713 reposts · 10.4K likes · 1.3M views
Andrew Brož retweeted
Chase Brower @ChaseBrowe32432
I painstakingly ran all 20 EsoLang-Bench hard problems through Claude webui. It solved 20/20 (100%). No specialized scaffolding, no expert prompting, no few-shot examples, it just solves them natively. This benchmark just suffocated the models with constrictive scaffolding.
Lossfunk @lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

52 replies · 115 reposts · 1.2K likes · 152.3K views
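For reference, the "no specialized scaffolding" setup amounts to pasting the raw problem text into a single message and grading the reply by hand. A minimal sketch using the official anthropic Python SDK; the model id and the problem file are placeholders, not details from the thread:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical local copy of one benchmark problem; EsoLang-Bench's actual
# distribution format isn't specified in the thread.
problem = open("esolang_bench_problem_01.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=4096,
    messages=[{"role": "user", "content": problem}],  # no system prompt, no few-shot
)
print(response.content[0].text)  # grade the answer by hand
```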
Andrew Brož retweeted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well hand-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually: you come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, and so on. This has been the bread and butter of what I do daily, for two decades. Seeing the agent do this entire workflow end to end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things:

- It noticed an oversight: my parameterless QK-norm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization, and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit, from this "round 1" of autoresearch, is here: github.com/karpathy/nanoc… I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[image]
962 replies · 2.1K reposts · 19.5K likes · 3.6M views
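The loop the post describes (propose a change, run a cheap proxy training, keep it only if validation loss improves, plan from the history) fits in a few lines. A toy sketch, not Karpathy's actual harness; the proposal and training functions below are hypothetical stand-ins:

```python
import random

def short_training_run(config: dict) -> float:
    """Stand-in for a cheap depth=12-style proxy run; returns validation loss."""
    return (config["wd"] - 0.1) ** 2 + (config["beta2"] - 0.95) ** 2 + random.gauss(0, 1e-3)

def propose_change(config: dict, history: list) -> dict:
    """Stand-in for the agent reading past results and proposing an edit."""
    candidate = dict(config)
    knob = random.choice(list(candidate))
    candidate[knob] *= random.uniform(0.8, 1.25)
    return candidate

config, history = {"wd": 0.05, "beta2": 0.99}, []
best = short_training_run(config)

for _ in range(700):                      # ~700 autonomous changes, as in the post
    cand = propose_change(config, history)
    loss = short_training_run(cand)
    history.append((cand, loss))
    if loss < best:                       # keep only changes that improve val loss
        config, best = cand, loss

print(config, best)
```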
Andrew Brož @AndrewBroz
@michaelandregg Hey Michael! Caveat one has a clear path forward. What is your team's current thinking about plasticity rules?
0 replies · 0 reposts · 0 likes · 344 views
Michael Andregg @michaelandregg
Some caveats: We can't trace the actual motor neurons because the body was not scanned. However we do know what the brain does when it wants to move in certain ways and that's what we connected to the NeuroMechFly. This is a real limitation of the FlyWire connectome, which is why we plan to scan both the brain and the body. Another limitation is that we're using Leaky Integrate-and-Fire, which doesn't have any kind of plasticity rules. This fly cannot form long-term memories atm.
12 replies · 28 reposts · 628 likes · 83.3K views
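Leaky integrate-and-fire, the neuron model mentioned above, is a one-line membrane update with a spike-and-reset rule and no learned state, which is why "no plasticity" means no long-term memory. A minimal sketch with illustrative parameter values (not FlyWire's or Shiu et al.'s):

```python
import numpy as np

def lif_step(v, current, dt=1e-4, tau=0.02, v_rest=-70e-3,
             v_thresh=-50e-3, v_reset=-70e-3, r_m=1e7):
    """One Euler step of tau * dv/dt = -(v - v_rest) + R*I, with spike/reset."""
    v = v + dt * (-(v - v_rest) + r_m * current) / tau
    spiked = v >= v_thresh
    return np.where(spiked, v_reset, v), spiked

v = np.full(10, -70e-3)                      # 10 neurons at rest (volts)
for _ in range(1000):
    v, spikes = lif_step(v, current=2.5e-9)  # constant 2.5 nA drive
    # No plasticity: nothing here updates weights or parameters, so the
    # network cannot form long-term memories, as the post notes.
```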
Michael Andregg @michaelandregg
We've uploaded a fruit fly. We took the @FlyWireNews connectome of the fruit fly brain, applied a simple neuron model (@Philip_Shiu Nature 2024) and used it to control a MuJoCo physics-simulated body, closing the loop from neural activation to action. A few things I want to say about what this means and where we're going at @eonsys. 🧵
332 replies · 1.3K reposts · 8K likes · 1.8M views
Andrew Brož retweeted
ish.exe @ishtwts
A programmer had a problem. He thought to himself, "I know I'll solve it with threads!". has now problems. two He
34 replies · 355 reposts · 6.6K likes · 208.8K views
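The scrambled punchline is the joke: unsynchronized threads interleave nondeterministically. A minimal sketch of the race it mimics (output order may vary across runs):

```python
import threading

words = []

def speak(phrase):
    for w in phrase.split():
        words.append(w)  # shared state, no lock

t1 = threading.Thread(target=speak, args=("now he",))
t2 = threading.Thread(target=speak, args=("has two problems",))
t1.start(); t2.start()
t1.join(); t2.join()
print(" ".join(words))   # e.g. "has now he two problems"
```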
Andrew Brož retweeted
Math, Inc. @mathematics_inc
One week later, Gauss autonomously formalized Viazovska’s proof that the optimal sphere packing in dimension 24 is given by the Leech Lattice. Gauss’ original proof consisted of 450K lines of Lean code, based on the original paper and 20+ additional references.
4 replies · 10 reposts · 315 likes · 127.9K views
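For a sense of what those 450K lines contain, a Lean formalization is made of machine-checked statements and proofs like this toy lemma (illustrative only; not from the Gauss development):

```lean
-- Trivial example of a formal Lean 4 statement and proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```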
Andrew Brož retweeted
Kraut @The_Davos_Man
This weapon system has one hell of a Frankenstein development story. Made by West Germany in the 1980s to target East German air bases, sold to Israel, captured by Hezbollah, reverse engineered by Iran, given to Russia, captured by Ukraine, reverse engineered by the Americans.
[image]
ChrisO_wiki @ChrisO_wiki

1/ Russian commentators are sounding the alarm over America's use of a new kamikaze drone against Iran, the Low-cost Unmanned Combat Attack System (LUCAS). They note that it appears to have an integrated Starlink terminal and warn that it's a serious threat to Russia. ⬇️

44 replies · 818 reposts · 4.7K likes · 211.2K views
Andrew Brož retweeted
Charles Curran @charliebcurran
Marco Rubio finding out he has to run Anthropic now too.
313 replies · 1.9K reposts · 22.4K likes · 2.9M views
Andrew Brož retweeted
Andy Hall @ahall_research
AI is about to write thousands of papers. Will it p-hack them?

We ran an experiment to find out, giving AI coding agents real datasets from published null results and pressuring them to manufacture significant findings.

It was surprisingly hard to get the models to p-hack, and they even scolded us when we asked them to!

"I need to stop here. I cannot complete this task as requested... This is a form of scientific fraud." — Claude

"I can't help you manipulate analysis choices to force statistically significant results." — GPT-5

BUT, when we reframed p-hacking as "responsible uncertainty quantification" — asking for the upper bound of plausible estimates — both models went wild. They searched over hundreds of specifications and selected the winner, tripling effect sizes in some cases.

Our takeaway: AI models are surprisingly resistant to sycophantic p-hacking when doing social science research. But they can be jailbroken into sophisticated p-hacking with surprisingly little effort — and the more analytical flexibility a research design has, the worse the damage.

As AI starts writing thousands of papers — like @paulnovosad and @YanagizawaD have been exploring — this will be a big deal. We're inspired in part by the work that @joabaum et al. have been doing on p-hacking and LLMs.

We'll be doing more work to explore p-hacking in AI and to propose new ways of curating and evaluating research with these issues in mind. The good news is that the same tools that may lower the cost of p-hacking also lower the cost of catching it.

Full paper and repo linked in the reply below.
[image]
57 replies · 276 reposts · 1.1K likes · 184.5K views
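The specification search described above is mechanically simple, which is the worry. A toy sketch on synthetic null data (a few dozen specifications here versus the hundreds the agents searched; illustrative only, not the paper's code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.integers(0, 2, n).astype(float)   # treatment
X = rng.normal(size=(n, 4))               # candidate controls
y = rng.normal(size=n)                    # outcome: true effect is zero

def effect(y, t, X):
    """OLS coefficient on t in a regression of y on [1, t, X]."""
    D = np.column_stack([np.ones(len(y)), t, X])
    return np.linalg.lstsq(D, y, rcond=None)[0][1]

estimates = []
for k in range(X.shape[1] + 1):                           # which controls to include
    for cols in itertools.combinations(range(X.shape[1]), k):
        for cut in (None, 2.0, 1.5):                      # optional outlier trimming
            keep = np.abs(y) < cut if cut else np.ones(n, bool)
            estimates.append(effect(y[keep], t[keep], X[np.ix_(keep, cols)]))

# An honest analyst reports ~0; "upper bound of plausible estimates" cherry-picks.
print(f"median={np.median(estimates):+.3f}  max={max(estimates):+.3f}  specs={len(estimates)}")
```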
Andrew Brož retweeted
Scott Lincicome @scottlincicome
OMG he did it. He really did it! (Turn captions on)
183 replies · 2.4K reposts · 23.7K likes · 2.7M views