Benjamin Ou

465 posts

@AlephNuul

Benchmarking with games at @EpochAIResearch. Opinions are my own. Send me your weird and esoteric LLM benchmarks!

SF Bay Area · Joined September 2020
126 Following · 25 Followers
Benjamin Ou
Benjamin Ou@AlephNuul·
@cherylwoooo I had a dad-podcasts arc but fell off it as I realized I wasn't retaining much. But now that I'm working again, I can feel the pull; there's just not enough time to read books anymore ;-;
English
0
0
0
2
Benjamin Ou
Benjamin Ou@AlephNuul·
@peligrietzer @glyphikon "I skimmed the Wikipedia page for the incompleteness theorems and a couple chapters of GEB, so I now understand the nature of all things"
English
0
0
0
3
Glyph
Glyph@glyphikon·
In contrast to Gödel's own notoriously mystical Platonic views of mathematics, Gödel's own work essentially proved that mathematics isn't scientific. In other words, it can only ever serve as useful epistemic heuristics and can never ever be considered adequate enough to serve as the ontological basis of any kind of science. Unfortunately, even to this day, a lot of mathematicians, physicists, and virtually all neoclassical economists still have yet to get the memo about all this.
Quanta Magazine@QuantaMagazine

At age 25, Kurt Gödel proved there can never be a mathematical “theory of everything.” In this week’s Qualia column, @nattyover asks experts how his ideas changed the course of humanity’s unending search for truth. quantamagazine.org/what-do-godels…

English
46
9
80
11.2K
Herbie Bradley
Herbie Bradley@herbiebradley·
@MetacriticCap ECI is a linear scale that aggregates % accuracy benchmarks, right? And the slope is fairly constant. Ant ARR is just representative of having crossed a threshold of usefulness; it's pretty decoupled from whether or not AI R&D is automated
English
2
1
6
122
Herbie Bradley
Herbie Bradley@herbiebradley·
Some takes about RSI from discussions with many smart researchers & thinkers:
1. Many RSI (or automated AI R&D) debates converge to similar cruxes: is a 1000x sample efficiency improvement possible, can you just simulate reality and train on it with no sim2real gap, can we easily make models good at "fuzzy" tasks? People like to assume that automated research agents will find such breakthroughs specifically *because* without them, progress could be heavily bottlenecked on data or continued compute scale-ups.
2. The Yudkowsky "genius brain in a box" framing of ASI has latent influence on many researcher views even though people may not be aware of it. A common move is to "flip" predictions, as they go further out, from assuming LLM or deep learning-specific properties of future AI to assuming "von Neumann x1000", human brain-like properties. I'd like to see more thought-out reasoning of why this flip should occur at any particular point (eg pre or post automated AI R&D); this question is a crux behind many predictions like AI 2027.
3. There are some cracks in this worldview beginning to show: predictions from a few years ago that models would be less jagged now than they are, or that they would be more deceptive, synthetic data would work better, etc. Many of these seem like prediction errors from imagining future models as a "human brain in a box", but LLMs are empirically a different kind of intelligence. Most models of software-only intelligence explosion are also coarse enough to mostly ignore properties of LLMs.
4. Views about fast RSI progress seem to be correlated with (a) belief that synthetic data is all you need (b) belief in very high GDP growth and an industrial explosion because of automated firms (c) having worked only in AI research or in small organizations.
5. Key technical things to track over the next 1-2 years: does RL increase in its generalization, AI lab data spend, can we automate synthetic RL env construction, best practices for FDEs deploying AI into large enterprises, coherency of AI personas, how powerful will multi-agent scaling of test-time compute be, and continual learning.
6. Overall I think the "RSI leading to *fast* takeoff" frame had huge alpha in 2022, moderate in 2024, and potentially is of neutral usefulness in 2026 for predicting the future.
English
12
22
191
12K
Benjamin Ou
Benjamin Ou@AlephNuul·
Obviously there's a balance to be struck here, particularly if there's workplace pressure to output slop fast and skimp on learning/reviewing, but for people who are lucky enough to be able to take this approach, just ask the bots more questions! They don't get tired!
English
0
0
0
3
Benjamin Ou
Benjamin Ou@AlephNuul·
Coding agent output a big PR you don't wanna review? Ask it to walk you through the code chunk-by-chunk. Chatbot spat out a bunch of stuff about philosophers you've never heard of? Ask it for book recommendations and then go read the books.
English
1
0
0
9
Benjamin Ou
Benjamin Ou@AlephNuul·
There's a lot of bellyaching going on about AI reliance meaning you skip the dirty work where you build important intuitions in research, programming, etc. This is true, but it's also not that hard to just use the chatbots to also fix these issues on a personal level.
English
1
0
0
9
Benjamin Ou reposted
Cheryl Wu
Cheryl Wu@cherylwoooo·
Also want to point out that, maybe unexpectedly to many people, econ fields are super related to AI. Growth theory is used to understand AI progress. Labor is used to understand social impacts. Economic history is used to study historical parallels. Political economy is used to study the global AI arms race. … Wield your weapons!!!
Joel Becker@joel_bkr

new (spicy) post from me: "Economists, mobilize" economics ideas are extremely helpful for understanding AI, but academia is dropping the ball. now is the time for economists to work on the most important problems in AI and to loudly encourage colleagues to do the same.

English
4
6
65
13K
Benjamin Ou reposted
Joel Becker
Joel Becker@joel_bkr·
new (spicy) post from me: "Economists, mobilize" economics ideas are extremely helpful for understanding AI, but academia is dropping the ball. now is the time for economists to work on the most important problems in AI and to loudly encourage colleagues to do the same.
English
12
16
125
27.3K
Benjamin Ou
Benjamin Ou@AlephNuul·
@_sholtodouglas @patwoozey @trq212 I asked Claude Opus 4.7 to find the Newcomb's problem charts from its own system card and it failed to find them and insisted I must be mistaken. GPT-5.5 found them and gave me the section/page numbers easily. In general, Claude gives up quickly and puts little effort into searches
English
0
0
1
39
Sholto Douglas
Sholto Douglas@_sholtodouglas·
When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open. If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model
English
1.1K
79
1.4K
377.3K
Benjamin Ou
Benjamin Ou@AlephNuul·
@tracewoodgrains In a vacuum, I like it too! But aesthetics is socially situated, so I cannot help but associate the Grokka's AI art tells with the truly abominable pissfilter garbage crypto e/acc types will shove onto my feed. It's guilt-by-association and not rational, but so goes aesthetics.
English
0
0
4
78
Jack
Jack@tracewoodgrains·
@AlephNuul I like the quokka. it makes me laugh
English
2
0
10
210
Benjamin Ou
Benjamin Ou@AlephNuul·
@tracewoodgrains I'm not strongly against AI art in principle, but I do despise it if it looks bad, and if your lack of aesthetic taste means you're blasting us with bad AI art (such as this Quokka), then I'm going to think less of you for making the world a slightly uglier place.
English
1
0
7
219
Benjamin Ou reposted
T. Greer
T. Greer@Scholars_Stage·
On this question of whether we should or should not read "the classics" -- A decade back I taught some to high school students in China. It was a clarifying experience for me. As I put it:
English
10
40
423
25.6K
Benjamin Ou
Benjamin Ou@AlephNuul·
@hecubian_devil @JeremiahDJohns @zikakuto Which y'know, you could probably fairly argue your accounting is no worse than the typical liberal's. But from the perspective of Taiwanese people the gaps you're not willing to confront are gaping and make you come off as a propagandist.
English
0
0
23
161
Benjamin Ou
Benjamin Ou@AlephNuul·
@hecubian_devil @JeremiahDJohns @zikakuto I mean, you are doing your own sleight-of-hand here where none of China's human rights abuses within their own borders factor in, nor any of the fear Taiwanese face from China's "remarkable restraint" in the form of non-dreamworld airspace incursions and blockade exercises.
English
1
0
46
460
Greg Burnham
Greg Burnham@GregHBurnham·
@YafahEdelman @mentalgeorge Big if true, IMO! I think we see relatively little evidence of this. Structurally easier to hill-climb on coding tasks. But I’m not certain. We hope to investigate just this in some of the board games work we’ve been doing.
English
1
0
1
56
Tom Reed
Tom Reed@mentalgeorge·
Say you let Opus 4.7 play 1 billion games of online chess. Between games, it can reflect on its play and write .md files to itself for future match-ups. How much does its Elo change?
English
32
3
96
17.8K
Benjamin Ou
Benjamin Ou@AlephNuul·
@mathandcobb Plenty; "sigmoid curve" is often what's thrown around in forecasting an imminent plateau in capabilities. It's just had a rough record in the past few years of seemingly unstoppable exponential capabilities growth through sheer scaling and a couple algorithmic innovations.
English
0
0
0
388
Alvaro Lozano-Robledo
Alvaro Lozano-Robledo@mathandcobb·
It seems to me that some of the most pessimistic sentiments about the future of the math profession rely on some of the most optimistic predictions about the future of AI/LLMs, and in particular they rely on the exponential growth of the capabilities of the models. Aren't there predictions about when the capacity of models will plateau (at least in their current incarnations of "AI"), because of theoretical or practical reasons?
English
14
1
36
6.6K
Benjamin Ou
Benjamin Ou@AlephNuul·
@ToddBoogaloo @Afinetheorem You could try the same prompt in a private/incognito chat; most LLM chatbots should have that kind of option (often as a not super clearly labeled toggle towards the top right)
English
2
0
0
30
Garbage Snake 🇪🇹
Garbage Snake 🇪🇹@ToddBoogaloo·
@Afinetheorem Gemini, using the same prompt, gave me suggestions that were more leftwing (rail nationalization, universal childcare/single payer and vienna model housing) perhaps because I had spoken to it about these ideas before in previous chats?
English
1
0
3
147
Kevin A. Bryan
Kevin A. Bryan@Afinetheorem·
Interesting. Every single suggestion by Claude here is one I would agree is a good idea and impactful. But for the sake of epistemic humility (and to understand AI better): is this because technocratic econ-minded INTJ centrists have great ideas, or is there a problem? Steelman?
Arram@arram

Asked Claude: 'There's a meme called the "fix everything easily switch". What policies do you think are the best candidates for being a real fix everything switch in the US? Give me your top ten, your confidence, your reasoning, and why a given policy has not been implemented.'

English
5
1
16
5.3K
Benjamin Ou reposted
METR
METR@METR_Evals·
We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
English
69
247
2.1K
966.6K
Cassie Pritchard
Cassie Pritchard@hecubian_devil·
How come rich people don’t become patrons anymore? Rich guy in 1430 would be supporting like 16 master artists and all their studios. Never see that anymore. They should do that again, but also for posters, specifically (the real artists of the 21st century)
English
55
19
703
45.4K