4dimcube

@4dimcube
Professional circle packer. Does math sometimes.



Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games.

Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In Rock Paper Scissors, for example, minimax is 1/3 on each action.

Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off weak players as you could if you maximally exploited them.

But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent.

Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation.

What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0.
The equilibrium (specifically, subgame perfect equilibrium) strategy is for Alice to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful.

A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into the same problem as the Ultimatum Game: the equilibrium is untethered from what we as humans find useful. What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games.

I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.
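
The Rock Paper Scissors claims earlier in this post can be checked numerically. Here is a minimal sketch, assuming regret matching as the self-play learner (the post does not name an algorithm, so that choice and all function names are illustrative). Both players start with deliberately lopsided regrets, and their average strategies still drift toward 1/3 on each action. The `expected_payoff` helper also shows the exploitation gap: uniform play earns 0 against "always throw Rock", while always playing Paper earns 1 per game.

```python
import numpy as np

# Row player's payoff for (Rock, Paper, Scissors); the column player gets the negative.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])


def expected_payoff(p, q):
    """Expected payoff to the row player when row plays p and column plays q."""
    return float(p @ PAYOFF @ q)


def current_strategy(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(3, 1.0 / 3.0)


def self_play(iterations=50_000):
    """Run regret matching against itself and return both average strategies."""
    # Assumption: lopsided starting regrets, so neither player begins at uniform.
    regrets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        p = current_strategy(regrets[0])
        q = current_strategy(regrets[1])
        strategy_sums[0] += p
        strategy_sums[1] += q
        values_p = PAYOFF @ q      # value of each row action against q
        values_q = -(p @ PAYOFF)   # value of each column action against p
        regrets[0] += values_p - p @ values_p
        regrets[1] += values_q - q @ values_q
    return [s / iterations for s in strategy_sums]


if __name__ == "__main__":
    row_avg, col_avg = self_play()
    print("average strategies:", row_avg, col_avg)
    rock = np.array([1.0, 0.0, 0.0])
    paper = np.array([0.0, 1.0, 0.0])
    print("uniform vs always-Rock:", expected_payoff(np.full(3, 1 / 3), rock))
    print("Paper vs always-Rock:", expected_payoff(paper, rock))
```

It is the average strategy, not the last iterate, that approaches minimax here. The per-iteration strategies of regret matching can keep cycling.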








In our recent NeurIPS paper we had to show the following cute inequality:

For a real world application, ask all your friends to think of a number. Divide each number by the sum, and you'll get in expectation at least 1/n. This holds even if you give your friends weights.

Seems simple enough. In fact, in the case where all the weights are equal, you get equality to 1/n by symmetry. However, the weighted case is harder to prove. Using Cauchy-Schwarz, E[X²/Y] ≥ E[|X|]² / E[Y], we get this awkward bound:

Instead you need to apply Jensen a bunch of times. The original lemma was a lot of fun to prove. I recommend you try it! Also, for a cool alternative proof for Gaussians specifically, see River Li's answer here: math.stackexchange.com/a/4544808/7072

What did we use the lemma for? Well, in the paper we had a least squares problem with a residual vector, `t`. We basically wanted to reduce the error ‖Xt‖₂ by taking a step of optimal length in a random direction. Formally:

Here the Gaussian vector `g` represents the random direction and `m` is the step size. The value ρ is the smallest squared singular value of X, normalized by the sum of all of them: σₙ²/(σ₁² + ... + σₙ²). You can probably see how the original inequality comes in useful.

We can check that for X approximately isotropic we get ρ ≈ 1/d, so the lemma says it takes ≈ d steps to reduce ‖Xt‖₂ to 0. This matches what we'd get from optimizing each orthogonal direction of the space one by one. However, if `t` "hides" in a direction where X's singular values are small, we are less likely to get a useful reduction in ‖Xt‖₂ unless we sample `g` in a smarter way.

If you are interested in the full, memory efficient least squares algorithm, you have to read the paper. Or visit our poster in New Orleans, December 12th :-)
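
A few of the statements in this post are easy to sanity-check numerically. The sketch below is my own reading of them, not the paper's code, and every function name is made up. It assumes the "numbers" are i.i.d. squared Gaussians (so they are positive), and that the step is `t - m*g` with `m` chosen by exact line search. It covers only the equal-weight case, the Cauchy-Schwarz bound as written above, and one random-direction step. It does not attempt the weighted lemma.

```python
import numpy as np


def mean_shares(n, samples=200_000, seed=0):
    """Monte Carlo estimate of E[x_i / (x_1 + ... + x_n)] for i.i.d. positive x."""
    rng = np.random.default_rng(seed)
    # Assumption: each friend's number is a squared standard Gaussian.
    x = rng.standard_normal((samples, n)) ** 2
    shares = x / x.sum(axis=1, keepdims=True)
    return shares.mean(axis=0)


def cauchy_schwarz_gap(x, y):
    """E[x^2 / y] - E[|x|]^2 / E[y] on a sample; nonnegative whenever y > 0."""
    return float(np.mean(x ** 2 / y) - np.mean(np.abs(x)) ** 2 / np.mean(y))


def random_direction_step(X, t, rng):
    """Move t along a Gaussian direction g by the step m that minimizes ||X (t - m g)||_2."""
    g = rng.standard_normal(t.shape)
    Xg = X @ g
    m = ((X @ t) @ Xg) / (Xg @ Xg)  # exact line search along g
    return t - m * g


if __name__ == "__main__":
    print("equal-weight shares (each should be near 1/5):", mean_shares(5))
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 8))
    t = rng.standard_normal(8)
    for step in range(5):
        print(step, np.linalg.norm(X @ t))
        t = random_direction_step(X, t, rng)
```

Because `m` is the exact minimizer along `g`, a step can never increase ‖Xt‖₂. How much it decreases on average is what the lemma is about.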



This is exactly our view in AI Snake Oil. The bottleneck to human intelligence is not biology, but the difficulty of observing and manipulating our physical and social environment to acquire knowledge. AI faces those same limitations, much more severely than people do, because intelligence in the sense of being able to act in the real world can only come through deployment.

Adoption puts a speed limit on innovation. Contrary to the myth of innovation preceding adoption, the two happen in a feedback loop. The most valuable settings (whether self-driving cars or finance or medicine) are heavily regulated, which means that this adoption-innovation feedback loop will be extremely gradual, which we've already seen in the case of self-driving.

Superintelligence as it's discussed today is a deeply confused concept. You cannot compute your way to superintelligence. In the view of intelligence as knowledge + real-world capability, if we build AI that's "more intelligent" than people, as long as humans remain in control, that AI will simply augment human intelligence, as has been the case throughout the history of computing. Actual superintelligence presumes a loss of control. So the claim that if we build superintelligent AI then it will escape our control is nonsensical, because it presumes its conclusion.





so asml makes 25 machines a year at like $300M each, every good fab is completely dependent on them, and you're telling me no one in the startup world is crazy enough to try to compete? this is one of the main ai supply chain bottlenecks




Video generation systems will get better with time, no doubt. But learning systems that actually understand physics will not be generative. All birds and mammals understand physics better than any video generation system. Yet none of them can generate detailed videos.













