4dimcube

280 posts

@4dimcube

Professional circle packer. Does math sometimes.

Joined July 2018

644 Following · 222 Followers

4dimcube@4dimcube·Oct 22

@jxmnop Reminds me of Kingma's dissertation. Develops deep insight, drops VAEs and Adam in a blitz of fantastic papers, goes on his merry way.

150

Jack Morris@jxmnop·Oct 21

FlashAttention is probably the ultimate "AI PhD" contribution:
> spend years studying
> understand things better than everybody else
> rewrite low-level code in more intelligent way
> make transformers 2-4x faster
> essentially a free lunch
> now it runs on all of our computers

1.6K

110K

4dimcube@4dimcube·Oct 22

@VictorTaelin To be clear I agree wholeheartedly with your initial point, I just see scaling challenges that strongly decrease the odds of an "intelligence explosion" relative to something more like "optimal but still incremental/polynomial progress on heuristics"

4dimcube@4dimcube·Oct 22

@VictorTaelin I think this requires some deeper consideration that I haven't yet done, but the way that you describe utility of functions here sounds either uncomputable (via Kolmogorov Complexity) or at least NP-Hard (via reduction to something like generalized circuit-SAT).
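On the computability worry above: Kolmogorov complexity itself is uncomputable, but any real compressor gives a computable upper bound on it. A minimal sketch, assuming zlib as a stand-in compressor; `compressed_len` is a hypothetical helper, and this is an illustration rather than the scoring scheme discussed in the thread:

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed input.

    This is only a computable upper-bound proxy for Kolmogorov
    complexity (up to an additive constant), not the true value.
    """
    return len(zlib.compress(data, 9))


# Highly regular input: a two-byte pattern repeated 5000 times.
regular = b"ab" * 5000

# Pseudo-random input of the same length (seeded for reproducibility).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(len(regular)))

regular_len = compressed_len(regular)
noisy_len = compressed_len(noisy)

# The regular input compresses far better than the pseudo-random one.
print(regular_len, noisy_len)
```

The proxy is cheap to compute, but it inherits the blind spots of the chosen compressor, so it can only ever overestimate the true complexity.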

Taelin@VictorTaelin·Oct 22

> A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum game where the equilibrium is untethered from what we as humans find useful.
This kind of idea is terribly unclever, as it is clearly going to converge to the teacher just proposing hash functions and things like that. This makes me unreasonably upset, because people seeking the "theorem proving RL" route ALWAYS end up doing something similarly stupid, which makes the idea seem bad, when it is not.
Ultimately, what we want is a system capable of autonomously evolving ever more *useful abstractions*, in the sense that "dot product" is more useful than "product of pairwise atan2" as a `ℝ[] → ℝ` function. It isn't hard to see how such a system would, given enough scale, result in an unprecedented intelligence explosion.
The only obvious problem is: how do you measure how good or useful an abstraction is? Why is the dot product such a globally useful concept, while pairwise atan2 is just an esoteric function with no relevance? How can a computer assign a meaningful "usefulness score" to a bunch of symbols? What makes some formulas more special than others?
Turns out there IS a way to measure that, and I think that's what most ignore: compression. A function is an interesting abstraction when it can be used to express other interesting abstractions more succinctly. This is clearly the case here: by using dot products, we can simplify many other important mathematical formulas, while pairwise atan2 is not capable of doing that.
But this argument seems circular - what are the "other interesting abstractions"? If we just implement such a system in a pure "self play" fashion, what is preventing it from, say, just generating a bunch of esoteric functions, and then finding abstractions that compress these functions, in an ever-growing pile of poo?
If we allow ourselves to draw from human knowledge, we could, perhaps, train a system that compresses Lean's mathlib, and this would work, and we could discover some new interesting definitions that way. Yet, if such a system depends on human data to work, then it will obviously not extrapolate beyond human science. If it could, it wouldn't depend on human data to begin with. That is the very reason LLMs aren't inventing new science, after all.
So, a more interesting question is: how do we go from *zero* to *something*? For example, imagine a pure algorithm that, with no human intervention, eventually figured out that polynomials, complex numbers, matrices, rings, monoids, are super interesting concepts, worthy of keeping in its memory? IMO, that is the question everyone should be asking.
The "real world" isn't special, and we don't need it. There ARE ways to achieve ever-growing intelligence in a "lone self play" fashion, but making two LLMs play teacher and student theorem proving games against each other is not one of them... 😐

Noam Brown@polynoamial

Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games. Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action. Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them. But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent. Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation. What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0. 
The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful. A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum game where the equilibrium is untethered from what we as humans find useful. What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games. I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.
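The Rock Paper Scissors point above can be checked directly. A minimal sketch, assuming the usual +1/0/-1 payoffs (the payoff matrix and the helper `expected_payoff` are illustrative, not from the thread): the uniform strategy never loses in expectation, yet it earns nothing against an "always Rock" opponent that a best response would exploit.

```python
from fractions import Fraction

# Row player's payoff: +1 for a win, -1 for a loss, 0 for a tie.
# Action order for both players: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper    vs rock, paper, scissors
    [-1, 1, 0],   # scissors vs rock, paper, scissors
]


def expected_payoff(p, q):
    """Expected payoff to the row player when row plays mixed strategy p
    and column plays mixed strategy q."""
    return sum(p[i] * q[j] * PAYOFF[i][j] for i in range(3) for j in range(3))


uniform = [Fraction(1, 3)] * 3
always_rock = [Fraction(1), Fraction(0), Fraction(0)]
always_paper = [Fraction(0), Fraction(1), Fraction(0)]

# Minimax (uniform) play breaks even against "always Rock" ...
uniform_vs_rock = expected_payoff(uniform, always_rock)

# ... while the exploiting best response (always Paper) wins every round.
paper_vs_rock = expected_payoff(always_paper, always_rock)

print(uniform_vs_rock, paper_vs_rock)
```

Exact fractions are used instead of floats so the break-even value comes out as exactly zero.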

337

40.4K

4dimcube@4dimcube·Aug 17

@staysaasy This, but we hit stage 4 by Q3 2026

staysaasy@staysaasy·Aug 16

> 2025: models plateau
> 2026: companies stop paying for multiple foundational models
> 2026: some company does big article about how they moved to open source model and saved tons of money without losing efficacy
> 2027: blood in the streets

1.6K

158.5K

4dimcube@4dimcube·Jul 6

@VictorTaelin I lost vision in my right eye from the chiari complications and then again after the meningitis, but thankfully it came back (to my severely myopic status quo) after the intracranial pressure subsided both times. Sorry to hear that, can definitely relate to human body being shit.

Taelin@VictorTaelin·Jul 6

so 7% of my right eye vision is now gone and I'm honestly done guys, I seriously don't care about having a normal life anymore, it is just impossible to enjoy anything at all, when you can't have a single month of peace, without your body breaking in some different way
I'll move to the US, I'll meet people, I'll convince some AGI lab to buy HOC and have our team join them, and if nobody does, I will buy-back each single investor myself.
then I'll join a place where I can put all my skills, time and soul towards the only cause I care about, which is helping build the systems that will solve all this bullshit for me and everyone else

108

1.1K

139.9K

4dimcube@4dimcube·Jun 10

@VictorTaelin The meds required to keep you alive during an acute meningitis presentation (e.g. steroids to drop intracranial pressure, narcotics to get you to stop screaming) also tend to make you feel *awful* afterwards, at least until you can get back to an even keel.

4dimcube@4dimcube·Jun 10

@VictorTaelin N of 1 here, but I can tell you that it tends to get dramatically better as the primary symptoms subside. Had meningitis as a complication of brain surgery; almost died but didn't. Swelling and inflammation in the brain or spine feel fucking awful but are thankfully temporary.

117

Taelin@VictorTaelin·Jun 9

so, about my situation... most studies that you find will report that, with proper treatment, there are 70%-90% odds of "full recovery". yet, knowing what I know about viruses, that sounded... suspicious. how is that even defined, or measured? for example, a perfect MoCA score doesn't imply you're 100% - it is basically a "yeah you're not in coma" certificate. digging deeper, I found a 2024 paper, named: "Long-term sequelae after viral meningitis and meningoencephalitis are frequent, even in mildly affected patients, a prospective observational study" which actually *asked* the patients how they felt - and that confirmed what was intuitively obvious to me: 2/3 (!) of viral meningitis patients report sequelae 2 years (!) after infection, although "only" 1/3 report it affects their work and QoL. so, the problem is actually much more severe than papers would let you think, and it certainly matches my intuition
it has been 2 weeks and I still feel some pain, sleepiness and mental fatigue, and, knowing my own body, I anticipate this will be a very tough recovery ):
I don't understand why I'm so unlucky. I'm sad, pessimistic and probably shouldn't be posting. yet, somehow, this fills me with delusional determination to act and do something about it, for me and everyone else going through this and worse

513

36.2K

4dimcube@4dimcube·Mar 25

@cloneofsimo @FAL I don't think this is correct; the original is about comparing an expectation to a single known value (the weight not the sample). Not seeing how that follows from here but maybe that's on me 😅

394

Simo Ryu@cloneofsimo·Mar 24

Asked our recent hire @FAL and he took a deep look and solved it in 20 min with insanely clean proof.

Thomas Ahle@thomasahle

In our recent NeurIPS paper we had to show the following cute inequality:
For a real world application, ask all your friends to think of a number. Divide each number by the sum, and you'll get in expectation at least 1/n. This holds even if you give your friends weights.
Seems simple enough. In fact, in the case where all the weights are equal, you get equality to 1/n by symmetry. However, the weighted case is harder to prove.
Using Cauchy-Schwarz, E[X²/Y] ≥ E[|X|]² / E[Y], we get this awkward bound:
Instead you need to apply Jensen a bunch of times.
The original lemma was a lot of fun to prove. I recommend you try it! Also, for a cool alternative proof for Gaussians specifically, see River Li's answer here: math.stackexchange.com/a/4544808/7072
What did we use the lemma for? Well, in the paper we had a least squares problem with a residual vector, `t`. We basically wanted to reduce the error ‖Xt‖₂ by taking a step of optimal length in a random direction. Formally:
Here the gaussian vector `g` represents the random direction and `m` is the step size. The value ρ is the smallest singular value, normalized: σₙ²/(σ₁² + ... + σₙ²). You can probably see how the original inequality may come in useful.
We can check that for X approximately isotropic we get ρ ≈ 1/d, so the lemma says it takes ≈ d steps to reduce ‖Xt‖₂ to 0. This matches what we'd get from optimizing each orthogonal direction of the space one by one. However, if `t` "hides" in a direction where X's singular values are small, we are less likely to get a useful reduction in ‖Xt‖₂ unless we sample `g` in a smarter way.
If you are interested in the full, memory efficient least squares algorithm, you have to read the paper. Or visit our poster in New Orleans, December 12th :-)
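The Cauchy-Schwarz form quoted in the thread above follows in one line. This sketch assumes Y > 0 almost surely and that the expectations are finite (assumptions added here, not stated in the thread); it does not reproduce the weighted lemma itself, whose statement is not included in the text:

```latex
\mathbb{E}\,|X|
  = \mathbb{E}\!\left[\frac{|X|}{\sqrt{Y}}\cdot\sqrt{Y}\right]
  \le \sqrt{\mathbb{E}\!\left[\frac{X^{2}}{Y}\right]}\;\sqrt{\mathbb{E}[Y]}
\quad\Longrightarrow\quad
\mathbb{E}\!\left[\frac{X^{2}}{Y}\right]
  \ge \frac{\mathbb{E}[|X|]^{2}}{\mathbb{E}[Y]}.
```

The middle step is Cauchy-Schwarz applied to the pair |X|/√Y and √Y; squaring both sides and dividing by E[Y] gives the stated bound.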

499

156.3K

4dimcube@4dimcube·Feb 20

@getjonwithit Or in other words, the answer to your question is "yes"

4dimcube@4dimcube·Feb 20

@getjonwithit The emperor has been standing naked in our sprinklers at 3:32am every third Tuesday of the month for at least a few years now. (LLMs really suck at extrapolation because it's underspecified. A decent rule of thumb is "can the output I want be interpolated from training data?")

Jonathan Gorard@getjonwithit·Feb 19

Today I decided to try using o1 to assist with some math/CS research. Here's how it went. For context: I'm currently developing an automated theorem-proving framework in Scheme, and attempting to produce formal proofs of correctness for some advanced numerical algorithms. (1/10)

109

1.3K

584.5K

4dimcube@4dimcube·Oct 3

@MilesCranmer It's interesting, for sure. Some counterexamples too, in Go and Starcraft iirc. I'm skeptical of the "foom" sort of superintelligence for more basic computational reasons, but it's interesting to consider consequences.

Miles Cranmer@MilesCranmer·Oct 3

As a counterpoint and food for thought—in chess, a human + chess engine is often *worse* than the engine alone! Worth considering in the context of superintelligence

Arvind Narayanan@random_walker

This is exactly our view in AI Snake Oil. The bottleneck to human intelligence is not biology, but the difficulty of observing and manipulating our physical and social environment to acquire knowledge. AI faces those same limitations — much more severely than people do, because intelligence in the sense of being able to act in the real world can only come through deployment. Adoption puts a speed limit on innovation. Contrary to the myth of innovation preceding adoption, the two happen in a feedback loop. The most valuable settings (whether self-driving cars or finance or medicine) are heavily regulated, which means that this adoption-innovation feedback loop will be extremely gradual, which we've already seen in the case of self-driving. Superintelligence as it's discussed today is a deeply confused concept. You cannot compute your way to superintelligence. In the view of intelligence as knowledge + real-world capability, if we build AI that's "more intelligent" than people, as long as humans remain in control, that AI tool will simply augment human intelligence, as has been the case throughout the history of computing. Actual superintelligence presumes a loss of control. So the claim that if we build superintelligent AI then it will escape our control is nonsensical, because it presumes its conclusion.

2.4K

4dimcube@4dimcube·Aug 29

@tesavova @cremieuxrecueil In this case that seems unlikely; fake data to support weak (or more likely entirely fabricated) papers is far more probable.

tesavova@tesavova·Aug 28

@cremieuxrecueil Conceivably there are incentives to overstate the provenance of your equipment on status/legal grounds? Like if you could jailbreak your android to make the bubbles display blue on receipt, is there any doubt the top offender graph would look likewise?

1.5K

Crémieux@cremieuxrecueil·Aug 28

This is an image from a scanning electron microscope. The banner along the bottom is the image metadata. That banner might be useful for detecting fraudulent research 🧵

691

206.6K

4dimcube@4dimcube·Aug 28

@jxmnop Other people have already said various smart things so this is just another reminder: 🥒=🙃

Jack Morris@jxmnop·Aug 27

i have a dictionary of ~350M key-value integers that i want to save to disk
- written as ints to lines of a text file: 3.2 GB
- stored as torch.Tensors in a collections.defaultdict, saved to disk using pickle.dump: 937 GB
lesson learned
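One compact alternative to pickling one object per entry is to store the keys and values as two flat arrays of fixed-width integers. A minimal sketch using only the standard library, with a small made-up `mapping` standing in for the real dictionary; it assumes every key and value fits in a signed 64-bit integer, which the post does not state:

```python
from array import array

# Hypothetical small stand-in for the ~350M-entry dictionary in the post.
mapping = {i: i * i for i in range(1000)}

# Two parallel arrays of signed 64-bit integers (type code "q"):
# a fixed number of bytes per key and per value, no per-object overhead.
keys = array("q", mapping.keys())
values = array("q", mapping.values())
blob = keys.tobytes() + values.tobytes()

# Round trip: split the buffer in half and rebuild the dictionary.
half = len(blob) // 2
keys_back = array("q")
keys_back.frombytes(blob[:half])
values_back = array("q")
values_back.frombytes(blob[half:])
restored = dict(zip(keys_back, values_back))

bytes_per_entry = len(blob) // len(mapping)
print(bytes_per_entry)
```

With 8-byte integers this costs 16 bytes per entry, so roughly 5.6 GB for 350M entries, which is the same order of magnitude as the text file rather than the pickled tensors.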

123

271.6K

4dimcube@4dimcube·Aug 6

@quantian1 @JoshMMcClure @wetboyslim Yeah, it's all about quantum effects related to the available electron orbitals. More charge in the nucleus means it pulls harder on the electrons, but more electrons means they repel each other (due to charge and degeneracy pressure). Noble gases have full shells so they're big.

Quantіan@quantian1·Aug 6

@JoshMMcClure @4dimcube @wetboyslim No. Gravity has no effects at the atomic scale, it’s orders of magnitude too weak

156

Quantіan@quantian1·Aug 4

Ok so for step 1 of your master plan to compete with ASML, you need to hire 10,000 autistic German lens crafters to create a mirror so perfect any defects are smaller than the diameter of a single helium atom. Then, you can advance to step 2 of 8,327.

bayes@bayeslord

so asml makes 25 machines a year at like $300M each, every good fab is completely dependent on them, and you're telling me no one in the startup world is crazy enough to try to compete? this is one of the main ai supply chain bottlenecks

124

652

13K

1.4M

4dimcube@4dimcube·Aug 6

@JoshMMcClure @quantian1 @wetboyslim Atomic radius isn't about mass, it's about configuration! Helium has a smaller radius than hydrogen, but both of them are larger than average. Most transition metals have similar or smaller radii.

Josh McClure@JoshMMcClure·Aug 5

@quantian1 @wetboyslim It's a period 1 atom. Only Hydrogen with 1 proton and 1 electron is smaller. Helium has 1 proton, 1 neutron and 1 electron. See it in the upper right hand corner of the periodic table of elements under "He"

2.2K

4dimcube@4dimcube·Jul 1

To be clear: I think you can build effective models that do not explicitly express a posterior over the input space, but I contend that internally the model must capture this information if it is complete. It *is* a generative model, just without a decoder in the input space.

173

4dimcube@4dimcube·Jul 1

I agree with @ylecun on a lot of things but this seems off: by the data processing inequality, if a system "understands" (captures) an input then it *must* contain the information required to generate that input. Perhaps the point is that you don't need a generative architecture?
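For reference, the data processing inequality invoked here, in a standard form. Reading X as the input, Z as the system's internal representation, and X̂ as any reconstruction computed from Z alone is an interpretation added for illustration, not notation from the thread:

```latex
X \;\to\; Z \;\to\; \hat{X} \quad \text{(Markov chain)}
\qquad\Longrightarrow\qquad
I(X;\hat{X}) \;\le\; I(X;Z).
```

In words: whatever can be recovered about the input from the representation is bounded by what the representation retains about it.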

Yann LeCun@ylecun

Video generation systems will get better with time, no doubt. But learning systems that actually understand physics will not be generative. All birds and mammals understand physics better than any video generation system. Yet none of them can generate detailed videos.

339

4dimcube@4dimcube·Jul 1

@ylecun @balazskegl This strikes me as an interesting hypothesis. I'm led to wonder if the experience of dreaming is even uniform across people - for example, might those with aphantasia have a radically different experience?

Yann LeCun@ylecun·Jul 1

@balazskegl It's abstract representations, but you can't possibly tell the difference from your memory of it.

8.4K

Balázs Kégl@balazskegl·Jul 1

When you dream, do you see images or just imagine abstract representations?

9.6K

4dimcube@4dimcube·Jun 30

@soncharm Basically everything that the EPA does as well as a lot of what the FDA does, etc. EPA already struggles to keep standards for forever chemicals like Teflon up to date without needing to go through the courts. It may become way easier to get away with dumping dangerous waste.

sonch@soncharm·Jun 29

People crying about 'Chevron' getting overturned: what's your favorite Regulation that Congress passed a law about and an Agency interpreted the law by making up some details and those details are great and you're afraid they'll go away? Make this concern tangible for me

504

664

11.6K

1.5M

4dimcube@4dimcube·Jun 28

@quantymacro In fact, I'd argue that it doesn't really matter if you use boosting or bagging if your individual estimators are poorly constrained. With good constraints (e.g. on complexity) we had similar results for the "macro" models I was working on, with bagging winning by a small margin.

4dimcube@4dimcube·Jun 28

@quantymacro Used a lot of ensemble methods including random forests. Comments about slowness are valid; we actually reinterpreted our trees as tensor networks and JIT compiled them after training to get them fast enough. Even with ensembles, constraints on the individual estimators matter.

246

qm@quantymacro·Jun 27

question for ML people: In his ESL review, Max Dama said that Random Forest is not rly used in practice - ppl use XGB/LGBM. why is that the case? OTOH De Prado said that bagging addresses overfitting, boosting addresses underfitting, & in finance overfitting is a bigger issue

160

35.5K
