Toviah Moldwin

3.3K posts


@TMoldwin

Computational neuroscientist @ELSCbrain @Segev_Lab. Singer and guitarist for the rock band @SynfireChain. Dualist. Founder, https://t.co/YMBvt487lT.

Joined April 2019
894 Following · 851 Followers
Pinned Tweet
Toviah Moldwin@TMoldwin·
Lots of people (myself included) have trouble understanding why transformer architectures directly add token embeddings to position embeddings. It's weird to just directly add the representation of the content (the token embeddings) to the representation of the content's position within its context. (Note: there are techniques like RoPE that try to get around this; here I'll deal with the more basic strategy of addition.)

Together with Raneem Mahajne, I trained a small transformer on a simple next-token prediction task. We gave the transformer a sequence of random digits, occasionally interspersed with a + sign. Whenever the + sign appeared, the next number had to be the same as the *most recent even number*. For example, given the sequence 3 1 2 7 5 +, the next number would have to be 2. Because we used an embedding dimension of 2 for both the positions and the tokens, we can directly visualize the token embeddings, the position embeddings, and their sum for every possible token and position combination.

In the token embedding space, the transformer learned to separate the even numbers from the odd numbers, and it also learned to separate the + sign from everything else. Interestingly, the model also decided to smush all the odd numbers together while maintaining some space between each even digit, presumably because once a + sign appears, it becomes necessary to predict a specific even number. In the position embedding space, the transformer learned to correctly order the positions along a curve. Position ordering is important for this task: to know what the "most recent" even number is, you need a sense of ordering.

When the token and position information are summed, the structure of *both* the token embedding space and the position embedding space is preserved. The even numbers remain separated from each other and from the odd numbers, the odd numbers remain smushed together, and the "+" token occupies its own area of space. But now, for each token, we see a local structure based on the curve of ordered positions. Part of the reason this occurs is that the magnitudes of the token embeddings are ~10x larger than the magnitudes of the position embeddings, allowing the 'macro' structure to be dominated by the tokens while the 'micro' structure is determined by the positions. In other words, summing the position information and the token information doesn't just mix things haphazardly; it retains the geometric structure of both by using a different scale to encode 'what' and 'where'.
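A minimal numpy sketch of the scale-separation point. These are synthetic embeddings, not the learned ones: the 11 tokens (digits plus '+'), 8 positions, embedding dimension 2, and the ~10x norm ratio come from the setup described above, while the specific geometry (tokens on a circle, positions on a curve) is an illustrative assumption.

```python
import numpy as np

# Hypothetical embeddings: 11 tokens on a circle of radius 10 ("macro" scale),
# 8 positions on a unit-norm curve ("micro" scale, ~10x smaller).
n_tok, n_pos = 11, 8
angles_t = 2 * np.pi * np.arange(n_tok) / n_tok
tok = 10.0 * np.stack([np.cos(angles_t), np.sin(angles_t)], axis=1)

angles_p = np.linspace(0, np.pi / 2, n_pos)
pos = np.stack([np.cos(angles_p), np.sin(angles_p)], axis=1)

# Summed representation for every (token, position) pair, as in the model.
summed = tok[:, None, :] + pos[None, :, :]

# Macro structure preserved: each summed vector is still nearest its own
# token embedding, because position offsets are small relative to the
# spacing between tokens.
for t in range(n_tok):
    for p in range(n_pos):
        assert np.linalg.norm(tok - summed[t, p], axis=1).argmin() == t

# Micro structure preserved: within each token's cluster, the offsets are
# exactly the ordered position curve.
for t in range(n_tok):
    assert np.allclose(summed[t] - tok[t], pos)
```

The asserts only go through because the token scale dominates the position scale; shrink the factor of 10 and the clusters start to overlap, which is the geometric picture the trained model appears to have found.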
[image: the token embedding space, the position embedding space, and their sum]
Toviah Moldwin@TMoldwin·
@bryan_caplan Which 3 years? The current war with Iran is only tangentially related to Hamas. Iran is the bigger player here, and has been a major geopolitical concern since it started trying to acquire nukes decades ago. Hamas was just one component of the Iranian 'ring of fire'.
Bryan Caplan@bryan_caplan·
Standard estimates say there are 20-30k Hamas members. Yet this tiny, impoverished, militarily feeble group has indirectly dominated not just headlines but world history for almost 3 years. What a weird world.
Toviah Moldwin@TMoldwin·
@LocasaleLab Number of panels per figure is also not usually a major barrier in reviewing.
Toviah Moldwin@TMoldwin·
@LocasaleLab Nah. This is an atypically dense figure. It often does take a lot of data to tell a complete story; it's rare that a panel is completely unnecessary. Overly dense figures are usually the result of max figure count constraints.
Jason Locasale@LocasaleLab·
This reflects how science has been packaged and evaluated over the past two decades. In the 2000s figure preparation software such as PowerPoint and Illustrator became straightforward to use. By the early 2010s, high-impact journals came to associate dense, elaborate figures (i.e. the exhibits of a scientific study) with rigor and depth. The implicit assumption was that more panels and more data reflected more thorough and careful work.

At the same time, the editorial decision on whether to proceed to peer review was made by individuals not deeply embedded in the specific science, relying more on visual presentation and the perceived completeness of the data. The aesthetics of the figure panels became a proxy for scientific thoroughness. In response, scientists adapted. Figures became more complex, more densely populated, and more expansive in scope. This gave the appearance of rigor independent of whether the additional data materially clarified the central questions of the study.

Reviewers were tasked with evaluating these large and complex datasets under significant time constraints, typically within a few days and without compensation, while managing substantial professional responsibilities. Under these conditions, it is impossible to systematically interrogate every component of a multi-panel figure. There is also a reluctance to question whether key elements of a study are missing if there is a possibility they are included somewhere within the large amount of presented data.

The result is a publication system in which the presentation of large volumes of data in complex figure formats can facilitate publication in high-profile journals, often with limited connection to the underlying clarity, coherence, or quality of the science itself.
Banana Oncology@Banana_Oncology

Ok this figure is pretty intimidating...

Toviah Moldwin@TMoldwin·
@AustinA_Way Hasn't everyone known that this has been debunked for decades? I thought this was common knowledge.
Toviah Moldwin retweeted
nature@Nature·
Nature research paper: Climbing fibres recruit disinhibition to enhance Purkinje cell calcium signals go.nature.com/4bsCMw9
Toviah Moldwin@TMoldwin·
@wendlerch Of course? If your intermediate layers are linear you're basically just doing linear regression, no need for backprop. The whole point of backpropagation is to handle the nonlinearities.
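A minimal numpy sketch of the point above: composing linear layers collapses to a single linear map, so a "deep" network with linear intermediate layers has no more expressive power than linear regression (the layer shapes here are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))   # "hidden layer" weights
W2 = rng.normal(size=(2, 5))   # "output layer" weights
X = rng.normal(size=(100, 3))  # a batch of inputs

# Two linear layers applied in sequence...
deep = X @ W1.T @ W2.T
# ...equal one precomposed linear map: the hidden layer adds nothing,
# so the whole model could be fit directly by linear regression.
shallow = X @ (W2 @ W1).T
assert np.allclose(deep, shallow)
```

Insert a nonlinearity between the two layers and the collapse no longer holds, which is where backpropagation earns its keep.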
Chris Wendler@wendlerch·
w8 a second. the backprop paper already had a great example for nonlinear features 🤯 easy to forget in modern days where you can get very very far with linearity assumptions... lecture link below 👇
Toviah Moldwin retweeted
Rafi DeMogge רפי דמוג
1./ Common fallacies in IR thought, or: how to avoid geopolitical gibberish. Some thought patterns became so prevalent in geopolitics commentary that they bled into ordinary discourse. Yet these ideas are manifestly and obviously absurd, and they should seem absurd to anyone who wasn’t indoctrinated with geopolitics gibberish. In this thread, I’ll list a few common fallacies.
Toviah Moldwin@TMoldwin·
@Qivshi1 Probably way more than that is required for a push over the edge. The brain couldn't really function very well if it took so little to push it over the edge into a different state.
Qivshi@Qivshi1·
how many levels of brain criticality are you on? One neuron can change the global state? (baby level) One synapse? One AMPA receptor? One glutamate molecule? One H+? One microvolt?
Toviah Moldwin@TMoldwin·
@DBashIdeas Someone translated חשוב in their head to 'significant' and that's still not the right word to use here.
David/Dovid Bashevkin@DBashIdeas·
“Do you really think a dining room without sefarim on the table is nicer than a dining room with seforim on the table?”
Brandon Luu, MD@BrandonLuuMD·
Students who took notes by hand scored ~28% higher on conceptual questions than laptop note-takers. Writing forces your brain to process and compress ideas instead of copying them.
Toviah Moldwin retweeted
Oliver Sieberling@osieberling·
A very interesting observation on backpropagation is that no matter how nonlinear the forward pass, once the forward pass is fixed, backpropagation itself is purely linear. This allows for all kinds of gradient analysis. For example, one can decompose the backward pass by the depth of the backpropagated signal.

Each forward and backward pass can be viewed as involving 2^{2L} different paths, where L is the number of blocks (2L to account for Attn/MLP subblocks). Because the forward pass is nonlinear, we can't just compute each of the paths separately to decompose the forward pass. However, for the backward pass we can.

Of course, 2^{2L} is computationally intractable as we would need to do 2^{2L} separate backward passes (>1e19 for a 32-block transformer). But if we are only interested in the depth of gradients, we can use simple dynamic programming to decompose the backward signal by depth. Let x_l be the residual stream at depth l. Instead of just maintaining dL/dx_l as in normal backpropagation, we maintain (dL/dx_l)^k for each gradient depth k, i.e., a table consisting of 2L+1 gradients. Because backpropagation is linear, we have dL/dx_l = \sum_{k=0}^{2L} (dL/dx_l)^k. Then, after backpropagating each subblock, we can update our table of gradients with (dL/dx_l)^k = (dL/dx_{l+1})^k + (Jacobian_{subblock_l}(x_l))^T (dL/dx_{l+1})^{k-1}. This way, we can efficiently compute for each weight update how much comes from each gradient depth.

Interestingly, because there are C(2L, k) (2L choose k) paths of depth k (left plot), by sheer number of paths we would expect the decomposition to concentrate around depth L (ignoring cancellations/correlations), but this is not what we observe in practice: the actual decomposition of gradients by depth is much more shifted towards shallower depths (right plot), which suggests that after normalizing by the number of paths, shallower paths contribute gradients of much larger magnitude than deeper paths do.
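A toy numpy sketch of the dynamic program described above. As an illustrative simplification, the 2L Attn/MLP subblocks of a transformer are replaced by L generic residual subblocks x_{l+1} = x_l + tanh(W_l x_l) and a synthetic quadratic loss; the depth-table recurrence is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 3  # 4 residual subblocks, residual stream of width 3
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]

# Forward pass, keeping every residual-stream activation x_l.
x = rng.normal(size=d)
xs = [x]
for W in Ws:
    x = x + np.tanh(W @ x)
    xs.append(x)

# Loss 0.5*||x_L||^2, so the gradient at the output is x_L itself.
gL = xs[-1]

# Ordinary backprop: dL/dx_l = (I + J_l)^T dL/dx_{l+1},
# where J_l is the Jacobian of the nonlinear subblock tanh(W_l x_l).
g = gL
for l in reversed(range(L)):
    J = (1 - np.tanh(Ws[l] @ xs[l]) ** 2)[:, None] * Ws[l]
    g = g + J.T @ g  # identity (skip) path + nonlinear path

# Depth-decomposed backprop: table[k] holds the part of dL/dx_l that has
# passed through exactly k subblock Jacobians so far.
table = [gL] + [np.zeros(d) for _ in range(L)]
for l in reversed(range(L)):
    J = (1 - np.tanh(Ws[l] @ xs[l]) ** 2)[:, None] * Ws[l]
    # (dL/dx_l)^k = (dL/dx_{l+1})^k + J^T (dL/dx_{l+1})^{k-1}
    table = [table[k] + (J.T @ table[k - 1] if k > 0 else 0)
             for k in range(L + 1)]

# Because backprop is linear, the per-depth pieces sum to the full gradient,
# and the depth-0 entry is the pure skip-connection path.
assert np.allclose(sum(table), g)
assert np.allclose(table[0], gL)
```

The table costs L+1 gradients per layer instead of 2^L separate backward passes, which is the point of the dynamic program.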
Toviah Moldwin@TMoldwin·
@andrewgwils Alternatively, no it's not? If you want a model to be able to communicate its results, it of course has to be good at producing word sequences. What is happening internally is much more than next-word prediction; that's just the final layer.
Andrew Gordon Wilson@andrewgwils·
Being good at next word prediction is the opposite of what we want for creativity, for scientific breakthroughs.
Toviah Moldwin retweeted
Aakash Gupta@aakashgupta·
The average laptop screen ships at 400 to 500 nits of brightness. Direct sunlight requires 1,000 nits minimum to be readable. Most MacBook Airs top out around 480. So every person “working from the beach” is doing one of three things: squinting at a washed-out screen while cupping their hand over it like a visor, cranking brightness to max and watching the battery drain in 90 minutes, or posing for a photo with a screen that’s actually off. The photo in this tweet is perfect. Two guys in white button-downs at a beach table, hunched forward, clearly unable to see anything. One has sunglasses on, which makes the screen even darker. The other is eating a bagel and appears to have accepted his fate. The entire “laptop at the beach” aesthetic is a lighting trick. Every influencer photo is shot at golden hour or in shade. The second the sun is directly overhead, your $2,000 laptop becomes a $2,000 mirror. Apple, Dell, and Lenovo could fix this tomorrow with 1,500-nit panels. They don’t because the battery tradeoff would cut runtime in half, and “4 hours of battery life” doesn’t sell in a commercial. The remote work fantasy was always an indoor product sold with outdoor photography.
dr. z, esq.@zeynepmyenisey

Being on your laptop outside is a miserable experience and im tired of people pretending it's not

Toviah Moldwin@TMoldwin·
@rohanpaul_ai He did not say it was coming, he said 'I bet' it is coming. That does not give any new information, this was always a possibility that one could be optimistic about.
Rohan Paul@rohanpaul_ai·
Sam Altman just said in his new interview, that a new AI architecture is coming that will be a massive upgrade, just like Transformers were over Long Short-Term Memory. And also now the current class of frontier models are powerful enough to have the brainpower needed to help us research these ideas. His advice is to use the current AI to help you find that next giant step forward. --- From 'TreeHacks' YT Channel (link in comment)
Rohan Paul@rohanpaul_ai

Morgan Stanley predicts a massive AI breakthrough driven by a huge spike in computing power across major U.S. laboratories. Increasing the amount of hardware used for training by 10x can effectively double the intelligence of these models. The recently released GPT-5.4 Thinking model already matches human experts on professional tasks with a score of 83% on the GDPVal benchmark. The biggest hurdle for this growth is an energy crisis, with the U.S. power grid facing a shortfall of 18 gigawatts by December-28. To keep running, developers are bypassing the grid by taking over Bitcoin mining sites and using natural gas turbines for their AI factories. This shift is creating a solid investment cycle where 15-year leases on data centers generate high financial yields for every watt consumed. Large companies are already reducing their staff numbers because these new AI tools can perform professional work for a tiny fraction of the cost. Researchers expect AI to begin recursive self-improvement by June-27, meaning the software will autonomously upgrade its own code without human help. The future economy will likely treat raw intelligence as a commodity that is manufactured by these massive computing and energy clusters.

Toviah Moldwin@TMoldwin·
@robinhanson Yes, many Israelites were polytheistic at many points - see...the Bible. No, they were not supposed to be, see the second commandment.
James Lucas@JamesLucasIT·
What's the most profoundly beautiful piece of music you have ever listened to?
Toviah Moldwin@TMoldwin·
@KordingLab @TaliaRinger And while learning CS is valuable, realistically a lot of SWE-type jobs in the future will mainly be about 'how good are you at vibecoding/systems thinking'. You need less book knowledge for that, more raw experience/experimentation.
Toviah Moldwin@TMoldwin·
@KordingLab @TaliaRinger I don't understand the incentive of a student going into any sort of CS at all now. Not a good career path anymore, and also at the graduate level academia is trailing industry especially in AI/ML.
Toviah Moldwin@TMoldwin·
@robinhanson Standardized to what? The best hotel room I've ever seen? The worst? The median?
Robin Hanson@robinhanson·
Would you prefer that hotel rooms were more or less standardized than they are now?
Toviah Moldwin retweeted
Jesse Singal@jessesingal·
NEW RULE*: You can't make a sweeping claim about how AI doesn't *really* 'think' or 'understand' or 'reason' unless you have a specific definition of the term in question in mind! Otherwise the conversation will just spiral into vapidity. *that I am politely suggesting