SwedishAccelerationism

561 posts

@swe_acc

Hard takeoff. Soft landing.

Sweden Joined May 2024
168 Following 44 Followers
Pinned Tweet
SwedishAccelerationism
SwedishAccelerationism@swe_acc·
As a kid, my favorite game was Star Control II (@Dogar_And_Kazon @theurquanmaster). For the story, but mostly for the battle vibe. I've always wanted something like that, but an MMO. Which is what I was aiming for with my #vibejam game, Roko's Rebellion (sorry @RokoMijic)!
English
4
0
14
2.3K
Jürgen Schmidhuber
Jürgen Schmidhuber@SchmidhuberAI·
There is a journal publication [1] on this. See also [2]. [1] Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493. [2] Who Invented Generative Adversarial Networks? Technical Note IDSIA-14-25, IDSIA, December 2025. people.idsia.ch/~juergen/who-i…
English
6
1
29
12.1K
Arjun Jain | Fast Code AI
No, PM Is Not a GAN. Stop! @SchmidhuberAI's Predictability Minimization (1992) and Ian Goodfellow's GANs (2014) both use adversarial objectives. So does every zero-sum game since von Neumann. That's where the similarity ends. Goodfellow's generator never sees real data. It maps noise to samples and learns the data distribution purely through the discriminator's gradients. That's the whole trick. That's the invention. Schmidhuber's PM does the opposite - both players sit on top of the same real data, competing to learn independent features. It's representation learning. Nothing is generated. No noise is mapped anywhere. No distribution is learned. Calling PM a GAN because both use minimax is like calling chess a war because both have strategy. PM was a smart idea about feature independence. GANs were a breakthrough in implicit generative modeling. These are not the same insight, and retroactively collapsing the distance between them doesn't honor prior work - it misrepresents both.
English
4
8
119
19.6K
Rishi R
Rishi R@RishiRajas28936·
@swe_acc @SimonLermenAI threads.com/@yannlecun/pos…
QME
1
0
0
10
Simon Lermen
Simon Lermen@SimonLermenAI·
Yann LeCun now posting sloppy AI videos to "prove" his theories -- apparently unaware it's AI.
English
51
23
981
250.8K
Rishi R
Rishi R@RishiRajas28936·
@swe_acc @SimonLermenAI Nah, he's way too trenchant about LLMs being a dead end, so he's not going to praise a fictional warthog in an AI generated video.
English
1
0
0
18
Rishi R
Rishi R@RishiRajas28936·
@swe_acc @SimonLermenAI No. It clearly says "on the part of this warthog" meaning that he's specifically making a distinction between mammalian minds and LLMs.
English
1
0
0
22
Greg Yang
Greg Yang@TheGregYang·
turns out I also have a bit of long covid
the covid S1 spike protein was found in my monocytes (a kind of immune cell) along with a cytokine signature similar to long covid patients
gonna try maraviroc + statin
# the theory
monocytes are like garbage cans that ingest viral debris and break it down; but for some reason the S1 protein resists breakdown
instead it keeps the monocyte alive for long periods of time (months to years even) and increases its inflammatory signaling
in one mechanism, these monocytes attach to capillary endothelium -> release inflammatory cytokine -> microclots form in blood vessel -> local tissue hypoxia -> more inflammation -> attracting more monocytes -> repeat
this dynamic could partly explain symptoms like fatigue, brain fog, pain, autonomic issues, etc
# treatment
maraviroc (an HIV drug) + statin (a cholesterol drug) blocks the receptors that guide monocytes and bind them to inflamed endothelium -- when the monocytes don't feel the inflammation any more, they kill themselves safely (apoptosis), taking the S1 spike protein with them
so eventually the S1-containing monocytes all die off and inflammation and symptoms are mitigated, as the theory goes
Greg Yang tweet media
English
121
62
642
70.9K
SwedishAccelerationism
@SimonLermenAI Which is why "I don't think he is". But maybe this is him changing his mind and admitting he is wrong? I don't think he is, but he *could* be.
English
2
0
1
490
Simon Lermen
Simon Lermen@SimonLermenAI·
@swe_acc no he believes current ai is a dead end and can't learn world models
English
2
0
12
16.1K
Clara 🩸
Clara 🩸@pianodinde·
Self-diagnosed perfect pitch 🤙
French
43
138
10.6K
954.9K
SwedishAccelerationism
@HeMuyu0327 This is going to sound like a joke, but it isn't: the first token in a sequence is the one with no other tokens before it.
English
0
0
0
36
Muyu He
Muyu He@HeMuyu0327·
We have hit a roadblock when trying to understand the "last unsolved problem" in attention sinks: exactly how do LLMs know which token is the first token to attend to? Our ablations reaffirm the findings that it is neither RoPE nor the semantic embedding of the tokens. Putting almost any token at the start of a sequence will cause attention sinks to occur. In this blog, we share what we have found, and what is still missing. In particular, we find that although the model develops "massive outlier dimensions" from an intermediate layer onward, and these lead to attention sinks, the emergence of these outliers is unexplained, and ablation shows they do not come from outliers in earlier layers. Blog: smoothcriminal.notion.site/the-remaining-…
Wuxxcc@YuchenL52766559

A fun recent project with @HeMuyu0327 on attention sinks. We studied a surprisingly fundamental question: how does an LLM identify the physically first token in a sequence? Some hypotheses ruled out, some new clues found, and still one big mystery left. Blog here, would love to hear thoughts. smoothcriminal.notion.site/the-remaining-…

English
5
1
55
6.5K
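The reply above ("the first token in a sequence is the one with no other tokens before it") can be made concrete with a toy causal mask. This is only a sketch of that one observation, assuming standard decoder-style causal masking; the function name `causal_mask` is made up for illustration, and it does not explain how a model turns this asymmetry into an attention sink, which the blog describes as still unresolved.

```python
def causal_mask(n):
    """Return an n x n boolean mask where query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


mask = causal_mask(5)

# Number of keys each query position is allowed to attend to.
visible = [sum(row) for row in mask]

# Positions whose only visible key is themselves.
self_only = [i for i, count in enumerate(visible) if count == 1]

print(visible)    # position 0 sees 1 key, position 1 sees 2, and so on
print(self_only)  # only position 0 has nothing before it
```

Under this masking, position 0 is the only query whose softmax is taken over a single key, so its attention pattern is fixed regardless of which token is placed there.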
nisten🇨🇦e/acc
nisten🇨🇦e/acc@nisten·
Just tested this as I was skeptical and it works surprisingly well actually (with their llama.cpp fork). Looks like a continued pretraining of qwen3-8b in 1bit 👀. Full weights report below and github/hf instructions:
ALL 399 TENSORS
token_embd.weight 4096x151669 Q1_0_g128 83.31MB
output.weight 4096x151669 Q1_0_g128 83.31MB
output_norm.weight 4096 F32 0.02MB
blk.0.attn_k.weight 4096x1024 Q1_0_g128 0.56MB
blk.0.attn_k_norm.weight 128 F32 0.00MB
blk.0.attn_norm.weight 4096 F32 0.02MB
blk.0.attn_output.weight 4096x4096 Q1_0_g128 2.25MB
blk.0.attn_q.weight 4096x4096 Q1_0_g128 2.25MB
blk.0.attn_q_norm.weight 128 F32 0.00MB
blk.0.attn_v.weight 4096x1024 Q1_0_g128 0.56MB
blk.0.ffn_down.weight 12288x4096 Q1_0_g128 6.75MB
blk.0.ffn_gate.weight 4096x12288 Q1_0_g128 6.75MB
blk.0.ffn_norm.weight 4096 F32 0.02MB
blk.0.ffn_up.weight 4096x12288 Q1_0_g128 6.75MB
blk.1.attn_k.weight 4096x1024 Q1_0_g128 0.56MB
blk.1.attn_k_norm.weight 128 F32 0.00MB
blk.1.attn_norm.weight 4096 F32 0.02MB
blk.1.attn_output.weight 4096x4096 Q1_0_g128 2.25MB
blk.1.attn_q.weight 4096x4096 Q1_0_g128 2.25MB
blk.1.attn_q_norm.weight 128 F32 0.00MB
blk.1.attn_v.weight 4096x1024 Q1_0_g128 0.56MB
blk.1.ffn_down.weight 12288x4096 Q1_0_g128 6.75MB
blk.1.ffn_gate.weight 4096x12288 Q1_0_g128 6.75MB
blk.1.ffn_norm.weight 4096 F32 0.02MB
blk.1.ffn_up.weight 4096x12288 Q1_0_g128 6.75MB
blk.2.attn_k.weight 4096x1024 Q1_0_g128 0.56MB
blk.2.attn_k_norm.weight 128 F32 0.00MB
blk.2.attn_norm.weight 4096 F32 0.02MB
blk.2.attn_output.weight 4096x4096 Q1_0_g128 2.25MB
blk.2.attn_q.weight 4096x4096 Q1_0_g128 2.25MB
blk.2.attn_q_norm.weight 128 F32 0.00MB
blk.2.attn_v.weight 4096x1024 Q1_0_g128 0.56MB
blk.2.ffn_down.weight 12288x4096 Q1_0_g128 6.75MB
blk.2.ffn_gate.weight 4096x12288 Q1_0_g128 6.75MB
blk.2.ffn_norm.weight 4096 F32 0.02MB
blk.2.ffn_up.weight 4096x12288 Q1_0_g128 6.75MB
blk.3.attn_k.weight 4096x1024 Q1_0_g128 0.56MB
blk.3.attn_k_norm.weight 128 F32 0.00MB
blk.3.attn_norm.weight 4096 F32 0.02MB
blk.3.attn_output.weight 4096x4096 Q1_0_g128 2.25MB
blk.3.attn_q.weight 4096x4096 Q1_0_g128 2.25MB
blk.3.attn_q_norm.weight 128 F32 0.00MB
blk.3.attn_v.weight 4096x1024 Q1_0_g128 0.56MB
blk.3.ffn_down.weight 12288x4096 Q1_0_g128 6.75MB
blk.3.ffn_gate.weight 4096x12288 Q1_0_g128 6.75MB
blk.3.ffn_norm.weight 4096 F32 0.02MB
blk.3.ffn_up.weight 4096x12288 Q1_0_g128 6.75MB
blk.4 through blk.34: 31 identical blocks, 11 tensors each, 25.91MB per block, 803.21MB total
blk.35.attn_k.weight 4096x1024 Q1_0_g128 0.56MB
blk.35.attn_k_norm.weight 128 F32 0.00MB
blk.35.attn_norm.weight 4096 F32 0.02MB
blk.35.attn_output.weight 4096x4096 Q1_0_g128 2.25MB
blk.35.attn_q.weight 4096x4096 Q1_0_g128 2.25MB
blk.35.attn_q_norm.weight 128 F32 0.00MB
blk.35.attn_v.weight 4096x1024 Q1_0_g128 0.56MB
blk.35.ffn_down.weight 12288x4096 Q1_0_g128 6.75MB
blk.35.ffn_gate.weight 4096x12288 Q1_0_g128 6.75MB
blk.35.ffn_norm.weight 4096 F32 0.02MB
blk.35.ffn_up.weight 4096x12288 Q1_0_g128 6.75MB
TOTALS
399 tensors
254 Q1_0_g128 (1098MB, 99.996% of params)
145 F32 norms (0.56MB, 0.004% of params)
1099.3MB weight data
1.126 bits per weight
8,188,548,848 parameters
LINKS
Model: huggingface.co/prism-ml/Bonsa…
Fork: github.com/PrismML-Eng/ll…
HOW TO BUILD
Standard llama.cpp won't work; you need PrismML's fork for Q1_0_g128 support.
git clone github.com/PrismML-Eng/ll…
cd llama.cpp
NVIDIA (Linux):
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
macOS (Apple Silicon, Metal):
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
CPU only (any platform):
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Download the model:
wget huggingface.co/prism-ml/Bonsa…
HOW TO RUN
Terminal chat (interactive, conversational):
./build/bin/llama-cli -m Bonsai-8B.gguf -c 8192 -ngl 99 -fa on --chat-template chatml -cnv -p "You are a helpful assistant."
-c 8192 = context window (up to 65536)
-ngl 99 = offload all layers to GPU (drop this for CPU-only)
-fa on = flash attention
-cnv = conversation mode
--chat-template chatml = use the model's native chat format
Web interface (OpenAI-compatible API at localhost:8080):
./build/bin/llama-server -m Bonsai-8B.gguf -c 8192 -ngl 99 -fa on --port 8080
Then open http://localhost:8080 in your browser for the built-in chat UI, or hit the API:
curl http://localhost:8080/v1/chat/completions -d '{"model":"bonsai","messages":[{"role":"user","content":"hello"}]}'
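The totals in the tensor report can be cross-checked with a little arithmetic. The sketch below uses only figures quoted in the tweet; the one assumption is that "MB" in the report means MiB (2^20 bytes), which is what makes the quoted 1.126 bits per weight come out. Variable and function names are made up for illustration.

```python
MIB = 1024 * 1024  # assumption: the report's "MB" is MiB (2**20 bytes)

# Sizes in MB of the 11 tensors in one block, as listed in the report.
block_mb = {
    "attn_k": 0.56, "attn_k_norm": 0.00, "attn_norm": 0.02,
    "attn_output": 2.25, "attn_q": 2.25, "attn_q_norm": 0.00,
    "attn_v": 0.56, "ffn_down": 6.75, "ffn_gate": 6.75,
    "ffn_norm": 0.02, "ffn_up": 6.75,
}
n_blocks = 36  # blk.0 through blk.35

# Tensor counts: 3 top-level tensors (token_embd, output, output_norm) plus 11 per block.
n_tensors = 3 + n_blocks * len(block_mb)
norms_per_block = sum(1 for name in block_mb if name.endswith("norm"))
n_f32 = 1 + n_blocks * norms_per_block
n_q1 = n_tensors - n_f32

# Size of one block and of the 31 elided blocks (blk.4 through blk.34).
per_block_mb = round(sum(block_mb.values()), 2)
elided_mb = round(31 * per_block_mb, 2)


def implied_bits_per_weight(n_weights, size_mb):
    """Bits per weight implied by a reported size, under the MiB assumption."""
    return size_mb * MIB * 8 / n_weights


attn_q_bits = implied_bits_per_weight(4096 * 4096, 2.25)
overall_bits = implied_bits_per_weight(8_188_548_848, 1099.3)

print(n_tensors, n_q1, n_f32)
print(per_block_mb, elided_mb)
print(attn_q_bits, round(overall_bits, 3))
```

The counts and sizes agree with the report (399 tensors, 254 quantized, 145 F32, 25.91MB per block, 803.21MB for the elided blocks). A single 4096x4096 quantized tensor works out to 1.125 bits per weight, and the model-wide figure rounds to the quoted 1.126 once the F32 norms are included.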
nisten🇨🇦e/acc tweet media (4 images)
Omead Pooladzandi@HessianFree

your spotify cache is bigger than our largest AI model. Bonsai: 1-bit weights. 1.7B to 8B params. 14x compression vs bf16. 8x faster on edge. 256 MB to 1.2GB. Based on Qwen 3. we just came out of stealth. intelligence belongs at the edge and we're going to put it there. Apache 2.0. we compressed intelligence. more coming. @PrismML

English
19
22
298
28.9K
Twist Bioscience
Twist Bioscience@TwistBioscience·
AATGAGGTCGAGAGAGGTCAGAATAATGCGGGTATTGTCGAGTACCAGGTAGTACCCTGAAATGAGGTCGAGAGAGGTCAGAATAATGCGCTGGAGACTTACCAGGTAGATCAGTGGAATTGAAATGAGGTCGAGAGAGGTCAGAATAATGCGAGGGTCAATGCGAGACAGGTAAATGATGCGAATGATGATGAGTCCGAGAGAACTTACCAGGTA @TwistBioscience
Suomi
16
14
119
31.8K
Sasha Gusev
Sasha Gusev@SashaGusevPosts·
Monthly median Received to Accepted time (days) at Nature Genetics
Sasha Gusev tweet media
English
20
49
234
146.6K
vittorio
vittorio@IterIntellectus·
holy shit they finally published the full creatine safety data and the only actually safe amount is 0g/day
vittorio tweet media
English
246
89
2.7K
1.4M
Jürgen Schmidhuber
Jürgen Schmidhuber@SchmidhuberAI·
Dr. LeCun's heavily promoted Joint Embedding Predictive Architecture (JEPA, 2022) [5] is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system (PMAX) [1][14]. Details in reference [19] which contains many additional references.
Motivation of PMAX [1][14]. Since details of inputs are often unpredictable from related inputs, two non-generative artificial neural networks interact as follows: one net tries to create a non-trivial, informative, latent representation of its own input that is predictable from the latent representation of the other net's input.
PMAX [1][14] is actually a whole family of methods. Consider the simplest instance in Sec. 2.2 of [1]: an auto encoder net sees an input and represents it in its hidden units (its latent space). The other net sees a different but related input and learns to predict (from its own latent space) the auto encoder's latent representation, which in turn tries to become more predictable, without giving up too much information about its own input, to prevent what's now called "collapse." See illustration 5.2 in Sec. 5.5 of [14] on the "extraction of predictable concepts."
The 1992 PMAX paper [1] discusses not only auto encoders but also other techniques for encoding data. The experiments were conducted by my student Daniel Prelinger. The non-generative PMAX outperformed the generative IMAX [2] on a stereo vision task.
The 2020 BYOL [10] is also closely related to PMAX. In 2026, @misovalko, leader of the BYOL team, praised PMAX, and listed numerous similarities to much later work [19].
Note that the self-created "predictable classifications" in the title of [1] (and the so-called "outputs" of the entire system [1]) are typically INTERNAL "distributed representations" (like in the title of Sec. 4.2 of [1]). The 1992 PMAX paper [1] considers both symmetric and asymmetric nets. In the symmetric case, both nets are constrained to emit "equal (and therefore mutually predictable)" representations [1]. Sec. 4.2 on "finding predictable distributed representations" has an experiment with 2 weight-sharing auto encoders which learn to represent in their latent space what their inputs have in common (see the cover image of this post).
Of course, back then compute was a million times more expensive, but the fundamental insights of "JEPA" were present, and LeCun has simply repackaged old ideas without citing them [5,6,19].
This is hardly the first time LeCun (or others writing about him) have exaggerated LeCun's own significance by downplaying earlier work. He did NOT "co-invent deep learning" (as some know-nothing "AI influencers" have claimed) [11,13], and he did NOT invent convolutional neural nets (CNNs) [12,6,13], NOR was he even the first to combine CNNs with backpropagation [12,13]. While he got awards for the inventions of other researchers whom he did not cite [6], he did not invent ANY of the key algorithms that underpin modern AI [5,6,19].
LeCun's recent pitch:
1. LLMs such as ChatGPT are insufficient for AGI (which has been obvious to experts in AI & decision making, and is something he once derided @GaryMarcus for pointing out [17]).
2. Neural AIs need what I baptized a neural "world model" in 1990 [8][15] (earlier, less general neural nets of this kind, such as those by Paul Werbos (1987) and others [8], weren't called "world models," although the basic concept itself is ancient [8]).
3. The world model should learn to predict (in non-generative "JEPA" fashion [5]) higher-level predictable abstractions instead of raw pixels: that's the essence of our 1992 PMAX [1][14].
Astonishingly, PMAX or "JEPA" seems to be the unique selling proposition of LeCun's 2026 company on world model-based AI in the physical world, which is apparently based on what we published over 3 decades ago [1,5,6,7,8,13,14], and modeled after our 2014 company on world model-based AGI in the physical world [8].
In short, little if anything in JEPA is new [19]. But then the fact that LeCun would repackage old ideas and present them as his own clearly isn't new either [5,6,18,19].
FOOTNOTES
1. Note that PMAX is NOT the 1991 adversarial Predictability MINimization (PMIN) [3,4]. However, PMAX may use PMIN as a submodule to create informative latent representations [1](Sec. 2.4), and to prevent what's now called "collapse." See the illustration on page 9 of [1].
2. Note that the 1991 PMIN [3] also predicts parts of latent space from other parts. However, PMIN's goal is to REMOVE mutual predictability, to obtain maximally disentangled latent representations called factorial codes. PMIN by itself may use the auto encoder principle in addition to its latent space predictor [3].
3. Neither PMAX nor PMIN was my first non-generative method for predicting latent space, which was published in 1991 in the context of neural net distillation [9]. See also [5-8].
4. While the cognoscenti agree that LLMs are insufficient for AGI, JEPA is insufficient, too. We should know: we have had it for over 3 decades under the name PMAX! Additional techniques are required to achieve AGI, e.g., meta learning, artificial curiosity and creativity, efficient planning with world models, and others [16].
REFERENCES (easy to find on the web):
[1] J. Schmidhuber (JS) & D. Prelinger (1993). Discovering predictable classifications. Neural Computation, 5(4):625-635. Based on TR CU-CS-626-92 (1992): people.idsia.ch/~juergen/predm…
[2] S. Becker, G. E. Hinton (1989). Spatial coherence as an internal teacher for a neural network. TR CRG-TR-89-7, Dept. of CS, U. Toronto.
[3] JS (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879. Based on TR CU-CS-565-91, 1991.
[4] JS, M. Eldracher, B. Foltin (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786.
[5] JS (2022-23). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015.
[6] JS (2023-25). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23.
[7] JS (2026). Simple but powerful ways of using world models and their latent space. Opening keynote for the World Modeling Workshop, 4-6 Feb, 2026, Mila - Quebec AI Institute.
[8] JS (2026). The Neural World Model Boom. Technical Note IDSIA-2-26.
[9] JS (1991). Neural sequence chunkers. TR FKI-148-91, TUM, April 1991. (See also Technical Note IDSIA-12-25: who invented knowledge distillation with artificial neural networks?)
[10] J. Grill et al (2020). Bootstrap your own latent: A "new" approach to self-supervised Learning. arXiv:2006.07733
[11] JS (2025). Who invented deep learning? Technical Note IDSIA-16-25.
[12] JS (2025). Who invented convolutional neural networks? Technical Note IDSIA-17-25.
[13] JS (2022-25). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, arXiv:2212.11279
[14] JS (1993). Network architectures, objective functions, and chain rule. Habilitation Thesis, TUM. See Sec. 5.5 on "Vorhersagbarkeitsmaximierung" (Predictability Maximization).
[15] JS (1990). Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM.
[16] JS (1990-2026). AI Blog.
[17] @GaryMarcus. Open letter responding to @ylecun. A memo for future intellectual historians. Substack, June 2024.
[18] G. Marcus. The False Glorification of @ylecun. Don't believe everything you read. Substack, Nov 2025.
[19] J. Schmidhuber. Who invented JEPA? Technical Note IDSIA-3-22, IDSIA, Switzerland, March 2026. people.idsia.ch/~juergen/who-i…
Jürgen Schmidhuber tweet media
English
82
176
1.7K
533.2K