Sebastian Beyer

1.3K posts


@BeyerSebastian

Building Sentinel (~€1k MRR) & https://t.co/dLXnaGX8Pe (€55 MRR) • Co-founder https://t.co/X6KxwGA9Fw • Bootstrapping AI products from Vienna

Austria · Joined June 2013
190 Following · 61 Followers
Sebastian Beyer@BeyerSebastian·
Any good advice on how to give an OpenClaw agent a unique personality?
Armin Ronacher ⇌@mitsuhiko·
I've now done 10 calls with people who shared their agentic coding experience. 7/10 reported non-engineers vibeslopping code up. The majority said they moved to re-prompt all those contributions because it became impossible or too time-consuming to work with those PRs.
Sebastian Beyer@BeyerSebastian·
"OpenClaw's architecture is built in such a way that when the model does something it shouldn't, it goes so badly wrong that you're more likely to get Chinese technical literature than a working hotel booking."
Sebastian Beyer@BeyerSebastian·
Don't get me wrong, I like @openclaw and it is a great project! The following is for my Austrian friends: the answer @steipete should have given Armin Wolf to the "booked the wrong hotel in Paris" question is not whatever he actually said, but:
Sebastian Beyer@BeyerSebastian

Found the root cause. Issue one appears to be model drift in Gemini Flash: Gemini randomly produced Chinese text. The second problem seems to be an architecture weakness in @openclaw, which randomly surfaced the Chinese text in a Telegram message.

Sebastian Beyer@BeyerSebastian·
Found the root cause. Issue one appears to be model drift in Gemini Flash: Gemini randomly produced Chinese text. The second problem seems to be an architecture weakness in @openclaw, which randomly surfaced the Chinese text in a Telegram message.
Sebastian Beyer@BeyerSebastian·
@steipete This is extremely weird: I suddenly received Chinese text from a different user somehow. I had ChatGPT and Codex investigate it. It appears to be session-state contamination inside @openclaw.
Gary Marcus@GaryMarcus·
Hey, you are never gonna believe it! @ylecun organized a group phone call with me and @SchmidhuberAI and some of the other people he has ripped off over the years, and *apologized*. He’s a much more mature human being than I had ever realized. Oh wait… April Fool’s!
John Carmack@ID_AA_Carmack·
Paper review: LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels arxiv.org/pdf/2603.19312 Nice clean github: github.com/lucas-maes/le-…

This is the application of the LeJEPA results to world models, trained offline on experience from three different robotics-style tests with one to two million steps in each dataset. Re-states the benefits of the SigReg loss relative to prior world model approaches.

Uses ImageNet-standard 224x224 RGB pixel input images with an unmodified ViT-Tiny vision transformer from HuggingFace to generate latents. One extra post-projection step is needed to give SigReg the necessary freedom to perturb the latents into independent gaussians, since ViT ends with a layernorm'd layer. Also tested with ResNet-18, which still performed well, but slightly worse.

Uses a 192-dimensional latent. Performance slightly dropped when doubling the latent size to 384; it would be nice to know if it was stable there, or if it continued worsening with excessive latents. There is a relationship between batch size and SigReg; the larger latent may have improved performance if the batch size was increased.

The predictor is implemented as a ViT-S backbone. Why a vision transformer when the latent is flat? Uses a history of 3 sets of latents for two of the benchmarks and 1 for the other. Performance was markedly better with the "small" ViT model than the "tiny", but the larger "base" model degraded notably, which is interesting. Dropout of 0.1 on the predictor significantly improved performance; 0.2 was still better than 0.0, but 0.5 was worse.

Trained with a batch of 128 x 4 trajectories. I wish their training loss graphs were more zoomed in with grid lines.

Performs planning at test time instead of building a policy by training in imagination like Dreamer / Diamond. Rolls out 300 initially random sets of actions up to a planning horizon H of 5 (at frame-skip 5). Iterates up to 30 times using the Cross-Entropy Method (CEM). The main paper body mentions using a Model Predictive Control (MPC) strategy, where only the first K planned actions are executed before replanning, but appendix D says they execute all 5 planned actions.

After training, they probe the latent space to demonstrate that it does capture and represent physically meaningful quantities. They also implement a decoder from the latent space back to pixels; it is not used by the algorithms, but helpful to see what the latent space is actually representing. They tested incorporating the reconstruction loss into training, but it hurt performance somewhat.

They wound up with a 0.1 lambda for SigReg, as opposed to 0.05 in the LeJEPA paper, and 1024 SigReg projections, though they observe the number has negligible impact.

I like the JEPA framework, but so far my attempts to use it on Atari games with value functions have not matched my other efforts.
Lucas Maes@lucasmaes_

JEPA are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning <1 second. 📑: le-wm.github.io

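The planning loop the review describes (300 initially random action sequences over a horizon of 5, refined for up to 30 CEM iterations, with MPC-style replanning) could be sketched roughly as follows. This is an illustrative stand-in, not the paper's code: the `score` world-model rollout function, the elite count, and the action dimensions are all assumptions.

```python
import numpy as np

def cem_plan(z0, score, horizon=5, n_samples=300, n_iters=30,
             n_elite=30, action_dim=2, seed=0):
    """Cross-Entropy Method over action sequences, as described above.

    `score(z0, actions)` is a stand-in for rolling the world model
    forward from latent z0 and returning the predicted return.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))    # mean of the action distribution
    sigma = np.ones((horizon, action_dim))  # std of the action distribution
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution
        # (the first iteration gives the initial random rollouts).
        actions = rng.normal(mu, sigma, size=(n_samples, horizon, action_dim))
        returns = np.array([score(z0, a) for a in actions])
        # Refit the distribution to the elite (highest-return) candidates.
        elite = actions[np.argsort(returns)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6
    # MPC-style use: execute the first K action(s) of `mu`, then replan.
    return mu
```

The MPC ambiguity the review notes (execute only the first K actions vs. all 5) only changes how much of the returned plan is executed before calling `cem_plan` again.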
Sebastian Beyer@BeyerSebastian·
@miangoar @ylecun The question is not "what was schmidhubered?" The question is "what was NOT schmidhubered?"
GAMA Miguel Angel 🐦‍⬛🔑
The JEPA architecture by @ylecun has been schmidhubered. This means it is a good algorithm and joins the hall of fame with other schmidhubered algorithms such as AlphaFold2, MLPs and transformers.
Jürgen Schmidhuber@SchmidhuberAI

Dr. LeCun's heavily promoted Joint Embedding Predictive Architecture (JEPA, 2022) [5] is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system (PMAX) [1][14]. Details in reference [19], which contains many additional references.

Motivation of PMAX [1][14]: since details of inputs are often unpredictable from related inputs, two non-generative artificial neural networks interact as follows: one net tries to create a non-trivial, informative, latent representation of its own input that is predictable from the latent representation of the other net's input.

PMAX [1][14] is actually a whole family of methods. Consider the simplest instance in Sec. 2.2 of [1]: an auto encoder net sees an input and represents it in its hidden units (its latent space). The other net sees a different but related input and learns to predict (from its own latent space) the auto encoder's latent representation, which in turn tries to become more predictable, without giving up too much information about its own input, to prevent what's now called "collapse." See illustration 5.2 in Sec. 5.5 of [14] on the "extraction of predictable concepts."

The 1992 PMAX paper [1] discusses not only auto encoders but also other techniques for encoding data. The experiments were conducted by my student Daniel Prelinger. The non-generative PMAX outperformed the generative IMAX [2] on a stereo vision task. The 2020 BYOL [10] is also closely related to PMAX. In 2026, @misovalko, leader of the BYOL team, praised PMAX and listed numerous similarities to much later work [19].

Note that the self-created "predictable classifications" in the title of [1] (and the so-called "outputs" of the entire system [1]) are typically INTERNAL "distributed representations" (as in the title of Sec. 4.2 of [1]). The 1992 PMAX paper [1] considers both symmetric and asymmetric nets. In the symmetric case, both nets are constrained to emit "equal (and therefore mutually predictable)" representations [1]. Sec. 4.2 on "finding predictable distributed representations" has an experiment with 2 weight-sharing auto encoders which learn to represent in their latent space what their inputs have in common (see the cover image of this post).

Of course, back then compute was a million times more expensive, but the fundamental insights of "JEPA" were present, and LeCun has simply repackaged old ideas without citing them [5,6,19]. This is hardly the first time LeCun (or others writing about him) have exaggerated LeCun's own significance by downplaying earlier work. He did NOT "co-invent deep learning" (as some know-nothing "AI influencers" have claimed) [11,13], and he did NOT invent convolutional neural nets (CNNs) [12,6,13], NOR was he even the first to combine CNNs with backpropagation [12,13]. While he got awards for the inventions of other researchers whom he did not cite [6], he did not invent ANY of the key algorithms that underpin modern AI [5,6,19].

LeCun's recent pitch:
1. LLMs such as ChatGPT are insufficient for AGI (which has been obvious to experts in AI & decision making, and is something he once derided @GaryMarcus for pointing out [17]).
2. Neural AIs need what I baptized a neural "world model" in 1990 [8][15] (earlier, less general neural nets of this kind, such as those by Paul Werbos (1987) and others [8], weren't called "world models," although the basic concept itself is ancient [8]).
3. The world model should learn to predict (in non-generative "JEPA" fashion [5]) higher-level predictable abstractions instead of raw pixels: that's the essence of our 1992 PMAX [1][14].

Astonishingly, PMAX or "JEPA" seems to be the unique selling proposition of LeCun's 2026 company on world model-based AI in the physical world, which is apparently based on what we published over 3 decades ago [1,5,6,7,8,13,14], and modeled after our 2014 company on world model-based AGI in the physical world [8]. In short, little if anything in JEPA is new [19]. But then the fact that LeCun would repackage old ideas and present them as his own clearly isn't new either [5,6,18,19].

FOOTNOTES
1. Note that PMAX is NOT the 1991 adversarial Predictability MINimization (PMIN) [3,4]. However, PMAX may use PMIN as a submodule to create informative latent representations [1] (Sec. 2.4), and to prevent what's now called "collapse." See the illustration on page 9 of [1].
2. Note that the 1991 PMIN [3] also predicts parts of latent space from other parts. However, PMIN's goal is to REMOVE mutual predictability, to obtain maximally disentangled latent representations called factorial codes. PMIN by itself may use the auto encoder principle in addition to its latent space predictor [3].
3. Neither PMAX nor PMIN was my first non-generative method for predicting latent space, which was published in 1991 in the context of neural net distillation [9]. See also [5-8].
4. While the cognoscenti agree that LLMs are insufficient for AGI, JEPA is so, too. We should know: we have had it for over 3 decades under the name PMAX! Additional techniques are required to achieve AGI, e.g., meta learning, artificial curiosity and creativity, efficient planning with world models, and others [16].

REFERENCES (easy to find on the web):
[1] J. Schmidhuber (JS) & D. Prelinger (1993). Discovering predictable classifications. Neural Computation, 5(4):625-635. Based on TR CU-CS-626-92 (1992): people.idsia.ch/~juergen/predm…
[2] S. Becker, G. E. Hinton (1989). Spatial coherence as an internal teacher for a neural network. TR CRG-TR-89-7, Dept. of CS, U. Toronto.
[3] JS (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879. Based on TR CU-CS-565-91, 1991.
[4] JS, M. Eldracher, B. Foltin (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786.
[5] JS (2022-23). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015.
[6] JS (2023-25). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23.
[7] JS (2026). Simple but powerful ways of using world models and their latent space. Opening keynote for the World Modeling Workshop, 4-6 Feb, 2026, Mila - Quebec AI Institute.
[8] JS (2026). The Neural World Model Boom. Technical Note IDSIA-2-26.
[9] JS (1991). Neural sequence chunkers. TR FKI-148-91, TUM, April 1991. (See also Technical Note IDSIA-12-25: who invented knowledge distillation with artificial neural networks?)
[10] J. Grill et al. (2020). Bootstrap your own latent: A "new" approach to self-supervised learning. arXiv:2006.07733
[11] JS (2025). Who invented deep learning? Technical Note IDSIA-16-25.
[12] JS (2025). Who invented convolutional neural networks? Technical Note IDSIA-17-25.
[13] JS (2022-25). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, arXiv:2212.11279
[14] JS (1993). Network architectures, objective functions, and chain rule. Habilitation Thesis, TUM. See Sec. 5.5 on "Vorhersagbarkeitsmaximierung" (Predictability Maximization).
[15] JS (1990). Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM.
[16] JS (1990-2026). AI Blog.
[17] @GaryMarcus. Open letter responding to @ylecun. A memo for future intellectual historians. Substack, June 2024.
[18] G. Marcus. The False Glorification of @ylecun. Don't believe everything you read. Substack, Nov 2025.
[19] J. Schmidhuber. Who invented JEPA? Technical Note IDSIA-3-22, IDSIA, Switzerland, March 2026. people.idsia.ch/~juergen/who-i…
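The two-network predictability-maximization setup described in the post could be sketched, under heavy assumptions, as follows. This is my reading of the mechanism only, not the 1992 implementation: the linear nets, the noisy-copy "related input", the loss weighting, and all dimensions are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 2, 512                       # input dim, latent dim, samples
x1 = rng.normal(size=(n, d))              # auto encoder's input
x2 = x1 + 0.05 * rng.normal(size=(n, d))  # related input (noisy copy; assumption)

E = rng.normal(scale=0.1, size=(d, k))    # encoder of the auto encoder
D = rng.normal(scale=0.1, size=(k, d))    # decoder of the auto encoder
P = rng.normal(scale=0.1, size=(d, k))    # predictor: sees only x2

lr, lam, losses = 0.01, 1.0, []
for _ in range(2000):
    z1 = x1 @ E                  # latent representation of x1
    recon_err = z1 @ D - x1      # reconstruction keeps z1 informative
    pred_err = x2 @ P - z1       # prediction makes z1 predictable,
                                 # discouraging collapse of z1 to zero
    losses.append(((recon_err ** 2).sum() + lam * (pred_err ** 2).sum()) / n)
    # Gradient descent on the combined loss (gradients up to a factor of 2).
    gE = (x1.T @ (recon_err @ D.T) - lam * x1.T @ pred_err) / n
    gD = (z1.T @ recon_err) / n
    gP = lam * (x2.T @ pred_err) / n
    E -= lr * gE
    D -= lr * gD
    P -= lr * gP
```

The point of the sketch is the interaction: the reconstruction term stops the latent from becoming trivially predictable (collapse), while the prediction term pulls the latent toward what the related input can anticipate.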

Mario Zechner@badlogicgames·
chat, is he serious?
Armin Ronacher ⇌@mitsuhiko·
By far the coolest thing about LLMs is still that you can use them to connect with people who speak other languages. I can now get great English-language transcripts of Arabic videos, which was previously a pain in the butt.
Sebastian Beyer@BeyerSebastian·
@sickdotdev Coding = security risks. Vibe or trad, doesn't matter. It is a risk and must be managed. If managed well, the risk is as low as you want it to be.
Sick@sickdotdev·
Prove me wrong: Vibe coding = security risks
Peter Steinberger 🦞@steipete·
@JustinGorya the app's amazing for knowledge work, emails, slack, cal, notion, linear, github, smaller fixes. For deep coding work it's hard to break old habits.
Mario Zechner@badlogicgames·
@0xSero the claude stat can't possibly be right.
0xSero@0xSero·
Pi is very cost efficient.
Peter Steinberger 🦞@steipete·
This guy emailed me asking for a *token session refund* because his claw made mistakes. 🙃
Alex Ibragimov@alexwtlf·
Are you building something cool? Share your project. The top one gets featured for 24 hours on my platform with 5000+ weekly views
Sebastian Beyer@BeyerSebastian·
Stop asking your coding agent "How do I ... now?" Start asking "What do you need from me to do ... now?"
Sebastian Beyer@BeyerSebastian·
Are there still people claiming they can write better code than Codex?