Nathan Axcan

1.2K posts

@AxcanNathan

Entropy-neur | Gradient descent takes a photograph of the real world. | Formerly @tudelft, now LLM research @IBM (I do not represent IBM in any way)

Zurich, Switzerland · Joined August 2022
385 Following · 284 Followers
Pinned Tweet
Nathan Axcan
Nathan Axcan@AxcanNathan·
Some quests you may embark upon for me (should you find them worthy of your talents): 🪡🧵
English
2
0
4
2.7K
Nathan Axcan
Nathan Axcan@AxcanNathan·
The point of LLMs is access. Non-technical people, whoever, will use a Lovable-like, and the model providers benefit because the model will do more and more of the thinking. So it won't matter who gave access to information; what matters is just that the LLM API is getting access to more sensitive, rare, interesting information to be anonymized and used to train better models.
English
0
0
0
17
Nathan Axcan
Nathan Axcan@AxcanNathan·
Now we enter the territory of specific algorithms. DQN I believe to be the landmark result; it improves upon the earlier SARSA by changing the definition of value: SARSA says "the value of an action is how valuably I behave later, if I take that action" (you can read this as being explicitly on-policy), whereas DQN says "the value of an action is the value of the most valuable behavior I could have after taking that action" (a hypothetical "best behavior"; in practice, the maximum-value action once we're in the next state). This made DQN able to beat Atari games for the first time, so the lesson is something like "even just imagining a better policy than yourself is enough to achieve complex tasks"; "imagining" because the 'value' in 'maximum-value action' depends on your own estimation. Sounds terribly fundamental and profound.

Prior to DQN, a landmark result was superhuman TD-Gammon, which used a simpler approach, "eligibility traces", where the update rule says something like "take the sum of gradients of recent value estimations (the older the estimation, the more it's discounted), and decide whether to reinforce or discourage that behavior based on this step's increase or decrease in value compared to the last step". In simpler terms: "whether this step made things better or worse determines whether I'm happy or sad about decisions I made recently". But again, this is all about the model's own policy; a shortcoming in imagination, in retrospect!

Then we come to DQN modifications:
- Double DQN said "who chooses the 'best behavior' should be an agent learning on the spot, but who determines how good that behavior actually is should be an unchanging, older version of that agent", which interestingly decouples something like short-term learning and long-term learning (and yielded modest improvements on Atari; so it was a way to take advantage of more gradient steps that could be made, while remaining stable by connecting the short term to the long term. So short-term learning is necessarily eager?)
- Dueling DQN simply decomposes value estimation from "what is the value of this state-action pair" into "what is the value of this state, and for each possible action, does it make things better or worse". Probably because this fits the underlying data better, it allowed for better Atari gaming, so I would expect it to disappear at larger scales.

Something interesting: these architectures share the heuristic that "actions are solely determined by an equation involving value estimation and randomness", meaning there is no "irrational action" or "counter-intuitive action"; those are fully explained by randomness and are therefore uninteresting. Actor-critic architectures (baby of Sutton & Barto & Anderson) say there should be an action function, taking state and returning actions; value estimation is merely something the action function learns from. DQN did not support continuous actions though, so it was followed on the landmark-beaten path by DDPG, where we indeed have an "actor" action function and a "critic" value estimator. Because of the replay buffer it's off-policy, but we're no longer imagining a better policy than what we currently have.

Something interesting: the actor in this setup tries to maximize the value function's judgement of sampled actions, so this is like a contrastive-learning setup or a GAN, where one neural network is optimizing the outputs of another; generally an unstable thing, because NNs are not usually smooth!
English
0
0
0
13
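The SARSA vs DQN distinction above boils down to which bootstrap target you regress the Q-values toward. A minimal tabular sketch (the NumPy Q-table and indices are illustrative assumptions, not from any specific implementation):

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma=0.99):
    # on-policy: bootstrap from the action the agent will actually take next
    return r + gamma * q[s_next, a_next]

def dqn_target(q, r, s_next, gamma=0.99):
    # off-policy: bootstrap from the best action available in the next state,
    # i.e. an imagined "best behavior", regardless of what the policy does
    return r + gamma * np.max(q[s_next])
```

Since the max over next-state values dominates any particular action's value, the DQN target is always at least the SARSA target for the same transition; that optimism is the "imagining a better policy" in the thread.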
Nathan Axcan
Nathan Axcan@AxcanNathan·
Concerning Bellman error and policy: Bellman error originally requires you to know what your next most likely action would be; that's the "connection to the future", or "learning from the future", which allows you to learn from future reward. And if you set your gamma close enough to 1, you could propagate backwards a reward found millions of steps in the future. One big problem is that this is basically one-shot: after you've done one backward pass, probably at least one of those state transitions wouldn't have been the most likely action according to our new value function. But obviously there's still something to be learned here (one first-order gradient step doesn't usually grok anything), so how could we keep learning from the recorded rollout? Here are two obvious ideas:
1. Make the reward propagate backwards, but let it "fade" according to how likely our new Q was to pick the recorded action. This is importance sampling, and it is beautifully used in many algorithms: E_μ[f(x)·π(x)/μ(x)] = E_π[f(x)].
2. Only propagate backwards through those sub-sequences (like a substring) that still correspond to the new Q's most likely next action. This is Retrace(λ) (Munos et al., 2016), which found great results!
English
1
0
0
24
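The importance-sampling identity E_μ[f(x)·π(x)/μ(x)] = E_π[f(x)] quoted above can be checked numerically. A toy sketch where π, μ, and f are made-up numbers over three discrete actions:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])    # target policy (hypothetical numbers)
mu = np.array([1/3, 1/3, 1/3])    # behavior policy the data was sampled from
f = np.array([1.0, 5.0, 10.0])    # per-action quantity we care about

a = rng.choice(3, size=200_000, p=mu)   # rollouts collected under mu
est = np.mean(f[a] * pi[a] / mu[a])     # reweight: each sample "fades" by pi/mu
true = pi @ f                           # exact E_pi[f] = 2.7
```

With enough samples, `est` converges to `true` even though no sample was ever drawn from π; the ratio π/μ is exactly the "fading" factor in idea 1 above.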
Nathan Axcan
Nathan Axcan@AxcanNathan·
RL study notes, starting with gamma in the Bellman equation. The core idea seems to be that you want to plan for what you know (I can only plan based on the information I have available), and my available knowledge is necessarily placed behind a screen of uncertainty (back to Schopenhauer), which I can be aware of and try to consciously integrate into my plans. More specifically, gamma is part of the idea that if a reward happens further in the future, it is of lesser importance to our actions (so that's very heuristic-y), and I see two justifications for this:
1. Environments are probabilistic (or combinatorially explode), and we only care about the part of future predicted reward which protrudes above the "cone of noise".
2. Our understanding of the environment and its rewards is probabilistic, and we only care about that part which we trust can be predicted.
So this interpretation decides there's Brownian motion in the cumulative rewards of a rollout (1. realized and 2. imagined). This seems to be common even among alternatives, but something like the Pontryagin maximum principle doesn't actually have this aspect (and therefore does not interface with reality, which admits no closed-form formulation that is accessible to humans).
English
1
0
0
31
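Gamma's "fading importance" can be made concrete: a reward t steps away is weighted by gamma^t, which gives an effective planning horizon of roughly 1/(1-gamma) steps. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    # fold from the back: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# a constant reward of 1 forever sums to ~1/(1-gamma): the "horizon"
print(discounted_return([1.0] * 10_000, gamma=0.99))  # ~= 100
```

Rewards beyond that horizon sit below the "cone of noise" the notes describe: with gamma = 0.99, a reward 1,000 steps out contributes less than 0.005% of its face value.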
Nathan Axcan
Nathan Axcan@AxcanNathan·
Insane that DeepSeekV3.2-685B (37B active params) produces cheaper output tokens than the hybrid model Qwen3.5-27B. Good architecture on the right infra makes your model look scale-free. The limits of scaling model size will therefore not be found for a long time still. Data parallelism must be sacrificed.
Nathan Axcan tweet media
English
0
0
0
84
effectfully
effectfully@effectfully·
@thegeneralist01 "Between me and Mark Zuckerberg, we have a combined net worth of about $200 billion."
effectfully tweet media
English
4
0
91
14.1K
Nathan Axcan
Nathan Axcan@AxcanNathan·
@0xSero Please if you train a model use Engram layers they are absolutely perfect for local deployment
English
0
0
0
171
0xSero
0xSero@0xSero·
In 72 hours I got over 100k of value:
1. Lambda gave me $5000 in compute credits
2. Nvidia offered me 8x H100s on the cloud ($20/h); idk for how long, but assuming 2 weeks that'd be ~$5000
3. TNG Technology offered me 2 weeks of B200s, which is something like $12,000 in compute
4. A kind person offered me $100k in GCP credits (enough to train a 27B if you do it right)
5. Framework offered to mail me a desktop computer
6. We got $14,000 in donations, which will go to buying 2x RTX Pro 6000s (bringing me up to 384GB VRAM)
7. I got over 6M impressions, which based on my RPM would be $1500 over my usual ~$500 per pay period
8. I have gained ~17,000 followers, more than doubling my follower count
9. 17 subscribers on X + 700 on YouTube.
The total value of all this approaches at minimum ~$50,000, and closer to $150,000 if I leverage it all.
---------------------
What I'll be doing with all this: Eric is an incredibly driven researcher I have been bouncing ideas off of over the last month. He and I have been tackling the idea of getting massive models to fit on relatively cheap memory. The idea is taking advantage of different forms of memory, in combination with expert saliency scoring, to offload specific expert groupings to different memory tiers. For the MoEs I've tested over my entire AI session history, about 37.5% of the model is responsible for 95% of token routing. So we can offload 62.5% of an LLM onto SSD/NVMe/CPU/cheap VRAM; this should theoretically add minimal latency if we can select the right experts. We can combine this with paged swapping to further accelerate prompt processing. Done right, we are looking at very decent performance for massive unquantised & unpruned LLMs. You can get DeepSeek-v3.2-speciale at full intelligence with decent tokens/s, as long as you have enough VRAM to host the core 20-40% of the model and enough RAM or SSD to host the rest.
Add quantisation to the mix and you can basically have decent speeds and intelligence with just 5-10% of the model's size in VRAM (+ you need some for context). The funds will be used to push this to its limits.
-----------------
There's also tons of research showing you can quantise a model drastically, then distill from the original BF16 or make a LoRA to mostly align it back to the original. This will be added to the pipeline too.
------------------
All this will be built out here: github.com/0xSero/moe-com… You will be able to take any MoE and shove it in here, and compress it down with only 24GB and enough RAM/NVMe. It'll be slow as hell, but it will work with little tinkering.
------------------
Lastly, I will be looking into either a full training run from scratch, or just post-training on an open AMERICAN base model: a research model, an openclaw/nanoclaw/hermes model, a browser-use model. To prove that this can be done.
--------------------
I will be bad at all of it, and doubt I will get beyond the best small models from 6 months ago, but I want to prove to everyone who says otherwise that it's no boogeyman impossible task.
--------------------
By the end of the year:
1. I will have 1 model I trained in some capacity be in the top 5 at either pinchbench, browseruse, or research.
2. My GitHub will have a master repo which combines all my work into reusable, generalised scripts to help you do the same.
3. The largest public comparative dataset for all MoE quantisations, prunes, benchmarks, costs, and hardware requirements.
--------------------------
A lot of this will be led by Eric, who I will tag in the next post. I want to say thank you to everyone who has supported me. I have gotten a lot of comments stating:
1. I'm crazy, stupid, or both
2. I'm wasting my time, no one cares about this
3. This is not a real issue
I believe the amount of interest and support I've received says it all. donate.sybilsolutions.ai
0xSero tweet media
English
223
273
4.1K
166.5K
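The offloading idea in the tweet above (37.5% of experts handle 95% of routing) amounts to ranking experts by observed routing frequency and pinning only the hot ones in VRAM. A hypothetical sketch; `routing_counts`, the 0.375 split, and the tier names are assumptions drawn from the tweet, not any real library's API:

```python
import numpy as np

def tier_experts(routing_counts, vram_fraction=0.375):
    # rank experts by how often the router selected them (a saliency proxy)
    order = np.argsort(routing_counts)[::-1]          # most-routed first
    n_hot = max(1, int(round(len(order) * vram_fraction)))
    hot, cold = order[:n_hot], order[n_hot:]
    return hot, cold                                  # hot -> VRAM; cold -> RAM/NVMe

counts = np.array([10, 500, 3, 200, 7, 90, 1, 42])    # made-up routing stats
hot, cold = tier_experts(counts, vram_fraction=0.375)
```

If routing really is this skewed, most tokens only ever touch the hot tier, so the cold tier's slower memory is rarely on the critical path; the hard part the tweet gestures at is predicting which experts to prefetch when the tail is needed.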
Ekaeo
Ekaeo@Ekaeoq·
No sadder feeling than packing your life away, along with all the little knick knacks you’ve collected over the year. I hate moving, especially when it’s a nice place.
Ekaeo tweet media
English
9
14
552
17K
Sebastian Aaltonen
Sebastian Aaltonen@SebAaltonen·
MacBook Neo vs similar price Windows laptops (AMD, Intel, Qualcomm) - Neo has better build quality + display - Neo has worse MT perf + throttles, but better ST - iGPU is competitive - 8GB RAM bottleneck in some cases (web browsing timed link ->) youtu.be/f9fhCMBIbis?si…
YouTube video
Sebastian Aaltonen tweet media
English
15
8
105
49.4K
Charles Rosenbauer
Charles Rosenbauer@bzogrammer·
Anyone want to get added to a math/physics gc?
English
7
0
6
342
Tejes Srivalsan
Tejes Srivalsan@tejessrivalsan·
excited to announce that we’re open sourcing EGO-SNAKE the largest dataset of egocentric snake pov footage to train the next generation of autonomous vipers comment for a data sample
English
233
184
4.5K
639.3K
Nathan Axcan
Nathan Axcan@AxcanNathan·
The true openai golf challenge is getting a hold of the 8xH100. It's meant to filter for labs that are not too far from them, and crafty hackers.
English
0
0
0
41
Nathan Axcan
Nathan Axcan@AxcanNathan·
@Ekaeoq Good luck! That's super tiring.
English
0
0
0
13
Ekaeo
Ekaeo@Ekaeoq·
@AxcanNathan Currently packing my life into boxes! Will answer properly once I’m home
English
1
0
0
18
Ekaeo
Ekaeo@Ekaeoq·
As a watchmaker, buying a Swiss dust blower is basically a necessity. Not because it’s rubber and moves air, but because the air inside is Swiss, and therefore objectively better. Bergeon heritage matters.
Ekaeo tweet media
English
8
4
113
4.1K
Ekaeo
Ekaeo@Ekaeoq·
@AxcanNathan I mean that sounds great, but what does it mean?
English
1
0
0
44
Nathan Axcan
Nathan Axcan@AxcanNathan·
@teortaxesTex I think this is more productive work than Mamba tbh: x.com/MayankMish98/s… At least it's grounded in Chomsky hierarchy arguments that preach capabilities and generalization; now if only they would drop the linear attention layers.
Mayank Mishra@MayankMish98

Introducing M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling We bring back non-linear recurrence to language modeling and show it's been held back by small state sizes, not by non-linearity itself. 📄 Paper: arxiv.org/abs/2603.14360 💻 Code: github.com/open-lm-engine… 🤗 Models: huggingface.co/collections/op…

English
0
0
0
179
Nathan Axcan
Nathan Axcan@AxcanNathan·
@mertunsal2020 Well, that's great. Btw, I'm curious why only one eval is shown and why there's no tech report? Taking a previous model as an example, I would have done some more research using the unique Mamba-Codestral model if there had been a tech report.
English
1
0
0
24
Mert Ünsal
Mert Ünsal@mertunsal2020·
You have to compare under the same $ budget, in which case we're significantly better. The advantage of Lean is that you have a verifier, so you can just sample 2 times from our model and automatically pick the correct one. In other fields there's no good way of picking one of multiple answers, so formal math is unique in this sense. Once you decide on the $ you want to spend on a problem, you can either run a strong model fewer times or a weak model many times, and our model will give you better results under the same $ budget.
English
1
0
1
48
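The "sample twice and let the verifier pick" strategy above is just best-of-n with a checker. A toy sketch where `generate` and `verify` are hypothetical stand-ins for a prover model and a Lean-style verifier:

```python
def best_of_n(generate, verify, n):
    # draw up to n candidates; return the first one the verifier accepts
    for _ in range(n):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # budget exhausted with no verified answer
```

With a verifier, a weak model sampled many times can beat a strong model sampled once at the same $ budget; without one, there is no automatic way to pick among the n candidates, which is the tweet's point about formal math being unique.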
Nathan Axcan
Nathan Axcan@AxcanNathan·
...which means it's time to up the body's complexity! And we move to a quadruped. Seems to learn interesting policy variants. (btw this is still ALL running on a single M1 Max)
English
0
0
0
27
Nathan Axcan
Nathan Axcan@AxcanNathan·
well! now the policies learn to go in different directions, which is reassuring (we are creating "neuro-spatial partitions", a luxury in the real world!)
Nathan Axcan tweet media
English
1
0
0
19
Nathan Axcan
Nathan Axcan@AxcanNathan·
This paper (DIAYN) stayed stuck in my mind and it looks like the Genesis simulator is now a good quality codebase, I think it's the true future of robotics; can we re-implement it and scale it up (in minimal time using coding agents)? 🧵 x.com/AxcanNathan/st…
Nathan Axcan tweet media
Nathan Axcan@AxcanNathan

Remember this thing? After a year+: - Stable metal backend with easy visuals - runs great on my M1 Max - in Claude Code even GLM 4.7 was able to set up nice experiments Very cool if you don't wanna be locked into Isaac Gym!

English
1
0
0
105