Nathan Axcan

1.2K posts

@AxcanNathan

Entropy-neur | Gradient descent takes a photograph of the real world. | Formerly @tudelft, now LLM research @IBM (I do not represent IBM in any way)

Zurich, Switzerland · Joined August 2022
385 Following · 284 Followers
Pinned Tweet
Nathan Axcan
Nathan Axcan@AxcanNathan·
Some quests you may embark upon for me (should you find them worthy of your talents): 🪡🧵
English
2
0
4
2.7K
Nathan Axcan
Nathan Axcan@AxcanNathan·
The point of LLMs is access. Non-technical people, whoever, will use a Lovable-like, and the model providers benefit because the model will do more and more of the thinking. So it won't matter who gave access to information; what matters is just that the LLM API is getting access to more sensitive, rare, interesting information to be anonymized and used to train better models.
English
0
0
0
17
Nathan Axcan
Nathan Axcan@AxcanNathan·
Now we enter the territory of specific algorithms. DQN I believe to be the landmark result; it improves upon the earlier SARSA by changing the definition of value: SARSA says "the value of an action is how valuably I behave later, if I take that action" (you can read this as being explicitly on-policy), whereas DQN says "the value of an action is the value of the most valuable behavior I could have after taking that action" (a hypothetical "best behavior"; in practice, the maximum-value action once we're in the next state). This made DQN able to beat Atari games for the first time, so the lesson is something like "even just imagining a better policy than yourself is enough to achieve complex tasks"; "imagining" because the 'value' in 'maximum-value action' depends on your own estimation. Sounds terribly fundamental and profound.

Prior to DQN, a landmark result was superhuman TD-Gammon, which used a simpler approach, "eligibility traces", where the update rule says something like "take the sum of gradients of recent value estimations (the older the estimation, the more it's discounted), and decide whether to reinforce or discourage that behavior based on this step's increase or decrease in value compared to the last step". In simpler terms: "whether this step made things better or worse determines whether I'm happy or sad about decisions I made recently". But again, this is all about the model's own policy; a shortcoming in imagination, in retrospect!

Then we come to DQN modifications:
- Double DQN said "who chooses the 'best behavior' should be an agent learning on the spot, but who determines how good that behavior actually is should be an unchanging, older version of that agent", which interestingly decouples something like short-term learning and long-term learning (and yielded modest improvements on Atari; so it was a way to take advantage of more gradient steps that could be made, while remaining stable by connecting the short term to the long term. So short-term learning is necessarily eager?)
- Dueling DQN simply decomposes value estimation from "what is the value of this state-action pair" into "what is the value of this state, and for each possible action, does it make things better or worse". Probably because this fits the underlying data better, it allowed for better Atari gaming, so I would expect it to disappear at larger scales.

Something interesting: these architectures share the heuristic that "actions are solely determined by an equation involving value estimation and randomness", meaning there is no "irrational action" or "counter-intuitive action"; those are fully explained by randomness and are therefore uninteresting. Actor-critic architectures (baby of Sutton & Barto & Anderson) say there should be an action function, taking state and returning actions; value estimation is merely something the action function learns from. DQN did not support continuous actions though, so it was followed on the landmark-beaten path by DDPG, where we indeed have an "actor" action function and a "critic" value estimator. Because of the replay buffer it's off-policy, but we're no longer imagining a better policy than what we currently have.

Something interesting: the actor in this setup tries to maximize the value function's judgement of sampled actions, so this is like a contrastive-learning setup or a GAN, where one neural network is optimizing the outputs of another; generally an unstable thing, because NNs are not usually smooth!
English
0
0
0
13
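The SARSA vs DQN distinction above boils down to which bootstrap target you regress the Q-values toward. A minimal tabular sketch (the NumPy Q-table and indices are illustrative assumptions, not from any specific implementation):

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma=0.99):
    # on-policy: bootstrap from the action the agent will actually take next
    return r + gamma * q[s_next, a_next]

def dqn_target(q, r, s_next, gamma=0.99):
    # off-policy: bootstrap from the best action available in the next state,
    # i.e. an imagined "best behavior", regardless of what the policy does
    return r + gamma * np.max(q[s_next])
```

Since the max over next-state values dominates any particular action's value, the DQN target is always at least the SARSA target for the same transition; that optimism is the "imagining a better policy" in the thread.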
Nathan Axcan
Nathan Axcan@AxcanNathan·
Concerning Bellman error and policy: Bellman error originally requires you to know what your next most likely action would be; that's the "connection to the future", or "learning from the future", which allows you to learn from future reward. And if you set your gamma close enough to 1, you could propagate backwards a reward found millions of steps in the future. One big problem is that this is basically one-shot: after you've done one backward pass, probably at least one of those state transitions wouldn't have been the most likely action according to our new value function. But obviously there's still something to be learned here (one first-order gradient step doesn't usually grok anything), so how could we keep learning from the recorded rollout? Here are two obvious ideas:
1. Make the reward propagate backwards, but let it "fade" according to how likely our new Q was to pick the recorded action. This is importance sampling, and it is beautifully used in many algorithms: E_μ[f(x)·π(x)/μ(x)] = E_π[f(x)].
2. Only propagate backwards through those sub-sequences (like a substring) that still correspond to the new Q's most likely next action. This is Retrace(λ) (Munos et al., 2016), which found great results!
English
1
0
0
24
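The importance-sampling identity E_μ[f(x)·π(x)/μ(x)] = E_π[f(x)] quoted above can be checked numerically. A toy sketch where π, μ, and f are made-up numbers over three discrete actions:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])    # target policy (hypothetical numbers)
mu = np.array([1/3, 1/3, 1/3])    # behavior policy the data was sampled from
f = np.array([1.0, 5.0, 10.0])    # per-action quantity we care about

a = rng.choice(3, size=200_000, p=mu)   # rollouts collected under mu
est = np.mean(f[a] * pi[a] / mu[a])     # reweight: each sample "fades" by pi/mu
true = pi @ f                           # exact E_pi[f] = 2.7
```

With enough samples, `est` converges to `true` even though no sample was ever drawn from π; the ratio π/μ is exactly the "fading" factor in idea 1 above.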
Nathan Axcan
Nathan Axcan@AxcanNathan·
RL study notes, starting with gamma in the Bellman equation. The core idea seems to be that you want to plan for what you know (I can only plan based on the information I have available), and my available knowledge is necessarily placed behind a screen of uncertainty (back to Schopenhauer), which I can be aware of and try to consciously integrate into my plans. More specifically, gamma is part of the idea that if a reward happens further in the future, it is of lesser importance to our actions (so that's very heuristic-y), and I see two justifications for this:
1. Environments are probabilistic (or combinatorially explode), and we only care about the part of future predicted reward which protrudes above the "cone of noise".
2. Our understanding of the environment and its rewards is probabilistic, and we only care about that part which we trust can be predicted.
So this interpretation decides there's Brownian motion in the cumulative rewards of a rollout (1. realized and 2. imagined). This seems to be common even among alternatives, but something like the Pontryagin maximum principle doesn't actually have this aspect (and therefore does not interface with reality, which admits no closed-form formulation that is accessible to humans).
English
1
0
0
31
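Gamma's "fading importance" can be made concrete: a reward t steps away is weighted by gamma^t, which gives an effective planning horizon of roughly 1/(1-gamma) steps. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    # fold from the back: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# a constant reward of 1 forever sums to ~1/(1-gamma): the "horizon"
print(discounted_return([1.0] * 10_000, gamma=0.99))  # ~= 100
```

Rewards beyond that horizon sit below the "cone of noise" the notes describe: with gamma = 0.99, a reward 1,000 steps out contributes less than 0.005% of its face value.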
Nathan Axcan
Nathan Axcan@AxcanNathan·
Insane that DeepSeekV3.2-685B (37B active params) produces cheaper output tokens than the hybrid model Qwen3.5-27B. Good architecture on the right infra makes your model look scale-free. The limits of scaling model size will therefore not be found for a long time still. Data parallelism must be sacrificed.
Nathan Axcan tweet media
English
0
0
0
84
effectfully
effectfully@effectfully·
@thegeneralist01 "Between me and Mark Zuckerberg, we have a combined net worth of about $200 billion."
effectfully tweet media
English
4
0
91
14.1K
Nathan Axcan
Nathan Axcan@AxcanNathan·
@0xSero Please if you train a model use Engram layers they are absolutely perfect for local deployment
English
0
0
0
171
0xSero
0xSero@0xSero·
In 72 hours I got over 100k of value:
1. Lambda gave me $5000 in compute credits
2. Nvidia offered me 8x H100s on the cloud ($20/h); idk for how long, but assuming 2 weeks that'd be ~$5000
3. TNG Technology offered me 2 weeks of B200s, which is something like $12,000 in compute
4. A kind person offered me $100k in GCP credits (enough to train a 27B if you do it right)
5. Framework offered to mail me a desktop computer
6. We got $14,000 in donations, which will go to buying 2x RTX Pro 6000s (bringing me up to 384GB VRAM)
7. I got over 6M impressions, which based on my RPM would be $1500 over my usual ~$500 per pay period
8. I have gained ~17,000 followers, more than doubling my follower count
9. 17 subscribers on X + 700 on YouTube.
The total value of all this approaches at minimum ~$50,000, and closer to $150,000 if I leverage it all.
---------------------
What I'll be doing with all this: Eric is an incredibly driven researcher I have been bouncing ideas off of over the last month. He and I have been tackling the idea of getting massive models to fit on relatively cheap memory. The idea is taking advantage of different forms of memory, in combination with expert saliency scoring, to offload specific expert groupings to different memory tiers. For the MoEs I've tested over my entire AI session history, about 37.5% of the model is responsible for 95% of token routing. So we can offload 62.5% of an LLM onto SSD/NVMe/CPU/cheap VRAM; this should theoretically add minimal latency if we can select the right experts. We can combine this with paged swapping to further accelerate prompt processing. Done right, we are looking at very decent performance for massive unquantised & unpruned LLMs. You can get DeepSeek-v3.2-speciale at full intelligence with decent tokens/s, as long as you have enough VRAM to host the core 20-40% of the model and enough RAM or SSD to host the rest.
Add quantisation to the mix and you can basically have decent speeds and intelligence with just 5-10% of the model's size in VRAM (+ you need some for context). The funds will be used to push this to its limits.
-----------------
There's also tons of research showing you can quantise a model drastically, then distill from the original BF16 or make a LoRA to mostly align it back to the original. This will be added to the pipeline too.
------------------
All this will be built out here: github.com/0xSero/moe-com… You will be able to take any MoE and shove it in here, and compress it down with only 24GB and enough RAM/NVMe. It'll be slow as hell, but it will work with little tinkering.
------------------
Lastly, I will be looking into either a full training run from scratch, or just post-training on an open AMERICAN base model: a research model, an openclaw/nanoclaw/hermes model, a browser-use model. To prove that this can be done.
--------------------
I will be bad at all of it, and doubt I will get beyond the best small models from 6 months ago, but I want to prove to everyone who says otherwise that it's no boogeyman impossible task.
--------------------
By the end of the year:
1. I will have 1 model I trained in some capacity be in the top 5 at either pinchbench, browseruse, or research.
2. My GitHub will have a master repo which combines all my work into reusable, generalised scripts to help you do the same.
3. The largest public comparative dataset for all MoE quantisations, prunes, benchmarks, costs, and hardware requirements.
--------------------------
A lot of this will be led by Eric, who I will tag in the next post. I want to say thank you to everyone who has supported me. I have gotten a lot of comments stating:
1. I'm crazy, stupid, or both
2. I'm wasting my time, no one cares about this
3. This is not a real issue
I believe the amount of interest and support I've received says it all. donate.sybilsolutions.ai
0xSero tweet media
English
223
273
4.1K
166.5K
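The offloading idea in the tweet above (37.5% of experts handle 95% of routing) amounts to ranking experts by observed routing frequency and pinning only the hot ones in VRAM. A hypothetical sketch; `routing_counts`, the 0.375 split, and the tier names are assumptions drawn from the tweet, not any real library's API:

```python
import numpy as np

def tier_experts(routing_counts, vram_fraction=0.375):
    # rank experts by how often the router selected them (a saliency proxy)
    order = np.argsort(routing_counts)[::-1]          # most-routed first
    n_hot = max(1, int(round(len(order) * vram_fraction)))
    hot, cold = order[:n_hot], order[n_hot:]
    return hot, cold                                  # hot -> VRAM; cold -> RAM/NVMe

counts = np.array([10, 500, 3, 200, 7, 90, 1, 42])    # made-up routing stats
hot, cold = tier_experts(counts, vram_fraction=0.375)
```

If routing really is this skewed, most tokens only ever touch the hot tier, so the cold tier's slower memory is rarely on the critical path; the hard part the tweet gestures at is predicting which experts to prefetch when the tail is needed.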
Ekaeo
Ekaeo@Ekaeoq·
No sadder feeling than packing your life away, along with all the little knick knacks you’ve collected over the year. I hate moving, especially when it’s a nice place.
Ekaeo tweet media
English
9
14
552
17K
Sebastian Aaltonen
Sebastian Aaltonen@SebAaltonen·
MacBook Neo vs similar price Windows laptops (AMD, Intel, Qualcomm) - Neo has better build quality + display - Neo has worse MT perf + throttles, but better ST - iGPU is competitive - 8GB RAM bottleneck in some cases (web browsing timed link ->) youtu.be/f9fhCMBIbis?si…
YouTube video
Sebastian Aaltonen tweet media
English
15
8
105
49.4K
Charles Rosenbauer
Charles Rosenbauer@bzogrammer·
Anyone want to get added to a math/physics gc?
English
7
0
6
342
Tejes Srivalsan
Tejes Srivalsan@tejessrivalsan·
excited to announce that we’re open sourcing EGO-SNAKE the largest dataset of egocentric snake pov footage to train the next generation of autonomous vipers comment for a data sample
English
233
184
4.5K
639.3K
Nathan Axcan
Nathan Axcan@AxcanNathan·
The true openai golf challenge is getting a hold of the 8xH100. It's meant to filter for labs that are not too far from them, and crafty hackers.
English
0
0
0
41
Nathan Axcan
Nathan Axcan@AxcanNathan·
@Ekaeoq Good luck! That's super tiring.
English
0
0
0
13
Ekaeo
Ekaeo@Ekaeoq·
@AxcanNathan Currently packing my life into boxes! Will answer properly once I’m home
English
1
0
0
18
Ekaeo
Ekaeo@Ekaeoq·
As a watchmaker, buying a Swiss dust blower is basically a necessity. Not because it’s rubber and moves air, but because the air inside is Swiss, and therefore objectively better. Bergeon heritage matters.
Ekaeo tweet media
English
8
4
113
4.1K
Ekaeo
Ekaeo@Ekaeoq·
@AxcanNathan I mean that sounds great, but what does it mean?
English
1
0
0
44
Nathan Axcan
Nathan Axcan@AxcanNathan·
@teortaxesTex I think this is more productive work than Mamba tbh: x.com/MayankMish98/s… At least it's grounded in Chomsky hierarchy arguments that preach capabilities and generalization; now if only they would drop the linear attention layers.
Mayank Mishra@MayankMish98

Introducing M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling We bring back non-linear recurrence to language modeling and show it's been held back by small state sizes, not by non-linearity itself. 📄 Paper: arxiv.org/abs/2603.14360 💻 Code: github.com/open-lm-engine… 🤗 Models: huggingface.co/collections/op…

English
0
0
0
179
Nathan Axcan
Nathan Axcan@AxcanNathan·
@mertunsal2020 Well, that's great. Btw, I'm curious why only one eval is shown and why there's no tech report? Taking a previous model as an example, I would have done some more research using the unique Mamba-Codestral model if there had been a tech report.
English
1
0
0
24
Mert Ünsal
Mert Ünsal@mertunsal2020·
You have to compare under the same $ budget, in which case we're significantly better. The advantage of Lean is that you have a verifier, so you can just sample 2 times from our model and automatically pick the correct one. In other fields there's no good way of picking one of multiple answers, so formal math is unique in this sense. Once you decide on the $ you want to spend on a problem, you can either run a strong model fewer times or a weak model many times, and our model will give you better results under the same $ budget.
English
1
0
1
48
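The "sample twice and let the verifier pick" strategy above is just best-of-n with a checker. A toy sketch where `generate` and `verify` are hypothetical stand-ins for a prover model and a Lean-style verifier:

```python
def best_of_n(generate, verify, n):
    # draw up to n candidates; return the first one the verifier accepts
    for _ in range(n):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # budget exhausted with no verified answer
```

With a verifier, a weak model sampled many times can beat a strong model sampled once at the same $ budget; without one, there is no automatic way to pick among the n candidates, which is the tweet's point about formal math being unique.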
Nathan Axcan
Nathan Axcan@AxcanNathan·
...which means it's time to up the body's complexity! And we move to a quadruped. Seems to learn interesting policy variants. (btw this is still ALL running on a single M1 Max)
English
0
0
0
27
Nathan Axcan
Nathan Axcan@AxcanNathan·
well! now the policies learn to go in different directions, which is reassuring (we are creating "neuro-spatial partitions", a luxury in the real world!)
Nathan Axcan tweet media
English
1
0
0
19
Nathan Axcan
Nathan Axcan@AxcanNathan·
This paper (DIAYN) stayed stuck in my mind and it looks like the Genesis simulator is now a good quality codebase, I think it's the true future of robotics; can we re-implement it and scale it up (in minimal time using coding agents)? 🧵 x.com/AxcanNathan/st…
Nathan Axcan tweet media
Nathan Axcan@AxcanNathan

Remember this thing? After a year+: - Stable metal backend with easy visuals - runs great on my M1 Max - in Claude Code even GLM 4.7 was able to set up nice experiments Very cool if you don't wanna be locked into Isaac Gym!

English
1
0
0
105