Alex Speicher (@AlxSp_)
25 posts · Joined October 2017 · 65 Following · 17 Followers

Alex Speicher @AlxSp_:
@evilmathkid You can now try bi-directional attention in the input since the model doesn’t have to predict it. That should help.
0 replies · 0 reposts · 1 like · 118 views

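A minimal sketch of the idea in the reply above: a prefix-LM-style attention mask that is bi-directional over the input (which the model never has to predict) and causal over the output. The function name and shapes are illustrative assumptions, not anything from the thread.

```python
import torch

def prefix_lm_mask(input_len: int, output_len: int) -> torch.Tensor:
    """Boolean attention mask: input tokens attend bi-directionally to the
    whole input, output tokens attend causally to everything before them.
    True = this (query, key) pair is allowed to attend."""
    total = input_len + output_len
    mask = torch.ones(total, total).tril().bool()  # standard causal base
    mask[:, :input_len] = True  # make the input prefix fully visible (bi-directional)
    return mask

# Example: 3 input tokens followed by 2 output tokens.
print(prefix_lm_mask(3, 2).int())
```
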
Mithil Vakde @evilmathkid:
A big change -> no training on inputs. Interestingly, this makes the test loss much worse, and yet it scores better! Clearly a compression framework / val loss alone isn't a perfect metric for sample efficiency. (I am bullish input training will make a comeback though.)
3 replies · 1 repost · 40 likes · 13.3K views

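A minimal sketch of what "no training on inputs" usually means in practice: next-token cross-entropy with the loss zeroed out on input positions, so the model is only supervised on the output. The names and shapes here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def output_only_loss(logits: torch.Tensor, targets: torch.Tensor,
                     is_input: torch.Tensor) -> torch.Tensor:
    """Cross-entropy averaged over output positions only.

    logits:   (batch, seq, vocab) next-token predictions
    targets:  (batch, seq) target token ids
    is_input: (batch, seq) bool, True where the target token belongs to the
              input and should not contribute to the loss
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    keep = ~is_input                                   # supervise outputs only
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```
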
Mithil Vakde @evilmathkid:
44% on ARC-AGI-1 for 67 cents! Trained from scratch in 2 hrs on a 5090. Matches TRM, beats HRM, and is way faster & cheaper. No recursion, just a transformer. Also, 7% on ARC-2 🧵
30 replies · 73 reposts · 679 likes · 54.6K views

Alex Speicher @AlxSp_:
@Teknium That's a great insight that I haven't seen explicitly stated before, thanks
0 replies · 0 reposts · 1 like · 69 views

Teknium (e/λ) @Teknium:
Reasoning in LLMs has actually broken at least one intuition about data that I thought I was confident in. Prior to reasoning models, there was a lot I could predict based on the data that went in, such as average output lengths and limits on how many tokens they'd generate. It used to be that if you trained on outputs of at most 4k tokens, you'd have a near-0% chance of generating 10k+ tokens.

But with reasoning models, they actually learn a function through this data that can generate way, way beyond the output lengths you trained on. I think this justifies calling it "reasoning", because the model actually learned a function similar to reasoning: it generates tokens that look like thinking to improve accuracy until it is confident it has found the correct answer, and even if you train on at most 10k CoT tokens, models will still think, potentially through the entire 128k+ context length they have.

Something else interesting about "reasoning": when scaling Hermes 4 from 14B, to 70B, to 405B, we observed that thinking lengths went down and down on the same set of problems as the model got bigger. This also implies that the reasoning process is very much tied to innate intelligence, because the same problem is, relative to each model, a different difficulty, and the model literally *thinks longer* if it is less intelligent!

Just some fun facts for you on this Sunday :)
46 replies · 41 reposts · 660 likes · 44.6K views

Alex Speicher @AlxSp_:
@yacinelearning @suchenzang Yeah, I would view the test-time training in those methods the same way large-scale AR models are expected to learn new patterns and logic through in-context learning.
1 reply · 0 reposts · 2 likes · 106 views

Yacine Mahdid @yacinelearning:
@suchenzang I thought it was weird too at first, but after diving into the methods of HRM + TRM + CompressARC it actually makes sense.
2 replies · 0 reposts · 5 likes · 953 views

Susan Zhang @suchenzang:
we need more AI people to join community notes... kind of crazy how many amplified a plot that went into negative cost territory, with a thread about training on test
Quoting Mithil Vakde @evilmathkid:
"Announcing a new Pareto frontier on ARC-AGI: 27.5% for just $2, 333x cheaper than TRM! Beats every non-thinking LLM in existence. Cost so low it's literally off the chart. Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source. Thread:"
35 replies · 11 reposts · 206 likes · 96.3K views

Alex Speicher @AlxSp_:
@_arohan_ From a quick skim of the blog, I would assume it’s the task embedding (similar to HRM, TRM) that causes this performance.
0 replies · 0 reposts · 2 likes · 301 views

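For context, a minimal sketch of what a learned per-task embedding (in the spirit of HRM/TRM, as referenced above) can look like: each task id indexes a trainable vector that is prepended to the grid-token embeddings. The module and shapes are illustrative assumptions, not taken from any of the cited code.

```python
import torch
import torch.nn as nn

class TaskConditionedEmbedding(nn.Module):
    """Toy example: one learned embedding per ARC task, prepended to the
    token embeddings so the network can specialize per task."""

    def __init__(self, num_tasks: int, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, d_model)    # one vector per task
        self.token_embed = nn.Embedding(vocab_size, d_model)  # grid-cell tokens

    def forward(self, task_id: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # task_id: (batch,), tokens: (batch, seq)
        tok = self.token_embed(tokens)                # (batch, seq, d_model)
        task = self.task_embed(task_id).unsqueeze(1)  # (batch, 1, d_model)
        return torch.cat([task, tok], dim=1)          # prepend the task vector
```
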
rohan anil @_arohan_:
With so many years of work on arc-agi-1 plus hill climbing, and with many corps using it to showcase model capability, a simple transformer with the most obvious data representation has completely shaken up a cottage industry?
Quoting Mithil Vakde @evilmathkid:
"Announcing a new Pareto frontier on ARC-AGI: 27.5% for just $2, 333x cheaper than TRM! Beats every non-thinking LLM in existence. Cost so low it's literally off the chart. Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source. Thread:"
6 replies · 2 reposts · 60 likes · 12.9K views

Teknium (e/λ) @Teknium:
I wonder if openai has an entire team of people just working on improving arc-agi tasks
4 replies · 0 reposts · 79 likes · 7K views

Alex Speicher @AlxSp_:
@francoisfleuret You could split x_t into two tokens {x_t_in, x_t+1_out}, where x_t_in's KV is kept for the following tokens, and x_t+1_out is discarded after predicting the next token (not visible to any other tokens). This obviously isn't very performant during training, as it doubles the context.
0 replies · 0 reposts · 0 likes · 340 views

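A minimal sketch of the attention mask that the two-token split above implies, assuming an interleaved [in_1, out_1, in_2, out_2, ...] layout: "in" tokens attend causally to earlier "in" tokens, "out" tokens see the input up to their position plus themselves, and nothing ever attends to an "out" token. Names and layout are illustrative.

```python
import torch

def split_token_mask(seq_len: int) -> torch.Tensor:
    """Mask for the interleaved [in_1, out_1, in_2, out_2, ...] layout.
    True = attention allowed from the query (row) to the key (column)."""
    total = 2 * seq_len
    is_in = torch.arange(total) % 2 == 0                # even slots are 'in' tokens
    causal = torch.ones(total, total).tril().bool()     # no looking ahead
    # Only 'in' tokens (or the token itself) may serve as keys,
    # so 'out' tokens are invisible to every other position.
    visible_keys = is_in.unsqueeze(0) | torch.eye(total, dtype=torch.bool)
    return causal & visible_keys

print(split_token_mask(3).int())
```
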
François Fleuret @francoisfleuret:
I really don't like that in the first layers X_t should be the representation of token t and gradually becomes that of token t+1 in the last layer. It makes absolutely no sense, it is objectively repugnant.
30 replies · 7 reposts · 157 likes · 17.5K views

Alex Speicher @AlxSp_:
@francoisfleuret @0xHenriksson Isn't that basically life in general? Life maintains its low entropy in a small spot (its body) while emitting energy, which increases entropy overall.
0 replies · 0 reposts · 2 likes · 81 views

François Fleuret @francoisfleuret:
@0xHenriksson I think I get you, but I find it weird that you can dump radiation into a vacuum. You just send off photons that will travel forever, and you've reduced your own entropy?
9 replies · 0 reposts · 3 likes · 882 views

François Fleuret @francoisfleuret:
Something I don't understand in physics: since you can emit radiation to cool down (e.g. the Earth), isn't that like reducing entropy *in a vacuum*? How is that consistent with the second law of thermodynamics?
32 replies · 3 reposts · 37 likes · 15.3K views

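The standard textbook answer (not stated in the thread, but it resolves the question): the radiation itself carries entropy away. For black-body emission of energy Q from a body at temperature T,

```latex
\Delta S_{\text{tot}}
  = \underbrace{-\frac{Q}{T}}_{\text{body cools}}
  \;+\; \underbrace{\frac{4}{3}\,\frac{Q}{T}}_{\text{entropy of the emitted radiation}}
  \;=\; \frac{Q}{3T} \;\ge\; 0 ,
```

so the emitter's entropy does drop, but the photons travelling off into the vacuum carry more than enough entropy to keep the total non-decreasing.
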
kalomaze @kalomaze:
what the fuck is a qkqkv_proj
2 replies · 0 reposts · 15 likes · 606 views

Alex Speicher @AlxSp_:
@willccbb I don’t think flow charts are the best option. They look pretty at first but turn into a mess
0 replies · 0 reposts · 0 likes · 134 views

will brown @willccbb:
honestly i can see this being a smash hit
101 replies · 13 reposts · 756 likes · 92.5K views

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr:
I am looking to get new noise-cancelling Bluetooth headphones with high-quality sound... does anyone have suggestions? since I don't have Apple products, anything apart from AirPods Max
38 replies · 0 reposts · 39 likes · 23.3K views

Alex Speicher @AlxSp_:
@hi_tysam @_arohan_ How much do you think the hierarchical part of the architecture actually improves it? Seems like injecting the input embed into each recurrent block stabilizes the training a lot already
1 reply · 0 reposts · 1 like · 46 views

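A minimal sketch of the "inject the input embedding into each recurrent block" idea being discussed: the same block is applied repeatedly to a latent state, with the (fixed) input embedding re-added at every step so the signal can't wash out over iterations. This is an illustrative toy, not HRM/TRM code.

```python
import torch
import torch.nn as nn

class InjectedRecurrentCore(nn.Module):
    """Toy recurrent refinement loop with the input embedding re-injected
    at every iteration."""

    def __init__(self, d_model: int = 256, n_steps: int = 8):
        super().__init__()
        self.n_steps = n_steps
        self.block = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, input_embed: torch.Tensor) -> torch.Tensor:
        state = torch.zeros_like(input_embed)
        for _ in range(self.n_steps):
            # Re-inject the input embedding on every recurrent step.
            state = state + self.block(state + input_embed)
        return state
```
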
Fern @hi_tysam:
@_arohan_ definitely having something that is latent-space structure invariant is the way to go IMO, by a long shot
1 reply · 0 reposts · 2 likes · 145 views

Alex Speicher @AlxSp_:
@cloneofsimo @lineardiff Could be that just injecting the input embed into each recurrent block is what makes the arch work, and the hierarchical part is just being "fancy".
0 replies · 0 reposts · 1 like · 34 views

Alex Speicher @AlxSp_:
@cloneofsimo @lineardiff Yeah, they do use the entire "train" part of both the train & eval sets in ARC, plus data augmentation. To me, the main interesting question in their paper is whether this hierarchical structure scales better than simpler architectures like arxiv.org/pdf/2502.05171.
1 reply · 0 reposts · 1 like · 57 views

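For reference, a minimal sketch of the kind of data augmentation commonly used on ARC grids (an illustrative recipe, not the exact one from any cited paper): the same random dihedral transform and color relabelling applied to both grids of a demonstration pair.

```python
import random
import numpy as np

def augment_arc_pair(inp: np.ndarray, out: np.ndarray, n_colors: int = 10):
    """Apply one random dihedral transform plus a color permutation,
    consistently to both the input and output grid of an ARC pair.
    Grids are small integer arrays of color ids in [0, n_colors)."""
    k = random.randrange(4)            # rotate by k * 90 degrees
    flip = random.random() < 0.5       # optionally mirror horizontally
    perm = np.random.permutation(n_colors)

    def transform(grid: np.ndarray) -> np.ndarray:
        g = np.rot90(grid, k)
        if flip:
            g = np.fliplr(g)
        return perm[g]                 # relabel colors consistently

    return transform(inp), transform(out)
```
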
Alex Speicher reposted
Xuandong Zhao @xuandongzhao:
🚀 Excited to share the most inspiring work I’ve been part of this year: "Learning to Reason without External Rewards" TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. 1/n
86 replies · 501 reposts · 3.5K likes · 572.8K views

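One way such an internal confidence signal is often formalized (an illustrative sketch only, not necessarily the paper's exact objective): score a generated answer by the model's own average log-probability on the tokens it sampled, and feed that scalar into an RL loop in place of an external reward.

```python
import torch
import torch.nn.functional as F

def self_confidence_reward(logits: torch.Tensor, chosen: torch.Tensor) -> torch.Tensor:
    """Mean log-probability the model assigned to its own sampled tokens.
    Higher = the model was more 'confident' in its answer.

    logits: (seq, vocab) logits at each generation step
    chosen: (seq,) the token ids that were actually sampled
    """
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_lp = log_probs.gather(-1, chosen.unsqueeze(-1)).squeeze(-1)
    return chosen_lp.mean()
```
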
Alex Speicher @AlxSp_:
@iScienceLuvr Thank you for the inspiring talk! I had a great time at the hackathon and hopefully we’ll cross paths again at some other hackathon/meetup in the future
0 replies · 0 reposts · 1 like · 49 views

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr:
I'm glad that my talk inspired a team to build a medical RL environment that won 2nd place!
Quoting Nous Research @NousResearch (the hackathon recap thread, reposted in full below)
6 replies · 6 reposts · 96 likes · 20.1K views

Alex Speicher reposted
Nous Research @NousResearch:
Nous Research's RL Environments Hackathon recap thread! Starting with the stars of the show, the winners!

Top 3 for the subjective track:
1st - Pokemon Trainer - by @iyajainfinity & @AlexReibman
2nd - VR-CLImax by @JakeABoggs
3rd - DynastAI by David van Vliet and @SRacoon23

Top 3 for the objective track:
1st - CyberMaxxing by @1999_karthik
2nd - HelpfulDoctors by @tsadpbb, Nilesh Shah, Max Phelps, and Alexander Speicher
3rd - Physical RL by @nullref0 and @venkatacrc

Another special shout out to our partners, @xai, @MistralAI, @nvidia, @tensorstax, @akashnet, @nebiusai, @runpod, @daytonaio, @morph_labs, @LambdaAPI and @Tesla.

As well as our many judges from @arcee_ai, @axolotl_ai, @cursor_ai, @latentspacepod, @MIT, @togethercompute, @haizelabs, @SophontAI, @EdgeAGI, @Google, specifically: @AlpayAriyak, @winglian, Samuel Barry, @tmm1, @keirp1, @swyx, @teknium, @karan4d, Meghana Puvvadi, @arattml, @brianlechthaler, Josh May, Alex Gu, @gordic_aleksa, @AlpayAriyak, @eraqian, @LukePiette, Rohan Rao, @chargoddard, @LoganGrasby, @xennygrimmato_, @zhangir_azerbay, @rogershijin, @max_paperclips, @theemozilla, and Abhinav Balasubramanian.
23 replies · 42 reposts · 396 likes · 64.8K views

Alex Speicher reposted
Isaac Liao @LiaoIsaac91893:
Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4
36 replies · 181 reposts · 1.3K likes · 225.9K views

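A minimal sketch of the inference-time-only setup described above: no pretraining and no external dataset, just gradient descent on the demonstration pairs of a single ARC task before predicting its test output. The model interface, loss, and hyperparameters here are illustrative assumptions, not the method's actual implementation.

```python
import torch
import torch.nn as nn

def fit_single_puzzle(model: nn.Module,
                      train_pairs: list[tuple[torch.Tensor, torch.Tensor]],
                      test_input: torch.Tensor,
                      steps: int = 2000, lr: float = 1e-3) -> torch.Tensor:
    """Fit the model's weights on one task's demonstration pairs, then
    predict the held-out test output.

    model(x) is assumed to return per-cell color logits of shape
    (batch, n_colors, H, W); targets y have shape (batch, H, W).
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(loss_fn(model(x), y) for x, y in train_pairs)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(test_input).argmax(dim=1)   # predicted grid of color ids
```
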