Jan Disselhoff
@JDisselh
Deep Learning Scientist | The ARChitects Kaggle Team | ARC-AGI 2024 Winner
77 posts
Joined November 2025
45 Following · 105 Followers
Pinned Tweet
Jan Disselhoff@JDisselh·
ARC Prize 2025 is over, an amazing contest, with amazing people competing. This year our team "the ARChitects" managed to reach second place. We tried a lot of things, some thoughts and explanation of our approach below!
Jan Disselhoff tweet media
Jan Disselhoff retweeted
0.005 Seconds (3/694)@seconds_0·
There's an entire parallel scientific corpus most western researchers never see. Today I'm launching chinarxiv.org, a fully automated translation pipeline for all Chinese preprints, figures included, to make it available.
[images]
Jan Disselhoff@JDisselh·
@yacinelearning It was previously known as "ARC without pretraining", which is probably the name it's better known under. I think it's fantastic because it provides a good lower bound on the information needed to solve ARC-AGI problems.
Yacine Mahdid@yacinelearning·
I think I might need to do a deep dive on compressARC because a lot of people don’t know about this weird but fascinating method to solve these puzzles
[image]
Jan Disselhoff@JDisselh·
@MaleManlpulator @eshear Can be both! If you draw k random numbers uniformly from the set {1,...,n} with unknown n, the maximum of the k numbers is both a lower bound on n and the maximum likelihood estimator for n.
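A quick sketch of the estimator Jan describes (the draws below are hypothetical numbers, not from the thread): the likelihood of n given k uniform draws from {1,...,n} is (1/n)^k for any n ≥ max(draws) and 0 otherwise, so it is maximized exactly at n = max(draws), even though that same maximum can only ever undershoot the true n.

```python
# MLE for n in the "draw k numbers uniformly from {1,...,n}" setup.
# Likelihood L(n) = (1/n)^k if n >= max(draws), else 0 -> maximized at max(draws).

def likelihood(n, draws):
    """P(observing these draws | the true maximum is n)."""
    if n < max(draws):
        return 0.0          # impossible: a draw exceeded n
    return (1.0 / n) ** len(draws)  # decreasing in n, so smallest feasible n wins

draws = [4, 17, 9]          # hypothetical sample, true n unknown
candidates = range(max(draws), 101)
mle = max(candidates, key=lambda n: likelihood(n, draws))

assert mle == max(draws)                      # the sample maximum is the MLE...
assert all(mle <= n for n in candidates)      # ...and a lower bound on any feasible n
```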
male manipulator (reformed)@MaleManlpulator·
lol what. Kolmogorov complexity is not “relativity + Shannon information” You were ostensibly a programmer, it’s one of the simpler concepts you mentioned, you should know this.
Emmett Shear@eshear

@JPobserver Sure, but it’s really just relativity + Shannon information. I guess I should have information theory in the list but I ran out of room.

Jan Disselhoff@JDisselh·
@suchenzang Without labels? Why would it? How is that different from not putting batchnorm in eval mode for example? Technically that also changes weights based on test set inputs, but I think it would be silly to argue that it breaks test-set.
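Jan's batchnorm analogy can be made concrete with a toy running-statistics sketch (pure Python, hypothetical numbers; real batch norm also tracks variance): a train-mode forward pass updates the layer's running mean from the inputs alone, so test inputs change model state without any labels being involved.

```python
# Minimal batch-norm running-mean sketch: train mode updates state from inputs only.
class ToyBatchNorm:
    def __init__(self, momentum=0.1):
        self.running_mean = 0.0   # starts at 0, like a freshly initialized layer
        self.momentum = momentum
        self.training = True      # "forgot to call .eval()"

    def forward(self, batch):
        if self.training:
            batch_mean = sum(batch) / len(batch)
            # State update driven purely by inputs -- labels never appear.
            self.running_mean = ((1 - self.momentum) * self.running_mean
                                 + self.momentum * batch_mean)
            mean = batch_mean          # train mode normalizes by batch stats
        else:
            mean = self.running_mean   # eval mode uses the stored running stats
        return [x - mean for x in batch]

bn = ToyBatchNorm()
test_inputs = [4.0, 5.0, 6.0]   # unlabeled "test set" inputs
bn.forward(test_inputs)         # a forward pass alone shifts the running mean
assert bn.running_mean == 0.5   # 0.9 * 0.0 + 0.1 * 5.0
```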
Susan Zhang@suchenzang·
@JDisselh if your image model then trains on the test set images (just the inputs) before testing on them, then yes, that breaks test-set
Susan Zhang@suchenzang·
we need more AI people to join community notes... kind of crazy how many amplified a plot that went into negative cost territory, with a thread about training on test
[image]
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@suchenzang It is allowed though. In ARC-AGI you are provided with all unsolved challenges at once, and can do whatever you want with them. I don't see how this would break the "test set" at all. If I built an image model that always classifies two images at once, would that break test-set?
Susan Zhang@suchenzang·
more precisely: during test time, for TTT methods, using the (one!) test input is allowed. However, it is not allowed to use _all test inputs together_ to update a model before evaluation on a single test occurs (that would break the design of what a "learning solution" and a "test set" mean). Given how much confusion this has raised between established researchers (on the spirit of TTT and on the ARC guidelines/submission process), it's completely understandable that anyone else would be confused too.
Susan Zhang@suchenzang

yes, that makes sense and is much more aligned with what I understood as acceptable for TTT methods. So what's happening in the OG work (aka mdlARC) is twofold: 1) designing a method that intentionally trains "from scratch" each time, but then amortizing that cost across all tasks (breaking the from-scratch-training design); 2) proceeding to also train on _all eval set inputs_, not just a single one at test time. 2) was the bigger red flag for me, since it implies a model weight update using _all test inputs together_, which seems like it would also violate the ARC submission process.

Jan Disselhoff@JDisselh·
@navbenny Going out on a limb, I expect this approach to get stuck at a maximum of ~30-35 points on public eval at best. Which is nice for the time it takes, but sadly would be considered a failure in the competition track.
Jan Disselhoff@JDisselh·
@navbenny This and the small size are advantages for experimenting, but disadvantages in the competition, where speed does not matter as long as you are under 12h. Submissions and approaches optimize for score before all else.
Jan Disselhoff@JDisselh·
@navbenny @evilmathkid The cost is very low, but it is also hard stuck at that performance. Afaik it is still the non-deep-learning state of the art, even though it is over 3 years old. And since a lot of approaches have better performance by now, it has fallen by the wayside a bit.
Naveen Benny@navbenny·
@JDisselh @evilmathkid That makes sense, thanks. If CPU-based program synth does as well, wouldn't its cost be much lower?
Naveen Benny@navbenny·
Fascinating work from @evilmathkid. A few thoughts:
- I still can't get over the fact that you can get double-digit performance on a test for 'AGI' simply by bootstrapping on narrow ARC data. It's another reminder of how hard it is to actually measure general intelligence. How general are the LLMs really? We have many impressive PhD-level benchmark numbers, yet somehow these LLMs don't translate well to the real world.
- This is essentially unsupervised learning on the test inputs. You can call it leakage, but it can be considered weight updates at test time, which is totally fine. Of course this has to be implemented during test time and factored into the cost per task, and that would be the real cost per task.
- You can then run it on the private dataset and make it a fair comparison on the graph.
- Run an ablation to understand the lift from unsupervised test-input training vs. just train data. That would be the best way to validate the hypothesis of using test inputs.
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@sd_marlow That is true, I think the chart as shown is irresponsible from a scientific perspective. Not only because it compares public eval with hidden eval, but also because it is missing basically all open-source approaches that work at the low-cost frontier.
Steven Marlow@sd_marlow·
@JDisselh *it's not a formal validation, so just plotting it on the chart is misleading.
Jan Disselhoff@JDisselh·
@sd_marlow 1) Afaict he is not training on ARC 2 public, which should be obvious as it is in the second cell of the notebook. 2) He is not training on the *solutions* which is completely fine in ARC-AGI. Literally every SOTA approach in the competition trains on the test *input examples*.
[image]
Steven Marlow@sd_marlow·
@JDisselh You train on the public test set, not the public eval set. Chollet has even said that. ARC 2 public test contains ARC 1 eval.
Mithil Vakde@evilmathkid·
Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:
[image]
Jan Disselhoff@JDisselh·
@evilmathkid @yacinelearning They claim 20 minutes per task on a 4070 which means a cost of around ~2ct per task, if you were to rent the GPU. (Around 7ct per hour for a 4070).
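The ~2 ct figure follows directly from the numbers Jan quotes (the ~7 ct/hour rental rate is the thread's estimate, not a verified market price):

```python
# Back-of-the-envelope cost check: 20 minutes per task on a rented 4070.
gpu_rate_ct_per_hour = 7        # ~7 ct/hour for a 4070 (thread's figure)
minutes_per_task = 20

cost_ct_per_task = gpu_rate_ct_per_hour * minutes_per_task / 60
assert round(cost_ct_per_task, 1) == 2.3   # ≈ 2 ct per task, as claimed
```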
Mithil Vakde@evilmathkid·
@yacinelearning 4) Cost seems to be much cheaper: CompressARC takes 86 hrs on a 4070 for 20%; this needs 2 hrs on an A100 for 27%.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt And they are an approach from 2024! We should expect costs to drop. The issue here is not their training procedure, which seems fine, but that they seem to ignore most of the other works on the "pareto frontier", or misunderstand their approaches.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt ? No it is not? How is that even connected? Also the costs are impressive but not unprecedented: ARC-AGI without pretraining has similar performance and cost. They get 34%/20% on public train/eval using a 4070 and 20 min per task, which is around 2 ct per task. (iliao2345.github.io/blog_posts/arc…)
Sumuk@sumukx·
imo this is wrong. I'm sure @GregKamradt can validate this. @suchenzang is correct: the point of a test set is testing the model's performance on potentially OOD tasks. Putting the test input distribution in train is not a "breakthrough".
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@navbenny @evilmathkid I.e. any system that claims to be generally intelligent needs to be able to solve ARC-AGI, but a system that solves ARC-AGI is not necessarily generally intelligent.
Jan Disselhoff@JDisselh·
@navbenny @evilmathkid You can get double-digit performance using program search purely on CPU, i.e. without training anything. It was SoTA before the 2024 competition, see here: github.com/victorvikram/A… They got 21% on the hidden test set. ARC-AGI is a negative claim on AGI, not a positive one.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt ARC is testing generalization to new instances using a minimal number of examples. Try some of the tasks yourself! Even humans tend to use the challenge input to understand what is required. There are critiques to be had here, but this ain't it.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt That is fine though. This is basically TTT. In ARC-AGI everything is allowed except for using the answer. This was also true in each contest, and for every model there. We even tried training on inputs several times, but it tends to not make a difference in larger models.