Jan Disselhoff
@JDisselh
Deep Learning Scientist | The ARChitects Kaggle Team | ARC-AGI 2024 Winner
77 posts
Joined November 2025
45 Following · 105 Followers
Pinned Tweet
Jan Disselhoff@JDisselh·
ARC Prize 2025 is over, an amazing contest, with amazing people competing. This year our team "the ARChitects" managed to reach second place. We tried a lot of things, some thoughts and explanation of our approach below!
Jan Disselhoff tweet media
Jan Disselhoff retweeted
0.005 Seconds (3/694)@seconds_0·
There's an entire parallel scientific corpus most western researchers never see. Today I'm launching chinarxiv.org, a fully automated translation pipeline for all Chinese preprints, figures included, to make it available.
[images]
Jan Disselhoff@JDisselh·
@yacinelearning It was previously known as "ARC without pretraining", which is probably the name it's better known under. I think it's fantastic because it provides a good lower bound on the information needed to solve ARC-AGI problems.
Yacine Mahdid@yacinelearning·
I think I might need to do a deep dive on compressARC because a lot of people don’t know about this weird but fascinating method to solve these puzzles
[image]
Jan Disselhoff@JDisselh·
@MaleManlpulator @eshear Can be both! If you draw k random numbers uniformly from the set {1,...,n} with unknown n, the maximum of the k numbers is both a lower bound on n and the maximum likelihood estimator for n.
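A quick sketch of the estimator Jan describes (the draws below are hypothetical numbers, not from the thread): the likelihood of n given k uniform draws from {1,...,n} is (1/n)^k for any n ≥ max(draws) and 0 otherwise, so it is maximized exactly at n = max(draws), even though that same maximum can only ever undershoot the true n.

```python
# MLE for n in the "draw k numbers uniformly from {1,...,n}" setup.
# Likelihood L(n) = (1/n)^k if n >= max(draws), else 0 -> maximized at max(draws).

def likelihood(n, draws):
    """P(observing these draws | the true maximum is n)."""
    if n < max(draws):
        return 0.0          # impossible: a draw exceeded n
    return (1.0 / n) ** len(draws)  # decreasing in n, so smallest feasible n wins

draws = [4, 17, 9]          # hypothetical sample, true n unknown
candidates = range(max(draws), 101)
mle = max(candidates, key=lambda n: likelihood(n, draws))

assert mle == max(draws)                      # the sample maximum is the MLE...
assert all(mle <= n for n in candidates)      # ...and a lower bound on any feasible n
```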
male manipulator (reformed)@MaleManlpulator·
lol what. Kolmogorov complexity is not “relativity + Shannon information” You were ostensibly a programmer, it’s one of the simpler concepts you mentioned, you should know this.
Emmett Shear@eshear

@JPobserver Sure, but it’s really just relativity + Shannon information. I guess I should have information theory in the list but I ran out of room.

Jan Disselhoff@JDisselh·
@suchenzang Without labels? Why would it? How is that different from not putting batchnorm in eval mode for example? Technically that also changes weights based on test set inputs, but I think it would be silly to argue that it breaks test-set.
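Jan's batchnorm analogy can be made concrete with a toy running-statistics sketch (pure Python, hypothetical numbers; real batch norm also tracks variance): a train-mode forward pass updates the layer's running mean from the inputs alone, so test inputs change model state without any labels being involved.

```python
# Minimal batch-norm running-mean sketch: train mode updates state from inputs only.
class ToyBatchNorm:
    def __init__(self, momentum=0.1):
        self.running_mean = 0.0   # starts at 0, like a freshly initialized layer
        self.momentum = momentum
        self.training = True      # "forgot to call .eval()"

    def forward(self, batch):
        if self.training:
            batch_mean = sum(batch) / len(batch)
            # State update driven purely by inputs -- labels never appear.
            self.running_mean = ((1 - self.momentum) * self.running_mean
                                 + self.momentum * batch_mean)
            mean = batch_mean          # train mode normalizes by batch stats
        else:
            mean = self.running_mean   # eval mode uses the stored running stats
        return [x - mean for x in batch]

bn = ToyBatchNorm()
test_inputs = [4.0, 5.0, 6.0]   # unlabeled "test set" inputs
bn.forward(test_inputs)         # a forward pass alone shifts the running mean
assert bn.running_mean == 0.5   # 0.9 * 0.0 + 0.1 * 5.0
```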
Susan Zhang@suchenzang·
@JDisselh if your image model then trains on the test set images (just the inputs) before testing on them, then yes, that breaks test-set
Susan Zhang@suchenzang·
we need more AI people to join community notes... kind of crazy how many amplified a plot that went into negative cost territory, with a thread about training on test
[image]
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@suchenzang It is allowed though. In ARC-AGI you are provided with all unsolved challenges at once, and can do whatever you want with them. I don't see how this would break the "test set" at all. If I built an image model that always classifies two images at once, would that break test-set?
Susan Zhang@suchenzang·
more precisely: during test time, for TTT methods, using the (one!) test input is allowed. However, it is not allowed to use _all test inputs together_ to update a model before evaluation on a single test occurs (that would break the design of what a "learning solution" and a "test set" mean). Given how much confusion this has raised between established researchers (on the spirit of TTT and on the ARC guidelines/submission process), it's completely understandable that anyone else would be confused too.
Susan Zhang@suchenzang

yes, that makes sense and is much more aligned with what I understood as acceptable for TTT methods. So what's happening in the OG work (aka mdlARC) is twofold: 1) designing a method that intentionally trains "from scratch" each time, but then amortizing that cost across all tasks (breaking the from-scratch-training design); 2) proceeding to also train on _all eval set inputs_, not just a single one at test time. 2) was the bigger red flag for me, since it implies a model weight update using _all test inputs together_, which seems like it would also violate the ARC submission process.

Jan Disselhoff@JDisselh·
@navbenny Going out on a limb, I expect this approach to get stuck at a maximum of ~30-35 points on public eval at best. Which is nice for the time it takes, but sadly would be considered a failure in the competition track.
Jan Disselhoff@JDisselh·
@navbenny This and the small size are advantages for experimenting, but disadvantages in the competition, where speed does not matter as long as you are under 12h. Submissions and approaches optimize for score before all else.
Jan Disselhoff@JDisselh·
@navbenny @evilmathkid The cost is very low, but it is also hard stuck at that performance. Afaik it is still the non-deep-learning state of the art, even though it is over 3 years old. And since a lot of approaches have better performance by now, it has fallen by the wayside a bit.
Naveen Benny@navbenny·
@JDisselh @evilmathkid That makes sense, thanks. If CPU-based program synth does as well, wouldn't its cost be much lower?
Naveen Benny@navbenny·
Fascinating work from @evilmathkid. A few thoughts:
- I still can't get over the fact that you can get double-digit performance on a test for 'AGI' simply by bootstrapping on narrow ARC data. It's another reminder of how hard it is to actually measure general intelligence. How general are the LLMs really? We have many impressive PhD-level benchmark numbers, yet somehow these LLMs don't translate well to the real world.
- This is essentially unsupervised learning on the test inputs. You can call it leakage, but it can be considered weight updates at test time, which is totally fine. Of course this has to be implemented during test time and factored into the cost per task, and that would be the real cost per task.
- You can then run it on the private dataset and make it a fair comparison on the graph.
- Run an ablation to understand the lift from unsupervised test-input training vs. just train data. That would be the best way to validate the hypothesis of using test inputs.
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@sd_marlow That is true, I think the chart as shown is irresponsible from a scientific perspective. Not only because it compares public eval with hidden eval, but also because it is missing basically all open-source approaches that work at the low-cost frontier.
Steven Marlow@sd_marlow·
@JDisselh *it's not a formal validation, so just plotting it on the chart is misleading.
Jan Disselhoff@JDisselh·
@sd_marlow 1) Afaict he is not training on ARC 2 public, which should be obvious as it is in the second cell of the notebook. 2) He is not training on the *solutions* which is completely fine in ARC-AGI. Literally every SOTA approach in the competition trains on the test *input examples*.
[image]
Steven Marlow@sd_marlow·
@JDisselh You train on the public test set, not the public eval set. Chollet has even said that. ARC 2 public test contains ARC 1 eval.
Mithil Vakde@evilmathkid·
Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:
[image]
Jan Disselhoff@JDisselh·
@evilmathkid @yacinelearning They claim 20 minutes per task on a 4070 which means a cost of around ~2ct per task, if you were to rent the GPU. (Around 7ct per hour for a 4070).
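The ~2 ct figure follows directly from the numbers Jan quotes (the ~7 ct/hour rental rate is the thread's estimate, not a verified market price):

```python
# Back-of-the-envelope cost check: 20 minutes per task on a rented 4070.
gpu_rate_ct_per_hour = 7        # ~7 ct/hour for a 4070 (thread's figure)
minutes_per_task = 20

cost_ct_per_task = gpu_rate_ct_per_hour * minutes_per_task / 60
assert round(cost_ct_per_task, 1) == 2.3   # ≈ 2 ct per task, as claimed
```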
Mithil Vakde@evilmathkid·
@yacinelearning 4) Cost seems to be much cheaper: CompressARC takes 86 hrs on a 4070 for 20%; this needs 2 hrs on an A100 for 27%.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt And they are an approach from 2024! We should expect costs to drop. The issue here is not their training procedure, which seems fine, but that they seem to ignore most of the other works on the "pareto frontier", or misunderstand their approaches.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt ? No it is not? How is that even connected? Also the costs are impressive but not unprecedented: ARC-AGI without pretraining has similar performance and cost. They get 34%/20% on public train/eval using a 4070 and 20 min per task, which is around 2 ct per task. (iliao2345.github.io/blog_posts/arc…)
Sumuk@sumukx·
imo this is wrong. I'm sure @GregKamradt can validate this. @suchenzang is correct: the point of a test set is testing the model's performance on potentially OOD tasks. Putting the test input distribution in train is not a "breakthrough".
Mithil Vakde@evilmathkid

Announcing New Pareto Frontier on ARC-AGI
27.5% for just $2
333x cheaper than TRM!
Beats every non-thinking LLM in existence
Cost so low, it's literally off the chart
Vanilla transformer. No special architectures. Tiny. Trained in 2 hrs. Open source.
Thread:

Jan Disselhoff@JDisselh·
@navbenny @evilmathkid I.e. any system that claims to be generally intelligent needs to be able to solve ARC-AGI, but a system that solves ARC-AGI is not necessarily generally intelligent.
Jan Disselhoff@JDisselh·
@navbenny @evilmathkid You can get double-digit performance using program search purely on CPU, i.e. without training anything. It was SoTA before the 2024 competition, see here: github.com/victorvikram/A… They got 21% on the hidden test set. ARC-AGI is a negative claim on AGI, not a positive one.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt ARC is testing generalization to new instances using a minimal number of examples. Try some of the tasks yourself! Even humans tend to use the challenge input to understand what is required. There are critiques to be had here, but this ain't it.
Jan Disselhoff@JDisselh·
@sumukx @GregKamradt That is fine though. This is basically TTT. In ARC-AGI everything is allowed except for using the answer. This was also true in each contest, and for every model there. We even tried training on inputs several times, but it tends to not make a difference in larger models.