Steven Walton
@WaltonStevenj

Ph.D. from University of Oregon | Visiting Scholar @ Georgia Tech | Studying Computer Vision | SHI Lab | 🦋 https://t.co/jW5YkexBMX

Eugene, Oregon · Joined March 2021
623 Following · 365 Followers · 2K posts
Steven Walton@WaltonStevenj·
Now I wish it was "mass surveillance" rather than "mass *domestic* surveillance" but it's clear that @AnthropicAI is aligning to /their/ *moralities* while @OpenAI is aligning to *legality*. These are not the same and the latter is blatantly a lower guardrail. It's insulting
Steven Walton@WaltonStevenj·
This is a farce. To claim "more guardrails" is frankly unbelievable. Why would @AnthropicAI be called a supply chain risk while @OpenAI gets a contract with stricter conditions? If the guidelines are stricter, then what's the claim? That the DOD is killing your competition for you?
OpenAI@OpenAI

Yesterday we reached an agreement with the Department of War for deploying advanced AI systems in classified environments, which we requested they make available to all AI companies. We think our deployment has more guardrails than any previous agreement for classified AI deployments, including Anthropic's. Here's why: openai.com/index/our-agre…

Steven Walton@WaltonStevenj·
@diyerxx @sarahookr Errors like these don't invalidate the utility of benchmarks, but they do limit how useful they are for evaluation. My point is, you can't evaluate simply by looking at the numerical result. Analysis is the hard part, not the easy part.
Lei Yang@diyerxx·
Got burned by an Apple ICLR paper — it was withdrawn after my Public Comment. So here's what happened.

Earlier this month, a colleague shared an Apple paper on arXiv with me — it was also under review for ICLR 2026. The benchmark they proposed was perfectly aligned with a project we're working on. I got excited after reading it. I immediately stopped my current tasks and started adapting our model to their benchmark. Pulled a whole weekend crunch session to finish the integration… only to find our model scoring absurdly low.

I was really frustrated. I spent days debugging, checking everything — maybe I used it wrong, maybe there was a hidden bug. During this process, I actually found a critical bug in their official code:

* When querying the VLM, it only passed in the image path string, not the image content itself.

The most ridiculous part? After I fixed their bug, the model's scores got even lower! The results were so counterintuitive that I felt forced to do deeper validation. After multiple checks, the conclusion held: fixing the bug actually made the scores worse.

At this point I decided to manually inspect the data. I sampled the first 20 questions our model got wrong, and I was shocked:

* 6 out of 20 had clear GT errors.
* The pattern suggested the "ground truth" was model-generated with extremely poor quality control, leading to tons of hallucinations.
* Based on this quick sample, the GT error rate could be as high as 30%.

I reported the data quality issue in a GitHub issue. After 6 days, the authors replied briefly and then immediately closed the issue. That annoyed me — I'd already wasted a ton of time, and I didn't want others in the community to fall into the same trap — so I pushed back. Only then did they reopen the GitHub issue.

Then I went back and checked the examples displayed in the paper itself. Even there, I found at least three clear GT errors. It's hard to believe the authors were unaware of how bad the dataset quality was, especially when the paper claims all samples were reviewed by annotators. Yet even the examples printed in the paper contain blatant hallucinations and mistakes.

When the ICLR reviews came out, I checked the five reviews for this paper. Not a single reviewer noticed the GT quality issues or the hallucinations in the paper's examples. So I started preparing a more detailed GT error analysis and wrote a Public Comment on OpenReview to inform the reviewers and the community about the data quality problems.

The next day — the authors withdrew the paper and took down the GitHub repo.

Fortunately, ICLR is an open conference with Public Comment. If this had been a closed-review venue, this kind of shoddy work would have been much harder to expose. So here's a small call to the community: for any paper involving model-assisted dataset construction, reviewers should spend a few minutes checking a few samples manually. We need to prevent irresponsible work from slipping through and misleading everyone.

Looking back, I should have suspected the dataset earlier based on two red flags:

* The paper's experiments claimed that GPT-5 had been surpassed by a bunch of small open-source models.
* The original code, with a ridiculous bug, produced higher scores than the bug-fixed version.

But because it was a paper from Big Tech, I subconsciously trusted its integrity and quality, which prevented me from spotting the problem sooner.

This whole experience drained a lot of my time, energy, and emotion — especially because accusing others of bad data requires extra caution. I'm sharing this in hopes that the ML community remains vigilant and pushes back against this kind of sloppy, low-quality, and irresponsible behavior before it misleads people and wastes collective effort. #ICLR #ICLR2026 #NeurIPS #CVPR #openreview #MachineLearning #LLM #VLM
[image attached]
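The bug class described above (sending an image *path* to a VLM instead of the image *content*) is easy to picture in code. Here is a minimal sketch assuming an OpenAI-style chat-completions payload; the function names and message shapes are illustrative, not taken from the paper's actual repository.

```python
# Hypothetical sketch of the bug described above: the benchmark harness sends
# the image path *string* to the VLM instead of the encoded image itself.
import base64

def encode_image(path: str) -> str:
    """Read an image file and base64-encode it for a vision-language API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_message_buggy(question: str, image_path: str) -> list[dict]:
    # BUG: the model only ever sees the literal string "data/img_0042.png",
    # so it answers from the question text alone -- no vision involved.
    return [{"role": "user", "content": f"{question}\n{image_path}"}]

def build_message_fixed(question: str, image_path: str) -> list[dict]:
    # FIX: embed the actual image content alongside the question.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
        ],
    }]
```

Note how this bug interacts with the red flags in the thread: if scores *drop* once the model can actually see the images, the ground truth is the natural suspect.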
Steven Walton retweeted
Jürgen Schmidhuber@SchmidhuberAI·
In 2025, the DeepSeek "Sputnik" shocked the world, wiping out a trillion $ from the stock market. DeepSeek [7] distills knowledge from one neural network (NN) into another. Who invented this? people.idsia.ch/~juergen/who-i…

NN distillation was published in 1991 by yours truly [0]. Section 4 on a "conscious" chunker NN and a "subconscious" automatiser NN [0][1] introduced a general principle for transferring the knowledge of one NN to another. Suppose a teacher NN has learned to predict (conditional expectations of) data, given other data. Its knowledge can be compressed into a student NN, by training the student NN to imitate the behavior of the teacher NN (while also re-training the student NN on previously learned skills such that it does not forget them).

In 1991, this was called "collapsing" or "compressing" the behavior of one NN into another. Today, this is widely used, and also referred to as "distilling" [2][6] or "cloning" the behavior of a teacher NN into that of a student NN. It even works when the NNs are recurrent and operate on different time scales [0][1]. See also [3][4].

REFERENCES (more in Technical Note IDSIA-12-25 [5])

[0] J. Schmidhuber. Neural sequence chunkers. Tech Report FKI-148-91, TU Munich, April 1991.
[1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on [0].
[2] O. Vinyals, J. A. Dean, G. E. Hinton. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531 [stat.ML], 2015. The authors did not cite the original 1991 NN distillation procedure [0][1][DLP], not even in their later patent application.
[3] J. Ba, R. Caruana. Do Deep Nets Really Need to be Deep? NIPS 2014. Preprint arXiv:1312.6184 (2013).
[4] C. Bucilua, R. Caruana, A. Niculescu-Mizil. Model compression. SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[5] J. Schmidhuber. Who invented knowledge distillation with artificial neural networks? Technical Note IDSIA-12-25, IDSIA, Nov 2025.
[6] How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23, 2023.
[7] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint arXiv:2501.12948, 2025.
[image attached]
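For readers unfamiliar with the mechanics being debated: below is a minimal PyTorch sketch of distillation in its modern softened-logit form (following the 2015 formulation in [2]), not the 1991 chunker/automatiser procedure in [0]. The network sizes and training loop are purely illustrative.

```python
# Minimal knowledge-distillation sketch: train a student network to imitate
# a teacher network's output distribution on (possibly unlabeled) inputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

teacher = torch.nn.Linear(32, 10)   # stand-in for a large pretrained net
student = torch.nn.Linear(32, 10)   # smaller net learning to imitate it
opt = torch.optim.SGD(student.parameters(), lr=0.1)

for _ in range(100):
    x = torch.randn(64, 32)          # inputs only; no ground-truth labels needed
    with torch.no_grad():
        t_logits = teacher(x)        # the teacher's "knowledge"
    loss = distillation_loss(student(x), t_logits)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The priority dispute above is about who first proposed this imitate-the-teacher principle, not about this recipe's correctness; the loop itself is uncontroversial.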
Steven Walton@WaltonStevenj·
Btw, there are more mistakes. Can you find them? There are some more obvious ones (at least one "physically impossible" one) and a few that are far more subtle. Regardless, Nano Banana is very impressive. Mistakes like these are hard to catch, so keep an attentive eye out.
Steven Walton@WaltonStevenj·
It is also a good example of why researchers are arguing about "understanding". Does it understand?
- It got the right answer, so it must!
- The steps were wrong, so it actually doesn't!
Or maybe something more complex is going on... We still don't know.
Steven Walton@WaltonStevenj·
Very impressive, but also wrong. There are two mistakes in the first physics problem on the LHS. It did the derivative wrong, but fixed that mistake with another mistake! (Better explanation in the alt text.) ChatGPT looked at the final result, not the work.
[two images attached]
Andrej Karpathy@karpathy

Gemini Nano Banana Pro can solve exam questions *in* the exam page image. With doodles, diagrams, all that. ChatGPT thinks these solutions are all correct except Se_2P_2 should be "diselenium diphosphide" and a spelling mistake (should be "thiocyanic acid" not "thoicyanic") :O

Steven Walton@WaltonStevenj·
@simonw @mitchellh @doodlestein That's absolutely mind-boggling. I mean, I can `vimdiff` or `git diff` thousands of lines on a machine with outdated hardware without breaking a sweat. Something went terribly wrong, and these "solutions" look like patches kicking the can down the road. Talk about tech debt... (A rough illustration of how cheap diffing is follows below.)
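To put a number on the diff-cost claim: even pure-Python difflib chews through thousands of lines in milliseconds, and git's C implementation is far faster still. This is an illustrative sketch, not a benchmark of GitHub's actual rendering pipeline; timings will vary by machine.

```python
# Sanity check: diffing a few thousand lines is cheap, even in pure Python.
import difflib
import time

old = [f"line {i}" for i in range(5000)]
new = old.copy()
new[1234] = "line 1234 -- edited"     # one edit among 5000 lines
new.insert(4000, "an inserted line")  # plus one insertion

start = time.perf_counter()
diff = list(difflib.unified_diff(old, new, lineterm=""))
elapsed = time.perf_counter() - start

print(f"{len(diff)} diff lines computed in {elapsed * 1000:.1f} ms")
```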
Mitchell Hashimoto@mitchellh·
GitHub feels like a product that isn't used by the people that work there. That can't POSSIBLY be true, I know. I just hit issues everyday that really make me wonder... how can this bug exist? More likely engineers aren't empowered to fix things and are bogged down by red tape.
Steven Walton@WaltonStevenj·
@doodlestein @jwkicklighter @mitchellh The problem @jwkicklighter is pointing out is that this doesn't just "get work done." This type of solution adds more complexity, and that gets compounded again and again. Your attempts to make things simple have only increased complexity, not decreased it.
Jeffrey Emanuel@doodlestein·
@jwkicklighter @mitchellh Ok whatever, you can be an idealist on your machine, other people need to get work done. It’s a good suggestion if he actually wants to avoid wasting his time on a broken site he doesn’t control.
Steven Walton@WaltonStevenj·
@keenanisalive Sure, you can make accurate predictions without causal models or consistency, but those predictions will always be brittle. I just don't see how a model can generalize without causality and consistency.
Steven Walton@WaltonStevenj·
@keenanisalive What does "accurate physics" mean? Which physics? I feel "world model" often gets used in a weird way, as if there is only one world and one physics. Building causal relationships and consistency seems more important than which world is actually being modeled.
Keenan Crane@keenanisalive·
I don't have a strong opinion about whether video models “understand the world.” But I do think the first bar should be checking whether you can recover consistent geometry from video—not whether it makes accurate predictions of physics. (“Accurate physics” is not even well-posed unless the geometry defining the physical experiment is well-defined.)
[image attached]