Jonathan Balloch

2.4K posts

@JonathanBalloch

I mostly tweet about #ai, #robots, #science, @packers... Senior SWE at @Anduril | Ph.D. Robotics @GeorgiaTech | M.S. Robotics @Penn | Thoughts/opinions are mine

Atlanta, GA · Joined October 2012
1.1K Following · 382 Followers
Trae Stephens @traestephens
One of the best Bible passages: "So Peter and the other disciple [John, the author] started for the tomb. Both were running, but the other disciple outran Peter and reached the tomb first." John 20:3-4 Translated: "I cooked you and I want the world to remember that forever."
[image]
11 replies · 5 reposts · 105 likes · 5.8K views
Jonathan Balloch retweeted
Nathan Lambert @natolambert
This looks like a model that's competitive with GPT OSS 120B or similar Qwen3.5 models on intelligence & speed, while coming with tons of open data + training details. Is a huge contribution for the ecosystem. Congrats Nvidia on the Nemotron 3 Super release!
Bryan Catanzaro @ctnzr

Announcing NVIDIA Nemotron 3 Super!
💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights
Models, Tech report, etc. here: research.nvidia.com/labs/nemotron/…
And yes, Ultra is coming!

9 replies · 39 reposts · 480 likes · 44.6K views
Jonathan Balloch retweeted
anton @abacaj
“Make the models cheap to use” “Great, they all forgot how to code” “Now 10x the price”
[image]
231 replies · 1.6K reposts · 27.5K likes · 671.2K views
Jonathan Balloch @JonathanBalloch
@HasanFaisall @natolambert @burkov OP is saying it's useless; Nathan is just saying it's not. Also, qwen 3.5-9b matched it, not really beat it, and it is a dope model; more a testament to Qwen than a knock against gpt-oss
0 replies · 0 reposts · 0 likes · 29 views
BURKOV @burkov
GPT-OSS-120B is a useless model. Nothing I tried to use it for worked as you would expect from a model of this size. It doesn't respect the constraints, performs poorly when the task description precedes the text the task is supposed to apply to, it stops the generation at random moments without finishing the sentence, and it generates repetitive expressions that don't ever stop. All these are properties of 7B-parameter models of late 2023. IMO, altman released this model to avoid being accused of lying after he made a drunk promise to release a competitive open-weight model.
41 replies · 3 reposts · 145 likes · 51.8K views
Jonathan Balloch @JonathanBalloch
@satyajitdas90 @natolambert @burkov Not anymore, but not far from it. Nemotron 3 Super for long context, qwen3.5:122b for general VLM reasoning, qwen-next-coder (80B) for coding, but it all depends on what you want. For example, the dedicated OCR models (almost all much smaller than any of the above) are better for OCR
1 reply · 0 reposts · 0 likes · 88 views
Nathan Lambert @natolambert
@burkov skill issue, lots of people LOVE this model ;)
7 replies · 2 reposts · 134 likes · 11.1K views
Jonathan Balloch @JonathanBalloch
@RelaxedPop @BrianRoemmele Def real. Always look for physics consistency: the sheet getting too flat, the turning of the sheet over perfectly while barely touching it. They are getting good though
0 replies · 0 reposts · 0 likes · 13 views
Charles Waters @RelaxedPop
@BrianRoemmele Do you think this video is real? I'm looking for AI artifacts, but it's fairly low res and I don't see any, but the decisions the robot is making, such as casually tossing the black piece of clothing aside, don't seem very AI-ish. I'm skeptical. We will get there though.
1 reply · 0 reposts · 1 like · 511 views
Brian Roemmele @BrianRoemmele
2025: “It will be a decade before Robots can do anything in the home” 2026: “Oh”
148 replies · 129 reposts · 1.3K likes · 127.3K views
Jonathan Balloch @JonathanBalloch
@sdflbb @BrianRoemmele Not even possible with teleop; AI-generated. Anyone who has ever tried to engineer a cloth-related task knows this is still far from possible. But we keep getting better every day!
0 replies · 0 reposts · 0 likes · 30 views
Jonathan Balloch @JonathanBalloch
@karpathy Small science nit: this does not seem to account for whether there is a relationship between the improvement methods. Only with independence testing and correlation analysis can you know whether one method undercut another. Build this into autoresearch for maximum results
0 replies · 0 reposts · 0 likes · 31 views
Jonathan Balloch retweeted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I have done daily for 2 decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[image]
968 replies · 2.1K reposts · 19.5K likes · 3.6M views
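The QKnorm item in Karpathy's list comes down to softmax temperature: with unit-normalized queries and keys and no multiplier, the attention logits are cosine similarities bounded by ±1, so the softmax stays nearly uniform. A toy numpy sketch of the effect (hypothetical shapes and values, not nanochat's actual code):

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=1.0):
    """Attention with unit-normalized queries/keys and an explicit
    sharpness multiplier. With scale=1.0 the logits live in [-1, 1] and
    the distribution is diffuse; a larger (in practice, learned) scale
    sharpens it. Toy sketch, not nanochat's implementation."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = scale * (q @ k.T)                    # cosine-similarity logits
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))  # 4 tokens, head dim 8
_, w_flat = qk_norm_attention(q, k, v, scale=1.0)
_, w_sharp = qk_norm_attention(q, k, v, scale=12.0)

# Attention entropy drops as the multiplier sharpens the distribution.
ent = lambda w: float(-(w * np.log(w + 1e-12)).sum(axis=-1).mean())
print(f"entropy at scale=1:  {ent(w_flat):.3f}")
print(f"entropy at scale=12: {ent(w_sharp):.3f}")
```

A learnable per-head scale (or the fixed multipliers the agent searched for) recovers the sharpness that the missing parameter leaves on the table.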
Jonathan Balloch @JonathanBalloch
@DanutPralea @adxtyahq $5B is GENEROUS. I have heard that compute alone is $8B, other spend is ~$9B, and that doesn't fully cover the data center build outs
0 replies · 0 reposts · 3 likes · 132 views
Dani Pralea @DanutPralea
@adxtyahq spending $5B/year to make a product that keeps getting beaten by a company a tenth their size is a bold financial strategy
1 reply · 0 reposts · 81 likes · 2.6K views
aditya @adxtyahq
Lowkey feels like OpenAI might be the first AI giant to go bankrupt.
128 replies · 64 reposts · 4.1K likes · 88.6K views
Jonathan Balloch @JonathanBalloch
@tunguz Eh, true but minimax 2.5 passed my smell test. Still currently free on open code
0 replies · 0 reposts · 0 likes · 72 views
kache @yacineMTB
if you know what this is, dm me i will hire you
[image]
615 replies · 10 reposts · 916 likes · 115.3K views
Jonathan Balloch @JonathanBalloch
@BoWang87 Do it for ROCm and other competitors that don't have the monopoly, and you've got yourself a billion-dollar asset
0 replies · 0 reposts · 0 likes · 117 views
Bo Wang @BoWang87
ByteDance just published something I've been waiting for someone to build: CUDA Agent! It trained a model that writes fast CUDA kernels. Not just correct ones — actually optimized ones. It beats torch.compile by 2× on simple/medium kernels, ~92% on complex ones, and even outperforms Claude Opus 4.5 and Gemini 3 Pro by ~40% on the hardest setting. The key idea is simple but kind of brilliant: CUDA performance isn’t about correctness, it’s about hardware. Warps, memory bandwidth, bank conflicts — the stuff you only see in a profiler. So instead of rewarding “did it compile?”, they reward actual GPU speed. Real profiling numbers. RL trained directly on performance. That’s a big shift. Paper: arxiv.org/abs/2602.24286 Project: cuda-agent.github.io
[2 images]
52 replies · 365 reposts · 2.7K likes · 181.2K views
kache @yacineMTB
i need radio engineers, aerospace engineers, electrical engineers, machine learning researchers. ideally all four at the same time. how the FUCK do i do that? that's like trying to find a unicorn with diamond underwear and a golden horn
221 replies · 10 reposts · 788 likes · 134.9K views