Jim Bohnslav
@jbohnslav

7.1K posts

training VLMs @zoox

Boston, MA · Joined February 2011
4.3K Following · 2.2K Followers
Jim Bohnslav
Jim Bohnslav@jbohnslav·
@vikhyatk open chatgpt. "create an image that looks like pen and paper..."
0
0
1
419
vik
vik@vikhyatk·
ML interview question: Here are the weights for Llama 3.1 70B. Generate a token by executing the forward pass manually using pen and paper. You have 30 minutes.
55
24
1.4K
115.6K
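For scale: even a toy version of that forward pass involves embedding lookups, attention, an MLP, and an unembedding before you get a single token. Below is a miniature numpy sketch of those steps under made-up assumptions (tiny dimensions, one decoder block, ReLU in place of a gated MLP); it is nothing like Llama 3.1 70B's actual configuration.

```python
# Toy single-token "forward pass by hand": one decoder block at miniature scale.
# Dimensions, weights, and the ReLU MLP are illustrative stand-ins, not Llama 3.1 70B.
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 16, 8                                  # tiny vocabulary and hidden size
E = rng.normal(size=(vocab, d))                   # token embeddings (also used to unembed)
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def rmsnorm(x):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + 1e-6)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.array([3, 7, 1])                      # prompt token ids
x = E[tokens]                                     # (seq, d)

# self-attention with a causal mask, plus residual
h = rmsnorm(x)
q, k, v = h @ Wq, h @ Wk, h @ Wv
scores = q @ k.T / np.sqrt(d)
scores += np.triu(np.full((len(tokens), len(tokens)), -np.inf), k=1)
x = x + softmax(scores) @ v @ Wo

# MLP with residual (ReLU stand-in for the gated MLP)
x = x + np.maximum(rmsnorm(x) @ W1, 0.0) @ W2

logits = rmsnorm(x[-1:]) @ E.T                    # unembed the last position
next_token = int(np.argmax(logits))               # greedy choice of the next token
print(next_token)
```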
Jim Bohnslav retweeted
Shenzhi Wang🌟
Shenzhi Wang🌟@ShenzhiWang_THU·
When training Qwen3.5, we kept asking ourselves: 🧐 What kind of multimodal RLVR data actually leads to generalizable gains? 💡 We believe the answer may not lie only in data tightly tailored to specific benchmarks, but also in OOD proxy tasks that train the foundational abilities behind long-chain visual reasoning.

The motivation is simple: VLMs are still unreliable in long-CoT settings. Small mistakes in perception, reasoning, knowledge use, or grounding can compound across intermediate steps and eventually lead to much larger final errors. However, much of today's RLVR data still does not require complex reasoning chains grounded in visual evidence throughout, meaning these failure modes are often not sufficiently stressed during training.

🚀 Excited to share our new work from Qwen and Tsinghua LeapLab:
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
This is also one of the training task sources used in Qwen3.5 VL RLVR.

To study this question, we propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training. The key idea is to build each query as a chain of logically dependent hops: earlier hops establish the instances, sets, or conditions needed for later hops, while the model must repeatedly return to the image for fresh visual grounding along the way. At the same time, each query ends with a specific, unambiguous numerical answer, making it naturally suitable for verifiable rewards.

Concretely, HopChain combines two complementary structures: perception-level hops and instance-chain hops. We require each synthesized example to involve both, so the model cannot simply continue reasoning from language inertia. Instead, it is forced to keep grounding intermediate steps in the image, maintain cross-step dependencies, and control error accumulation across long reasoning trajectories. Our goal is not to mimic any specific downstream benchmark, but to strengthen the more fundamental abilities that long-CoT vision-language reasoning depends on.

We add HopChain-synthesized data into RLVR training for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and evaluate on 24 benchmarks spanning diverse domains. Despite not being designed for any particular benchmark, HopChain improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. We also find that full chained multi-hop queries are crucial: replacing them with half-multi-hop or single-hop variants reduces performance substantially. Most notably, the gains are especially strong on long-CoT and ultra-long-CoT vision-language reasoning, peaking at more than 50 accuracy points in the ultra-long-CoT regime.

Our main takeaway is simple: beyond benchmark-aligned data, OOD proxy tasks that systematically train the core mechanics of long-chain visual reasoning can be a powerful and scalable source of RLVR supervision for VLMs, and can lead to more generalizable improvements.

🔗 huggingface.co/papers/2603.17…
[3 images]
2
55
434
57.6K
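The thread doesn't spell out the data schema, but the recipe it describes (a chain of dependent hops over one image, ending in a single unambiguous number checked by a verifiable reward) suggests something like the sketch below. All field names, the example query, and the reward rule are hypothetical illustrations, not HopChain's actual format.

```python
# Hypothetical sketch of a HopChain-style training record and a verifiable
# numeric reward. Field names and the example are illustrative only.
from dataclasses import dataclass
import re

@dataclass
class MultiHopExample:
    image_path: str        # the image every hop must re-ground against
    hops: list[str]        # logically dependent sub-questions (perception + instance-chain)
    question: str          # the final composed query shown to the model
    answer: float          # unambiguous numeric target, enabling a verifiable reward

def numeric_reward(model_output: str, target: float, tol: float = 1e-3) -> float:
    """Binary verifiable reward: 1.0 if the last number in the output matches the target."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - target) <= tol else 0.0

example = MultiHopExample(
    image_path="scene.jpg",
    hops=[
        "Which shelf holds the red boxes?",          # perception-level hop
        "Count the red boxes on that shelf.",        # instance-chain hop
        "Multiply that count by the price on the tag shown.",
    ],
    question="What is the total price of the red boxes on the correct shelf?",
    answer=18.0,
)
print(numeric_reward("... so the total is 18.0", example.answer))  # 1.0
```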
Soumith Chintala
Soumith Chintala@soumithchintala·
@dphuang2 not AI, the book was actually made by Hatty Wang at Harvard. here's a few pages....
[4 images]
4
3
98
2.6K
Soumith Chintala
Soumith Chintala@soumithchintala·
someone's getting started early!
[image]
73
169
3.6K
105.4K
Lotto
Lotto@LottoLabs·
I like my models small, chinese, dense and not thinking.
13
12
214
16.3K
json
json@JsonBasedman·
Codex writes good code but my God GPT-5.4 is a chore to talk to
57
6
552
62.5K
Jim Bohnslav
Jim Bohnslav@jbohnslav·
@paradite_ it feels different the past few days. can't even use a CLI that smart opus created last week
0
0
5
1.6K
Zhu Liang
Zhu Liang@paradite_·
Opus 4.6 is literally broken right now. Examples:
- Ask it to re-run pass@3, it proceeds to run pass@1.
- Ask it to check recent commits, it misses the most recent commit.
- Ask it why it did something, it apologizes and executes commands instead of just giving an answer.
69
11
517
79.6K
Jim Bohnslav
Jim Bohnslav@jbohnslav·
I wonder what it feels like for GPT5.3 to read something GPT5.4 wrote. So similar, and yet smarter. Like meeting your higher-achieving long-lost twin.
1
0
1
267
Tenobrus
Tenobrus@tenobrus·
unfortunately ghostty remains completely useless to me until it supports tmux -CC. somehow, across literally all platforms, iterm 2 remains literally the only emulator to build complete support for this, so i'm absolutely locked in
Mitchell Hashimoto@mitchellh

Ghostty 1.3 is now out! Scrollback search, native scrollbars, click-to-move cursor, rich clipboard copy, AppleScript, split drag/drop, Unicode 17 and international text improvements, massive performance improvements, and hundreds more changes. ghostty.org/docs/install/r…

22
1
205
74.7K
Jim Bohnslav retweeted
CLS
CLS@ChengleiSi·
Great to see autoresearch blowing up because of the legendary Karpathy sensei. This year will of course be an exciting year for automated AI research. For all of you excited to jump onto it, hopefully our papers will be some helpful references:
- automated feedback loop for research agents to optimize LLM pre-training and post-training stacks: x.com/ChengleiSi/sta…
- generating novel research ideas with LLMs, along with a comparison against human experts: x.com/ChengleiSi/sta…
- evaluating the effectiveness of LLM-generated ideas through experiment execution: x.com/ChengleiSi/sta…
- finetuning LLMs to directly predict the effectiveness of research ideas: x.com/jiaxinwen22/st…
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)

9
27
342
49.8K
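Stripped to its skeleton, the loop Karpathy describes is: propose an edit to the training script, run a fixed-length training job, and keep the edit (as a git commit) only if validation loss improved. A rough sketch of that outer loop follows; the file names, the agent-edit stub, and the val_loss log format are placeholders I'm assuming, not what the actual repo uses.

```python
# Sketch of an agent-driven hill-climbing loop over a training script.
# Assumptions (not the real repo): train.py prints "val_loss=<float>",
# the agent edit is a stub, and each run finishes within the timeout.
import re
import subprocess

def run_training(script: str = "train.py") -> float:
    """Launch one bounded training run and parse the final validation loss."""
    proc = subprocess.run(["python", script], capture_output=True, text=True,
                          timeout=6 * 60)
    return float(re.findall(r"val_loss=([\d.]+)", proc.stdout)[-1])

def propose_edit(prompt_path: str, script: str) -> None:
    """Placeholder: an LLM agent rewrites `script`, guided by the prompt file."""
    ...

best = run_training()
for step in range(100):
    propose_edit("prompt.md", "train.py")          # agent iterates on the .py
    loss = run_training()                          # one ~5-minute run per dot
    if loss < best:                                # keep only improvements...
        best = loss
        subprocess.run(["git", "commit", "-am", f"step {step}: val_loss {loss:.4f}"])
    else:                                          # ...and revert the rest
        subprocess.run(["git", "checkout", "--", "train.py"])
```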
Jim Bohnslav
Jim Bohnslav@jbohnslav·
@ericliuof97 home robotics is in its exuberant era, like self driving in 2015. soon people will realize that Figure falling down your stairs will crush your toddler to death. then it'll take 10 more years to get to the safety level of eg Waymo (and Zoox!)
0
0
1
33
Crystal
Crystal@crystalsssup·
what?! @sainingxie is joining Yann LeCun's new lab, AMI Labs, as a cofounder and CSO
[image]
20
15
491
40.1K
Brian Li
Brian Li@Brian_Bo_Li·
Working with GOATs, Saining, LeCun, and more names that I've long dreamed of.
Saining Xie@sainingxie

i’m joining forces with @ylecun and an incredible group of people to start AMI Labs @amilabs. AMI isn’t a conventional lab. we don’t intend to become one. a lot to say about why this moment matters, but for now we’re heads down building. join us: amilabs.xyz

3
1
128
9.7K
Jim Bohnslav
Jim Bohnslav@jbohnslav·
@vikhyatk p(state_{t+1} | state_t, action). The action conditioning makes it different from e.g. generic text-to-video models.
0
0
1
146
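A minimal sketch of that factorization, p(state_{t+1} | state_t, action), as a model: the next-state distribution is conditioned on the action as well as the current state, which is exactly what a generic text-to-video model lacks. Shapes and architecture below are illustrative only, not any particular system.

```python
# Minimal action-conditioned world model p(s_{t+1} | s_t, a_t).
# Dimensions and architecture are illustrative, not any specific system's.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, state_dim: int = 32, action_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2 * state_dim),  # mean and log-variance of the next state
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        # Condition on BOTH state and action; dropping the action input would
        # collapse this to an unconditional next-state predictor.
        out = self.net(torch.cat([state, action], dim=-1))
        mean, log_var = out.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_var.exp().sqrt())

model = TinyWorldModel()
s_t = torch.randn(1, 32)
a_t = torch.randn(1, 4)
dist = model(s_t, a_t)
s_next = dist.sample()          # roll the model forward one step
print(s_next.shape)             # torch.Size([1, 32])
```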
vik
vik@vikhyatk·
not sure what a world model is and at this point i’m too afraid to ask
35
1
203
14.7K
Lucas Beyer (bl16)
Lucas Beyer (bl16)@giffmana·
I had a funny/cute back-and-forth with Claude last week at work, where I had two trainings running, and one was much slower than the other, but there was no good reason for it. So I asked Claude for help, but Claude could not access either system, so I let Claude use me as a tool: it told me what commands to run on both systems, and I gave it the commands' outputs. It went something like this:

C: Aha! I found the issue, it is [blabla]
Me: What can I run on both systems to conclusively confirm or deny this hypothesis?
C: run [commands]
Me: here's the output of [commands]: [outputs]
C: This changes everything! Now the picture is crystal clear! The real issue is [blabla]
Me: What can I run on both systems to conclusively confirm or deny this hypothesis?
C: run [commands]
Me: here's the output of [commands]: [outputs]
C: These results are enlightening! I was completely wrong, but now I am certain the issue is [blabla]
Me: What can I run on both systems to conclusively confirm or deny this hypothesis?
C: run [commands]
Me: here's the output of [commands]: [outputs]
C: Wow! Thank you! These results are extremely helpful. I was wrong. But also, now everything is clear! The issue is [blabla]

and so on and so on lol. It was pretty endearing, if it weren't a Friday afternoon with this blocking me from running the big weekend run 😅
18
3
167
18.6K
Ross Wightman
Ross Wightman@wightmanr·
Time flies. After almost 4 years at @huggingface, I'm moving on. A major part of that chapter was timm, which I sold to the company and continued to build. For anyone relying on it, I've agreed to collaborate on bug fixes and basic maintenance, but new feature development will likely cease.

It was a meaningful chapter, and I'm thankful for the opportunity to grow timm over that time. AI is moving incredibly fast, and I'm excited to focus on new ideas and opportunities that feel like the right fit for this moment. There will be significant decisions for me ahead. I look forward to more of the serendipitous collaborations (e.g. OpenCLIP, ResNet Strikes Back, HTTY ViT) that I've enjoyed in the past.

I'm currently working on a long-overdue OpenCLIP refactoring that I hope will be useful for all and make it easier to add new model + objective combinations.
36
20
445
28K
Jim Bohnslav retweeted
Ai2
Ai2@allen_ai·
📢 Update: the Molmo 2 codebase is now open source. We're releasing the code behind Molmo 2—our open model family for video & image understanding, pointing, tracking, & more. Now you can easily train Molmo 2 on your own data. 🧵
[image]
6
51
364
31K
Jim Bohnslav retweeted
Ted Zadouri
Ted Zadouri@tedzadouri·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed!

joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
[image]
7
132
781
221.7K
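For readers wondering why exp2 shows up in an attention kernel at all: the softmax exponential is commonly evaluated through the GPU's fast exp2 path via the identity exp(x) = 2^(x·log2 e), so exp2 throughput becomes part of the softmax cost the tweet refers to. A generic numpy illustration of that identity, not FlashAttention-4's kernel code:

```python
# Illustration of the exp-to-exp2 rewrite used in fast softmax kernels:
# exp(x) == 2**(x * log2(e)), so log2(e) can be folded into the scores and
# the hardware exp2 path used instead of exp. Generic numpy, not FA-4 code.
import numpy as np

LOG2E = 1.4426950408889634  # log2(e)

def softmax_exp2(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax computed through exp2, numerically matching the exp version."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # standard max-subtraction
    p = np.exp2(shifted * LOG2E)                             # exp(x) via exp2(x * log2 e)
    return p / p.sum(axis=-1, keepdims=True)

def softmax_exp(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(shifted)
    return p / p.sum(axis=-1, keepdims=True)

x = np.random.randn(4, 8)
assert np.allclose(softmax_exp2(x), softmax_exp(x))  # same result, different exponential path
```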
Jim Bohnslav
Jim Bohnslav@jbohnslav·
@uwukko maybe claire will be better at writing rust
0
0
14
1.3K