Scott Condron

3K posts

@_ScottCondron

Helping build AI dev tools at @weights_biases. I post about AI, data visualisation and the stuff I’m working on at wandb.

Dublin, Ireland · Joined April 2018
2K Following · 5.7K Followers
Pinned Tweet
Scott Condron @_ScottCondron ·
Here's an animation of a @PyTorch DataLoader. It turns your dataset into an iterator of shuffled, batched tensors. (This is my first animation using @manim_community, the community fork of @3blue1brown's manim.) Here's a little summary of the different parts for those curious: 1/5
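The mechanics the animation illustrates can be sketched in plain Python. This is a hypothetical, minimal stand-in for `torch.utils.data.DataLoader` (sampler, then batching, then collation), not the real implementation:

```python
import random

def data_loader(dataset, batch_size, shuffle=True, seed=None):
    """Minimal sketch of what a DataLoader does: a sampler yields
    (optionally shuffled) indices, which are grouped into batches."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # the sampler step
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        # the collate step: gather the sampled items into one batch
        yield [dataset[i] for i in batch_idx]

dataset = list(range(10))
batches = list(data_loader(dataset, batch_size=4, seed=0))
# every sample appears exactly once across the batches
assert sorted(x for b in batches for x in b) == dataset
```

The real DataLoader adds worker processes, pinned memory, and a `collate_fn` that stacks samples into tensors, but the shuffle-then-batch loop above is the core idea.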
Scott Condron retweeted
WolfBench @WolfBenchAI ·
For benchmarks, I keep agent versions stable so results stay comparable. But new models can expose agent-side bugs. Here, updating @openclaw from 2026.3.11 to 2026.4.23 lifted Kimi K2.6 from 4% to 60% on @WolfBenchAI due to crucial fixes in how the agent handles its tool calling.
Scott Condron retweeted
WolfBench @WolfBenchAI ·
GPT-5.5 takes over WolfBench! It’s now the #1 model, ahead of Claude Opus 4.7 and 4.6, GPT-5.4, Sonnet 4.6, Kimi K2.6, Gemini 3.1 Pro, and more.
Notable findings after 30 runs (40h runtime, >1.7B tokens, ~$3K cost):
- @OpenAI's GPT-5.5 is the best model we ever tested.
- @cursor_ai's Agent CLI (CA) is the best agent we ever tested.
- @NousResearch's Hermes Agent (HA) outperformed OpenClaw (OC).
- With Hermes, going from medium to xhigh reasoning only improved consistency, not capability.
Note: This is WolfBench, where we look at more than just the average score, because one metric is not enough.
- The golden ∅ score is the actual 5-run average, which most other benchmarks report as their only score.
- ★ shows the ceiling: what percentage of the full benchmark this model+agent combination solved at least once across all runs.
- ■ shows the solid base: what percentage of the full benchmark it solved consistently in every run.
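How the three scores relate can be sketched in a few lines of Python. This is an illustrative computation over made-up data, not actual WolfBench code; `wolfbench_scores` and the task names are hypothetical:

```python
def wolfbench_scores(runs):
    """Compute the three WolfBench-style scores from per-run results.

    `runs` is a list of runs; each run maps task name -> True/False
    (solved or not). The task set is assumed identical across runs."""
    tasks = list(runs[0].keys())
    n = len(tasks)
    # ∅: mean per-run solve rate (the single number most benchmarks report)
    avg = sum(sum(r.values()) / n for r in runs) / len(runs)
    # ★ ceiling: fraction of tasks solved in at least one run
    ceiling = sum(any(r[t] for r in runs) for t in tasks) / n
    # ■ base: fraction of tasks solved in every run
    base = sum(all(r[t] for r in runs) for t in tasks) / n
    return avg, ceiling, base

runs = [
    {"a": True, "b": True, "c": False, "d": False},
    {"a": True, "b": False, "c": True, "d": False},
]
print(wolfbench_scores(runs))  # (0.5, 0.75, 0.25)
```

By construction ■ ≤ ∅ ≤ ★: a flaky model+agent combo shows a wide gap between its base and its ceiling even when the average looks identical to a consistent one.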
Scott Condron retweeted
Weights & Biases @wandb ·
Still feels a little unreal that you can just upload a dataset, get a fine-tuned LoRA back, and have it auto-deployed for inference without touching a single GPU config. Serverless SFT is still in public preview and adapter training is free right now. Don't sleep on it.
Scott Condron retweeted
Bowen Baker @bobabowen ·
Today we open sourced many of OpenAI's monitorability evaluations. We hope that the research community and other model developers can build upon them and use them to evaluate the monitorability of their own models. alignment.openai.com/monitorability…
Scott Condron retweeted
Weights & Biases @wandb ·
v26.5 of the W&B iOS app is live!
› Full run logs on your phone. Live, searchable, exportable.
› Stop a run from your phone.
› Server-side run search.
› UI polish. 🫡
This release is based on your feedback, so please keep it coming.
Gavin Nelson @Gavmn ·
I thought working at a frontier lab would make it easier to stay on top of ai news...
Scott Condron retweeted
Emmanuel Turlay @neutralino1 ·
CC has become a major building block of the new agent-first application paradigm. One CC turn can spawn hundreds of steps (tool calls, reflection, planning, etc.). Without clear visibility, you're building blind and the outcomes are 🤷 Agents without observability will just fail.
Weights & Biases @wandb

Building with Claude Code? You need to see what's happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go.

Scott Condron retweeted
Weights & Biases @wandb ·
Building with Claude Code? You need to see what's happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go.
Scott Condron @_ScottCondron ·
Having worked at @wandb for years, one thing we always wanted to capture was the "why" behind experiments - not only the runs. Reports help, but it still takes effort to get things down.

Now that Claude Code is everyone's experimentation partner - kicking off research, synthesizing results, suggesting next iterations - we have a real shot at logging that work automatically, alongside your runs.

So we built two things: a Claude Code integration that auto-logs your sessions, and a W&B Skill that teaches Claude to work with W&B - query runs, suggest experiments, analyze results.

Excited to see how teams use this to iterate.
Scott Condron retweeted
marimo @marimo_io ·
Our friend and ambassador @pandeyparul is teaching a free live @OReillyMedia workshop on using marimo for AI and ML development. 5 hands-on modules covering reactive notebooks, interactive ML workflows, AI coding agents, and more. 🗓️ May 18, 2026 Register here: oreilly.com/live/marimo-fo…
Ian Arawjo @IanArawjo ·
@_ScottCondron Let’s touch base later in the summer. One of my students may work on a topic around Splat, close to this qualitative analysis problem. (Aside from this, I’d probably have some things to ask you about stats for evals, too!)
Ian Arawjo @IanArawjo ·
A few months ago, I released Splat: an affinity diagramming tool in a single file. Today, a bright PhD student at Northeastern extended Splat to create "Splatter", with many more features for qualitative analysis, like tags and a codebook. Try it here: hasiburrahman.net/splatter/
Scott Condron retweeted
clem 🤗 @ClementDelangue ·
"But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug." aisle.com/blog/ai-cybers…
Scott Condron retweeted
Weights & Biases @wandb ·
You start a training run. You leave for a jog. Your phone buzzes. "Loss diverged." Sprint home. Or... "Loss below threshold." Keep jogging. W&B Automations are now LIVE for everyone!
Scott Condron @_ScottCondron ·
We're looking for a Chief Shitposter
- You ragebait celeb accounts most days
- Being muted is a badge of honour
- You're currently a Reply Guy for 10+ accounts
- Your engagement farm is almost fully automated