Scott Condron

3K posts

@_ScottCondron

Helping build AI dev tools at @weights_biases. I post about AI, data visualisation and the stuff I’m working on at wandb.

Dublin, Ireland · Joined April 2018
2K Following · 5.7K Followers
Pinned Tweet
Scott Condron @_ScottCondron ·
Here's an animation of a @PyTorch DataLoader. It turns your dataset into an iterator of shuffled, batched tensors. (This is my first animation using @manim_community, the community fork of @3blue1brown's manim.) Here's a little summary of the different parts for those curious: 1/5
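The mechanics the animation illustrates can be sketched in plain Python. This is a hypothetical, minimal stand-in for `torch.utils.data.DataLoader` (sampler, then batching, then collation), not the real implementation:

```python
import random

def data_loader(dataset, batch_size, shuffle=True, seed=None):
    """Minimal sketch of what a DataLoader does: a sampler yields
    (optionally shuffled) indices, which are grouped into batches."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # the sampler step
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        # the collate step: gather the sampled items into one batch
        yield [dataset[i] for i in batch_idx]

dataset = list(range(10))
batches = list(data_loader(dataset, batch_size=4, seed=0))
# every sample appears exactly once across the batches
assert sorted(x for b in batches for x in b) == dataset
```

The real DataLoader adds worker processes, pinned memory, and a `collate_fn` that stacks samples into tensors, but the shuffle-then-batch loop above is the core idea.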
Scott Condron retweeted
WolfBench @WolfBenchAI ·
For benchmarks, I keep agent versions stable so results stay comparable. But new models can expose agent-side bugs. Here, updating @openclaw from 2026.3.11 to 2026.4.23 lifted Kimi K2.6 from 4% to 60% on @WolfBenchAI due to crucial fixes in how the agent handles its tool calling.
Scott Condron retweeted
WolfBench @WolfBenchAI ·
GPT-5.5 takes over WolfBench! It’s now the #1 model, ahead of Claude Opus 4.7 and 4.6, GPT-5.4, Sonnet 4.6, Kimi K2.6, Gemini 3.1 Pro, and more.
Notable findings after 30 runs (40h runtime, >1.7B tokens, ~$3K cost):
- @OpenAI's GPT-5.5 is the best model we ever tested.
- @cursor_ai's Agent CLI (CA) is the best agent we ever tested.
- @NousResearch's Hermes Agent (HA) outperformed OpenClaw (OC).
- With Hermes, going from medium to xhigh reasoning only improved consistency, not capability.
Note: This is WolfBench, where we look at more than just the average score, because one metric is not enough.
- The golden ∅ score is the actual 5-run average, which most other benchmarks report as their only score.
- ★ shows the ceiling: what percentage of the full benchmark this model+agent combination solved at least once across all runs.
- ■ shows the solid base: what percentage of the full benchmark it solved consistently in every run.
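How the three scores relate can be sketched in a few lines of Python. This is an illustrative computation over made-up data, not actual WolfBench code; `wolfbench_scores` and the task names are hypothetical:

```python
def wolfbench_scores(runs):
    """Compute the three WolfBench-style scores from per-run results.

    `runs` is a list of runs; each run maps task name -> True/False
    (solved or not). The task set is assumed identical across runs."""
    tasks = list(runs[0].keys())
    n = len(tasks)
    # ∅: mean per-run solve rate (the single number most benchmarks report)
    avg = sum(sum(r.values()) / n for r in runs) / len(runs)
    # ★ ceiling: fraction of tasks solved in at least one run
    ceiling = sum(any(r[t] for r in runs) for t in tasks) / n
    # ■ base: fraction of tasks solved in every run
    base = sum(all(r[t] for r in runs) for t in tasks) / n
    return avg, ceiling, base

runs = [
    {"a": True, "b": True, "c": False, "d": False},
    {"a": True, "b": False, "c": True, "d": False},
]
print(wolfbench_scores(runs))  # (0.5, 0.75, 0.25)
```

By construction ■ ≤ ∅ ≤ ★: a flaky model+agent combo shows a wide gap between its base and its ceiling even when the average looks identical to a consistent one.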
Scott Condron retweeted
Weights & Biases @wandb ·
Still feels a little unreal that you can just upload a dataset, get a fine-tuned LoRA back, and have it auto-deployed for inference without touching a single GPU config. Serverless SFT is still in public preview and adapter training is free right now. Don't sleep on it.
Scott Condron retweeted
Bowen Baker @bobabowen ·
Today we open sourced many of OpenAI's monitorability evaluations. We hope that the research community and other model developers can build upon them and use them to evaluate the monitorability of their own models. alignment.openai.com/monitorability…
Scott Condron retweeted
Weights & Biases @wandb ·
v26.5 of the W&B iOS app is live!
› Full run logs on your phone. Live, searchable, exportable.
› Stop a run from your phone.
› Server-side run search.
› UI polish. 🫡
This release is based on your feedback, so please keep it coming.
Gavin Nelson @Gavmn ·
I thought working at a frontier lab would make it easier to stay on top of ai news...
Scott Condron retweeted
Emmanuel Turlay @neutralino1 ·
CC has become a major building block of the new agent-first application paradigm. One CC turn can spawn hundreds of steps (tool calls, reflection, planning, etc.). Without clear visibility, you're building blind and the outcomes are 🤷 Agents without observability will just fail.
Weights & Biases @wandb

Building with Claude Code? You need to see what's happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go.

Scott Condron retweeted
Weights & Biases @wandb ·
Building with Claude Code? You need to see what's happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go.
Scott Condron @_ScottCondron ·
Having worked at @wandb for years, one thing we always wanted to capture was the "why" behind experiments - not only the runs. Reports help, but it still takes effort to get things down.

Now that Claude Code is everyone's experimentation partner - kicking off research, synthesizing results, suggesting next iterations - we have a real shot at logging that work automatically, alongside your runs.

So we built two things: a Claude Code integration that auto-logs your sessions, and a W&B Skill that teaches Claude to work with W&B - query runs, suggest experiments, analyze results.

Excited to see how teams use this to iterate.
Scott Condron retweeted
marimo @marimo_io ·
Our friend and ambassador @pandeyparul is teaching a free live @OReillyMedia workshop on using marimo for AI and ML development. 5 hands-on modules covering reactive notebooks, interactive ML workflows, AI coding agents, and more. 🗓️ May 18, 2026 Register here: oreilly.com/live/marimo-fo…
Ian Arawjo @IanArawjo ·
@_ScottCondron Let’s touch base later in the summer. One of my students may work on a topic around Splat, close to this qualitative analysis problem. (Aside from this, I’d probably have some things to ask you about stats for evals, too!)
Ian Arawjo @IanArawjo ·
A few months ago, I released Splat: an affinity diagramming tool in a single file. Today, a bright PhD student at Northeastern extended Splat to create "Splatter", with many more features for qualitative analysis, like tags and a codebook. Try it here: hasiburrahman.net/splatter/
Scott Condron retweeted
clem 🤗 @ClementDelangue ·
"But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug." aisle.com/blog/ai-cybers…
Scott Condron retweeted
Weights & Biases @wandb ·
You start a training run. You leave for a jog. Your phone buzzes. "Loss diverged." Sprint home. Or... "Loss below threshold." Keep jogging. W&B Automations are now LIVE for everyone!
Scott Condron @_ScottCondron ·
We're looking for a Chief Shitposter
- You ragebait celeb accounts most days
- Being muted is a badge of honour
- You're currently a Reply Guy for 10+ accounts
- Your engagement farm is almost fully automated