Scott
@sdmiddlecamp
4.4K posts

Wake Forest MS student. @CalPoly alum. Formerly @DrivelineBB, @SLOTribune, Sounders FC.

Seattle, WA · Joined September 2015
357 Following · 487 Followers
Scott @sdmiddlecamp ·
@tangotiger Yep, it's cool we arrived at similar conclusions with different methodologies! I did it this way as part of a larger hitting model, but simple is usually better. Cleaner, too. Upon rereading, we partially disagree on 3-2, but I think it's just zone proximity vs. run value of swings.
Scott @sdmiddlecamp ·
The conventional wisdom is "be aggressive in hitters' counts," but the data says batters aren't aggressive enough. The bars show the gap between what batters actually do and what the model says is optimal in each count.
[4 images]
Max Greenfield @GreenfieldMax18

We've come so far in the public world that we've gotten a little repetitive. I don't have the time or the skills to do the research myself, but it would be great if someone (smarter and less busy) did new research on arm angles affecting pitch grips, swing decisions based on counts, and other topics.

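A minimal sketch of the comparison behind the chart: given actual and model-optimal swing rates per count, compute the aggression gap. Every number below is an invented placeholder, and `aggression_gap` is a hypothetical helper, not part of the author's model; real inputs would come from Statcast swing rates and the model's per-count optima.

```python
# Swing rates are fabricated for illustration only.
actual_swing = {"2-0": 0.42, "3-1": 0.55, "3-0": 0.08}   # what batters do
optimal_swing = {"2-0": 0.55, "3-1": 0.68, "3-0": 0.20}  # what the model prefers

def aggression_gap(actual, optimal):
    """Positive gap => batters swing less often than the model prefers."""
    return {count: round(optimal[count] - actual[count], 3) for count in actual}

gaps = aggression_gap(actual_swing, optimal_swing)
for count, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{count}: under-swinging by {gap:.1%}")
```

With these placeholder rates, every hitter's count shows an under-swing gap of roughly 12-13 percentage points, which is the shape of the claim in the tweet.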
Scott @sdmiddlecamp ·
Chase vs. Zone SD+ in 2025
[image]
Scott @sdmiddlecamp ·
Other things this model tries to account for:
- Pitch tunneling at the decision point
- Catcher framing (riskier takes)
- Batter zone strengths
- Times through the order
- Pitcher sequencing tendencies
- Umpire tendencies within the same game (would be stronger with Retrosheet)
Scott @sdmiddlecamp ·
Tokens lower the barrier to building something. They don’t lower the barrier to knowing what to build or knowing how it works. Domain knowledge is the moat.
⚾️@MarinersF4n

@drivelinekyle @GiuseppePaps I think about this a lot when it comes to vibecoded platforms in general. If the barrier to entry is tokens, what’s the moat? Name recognition? And how long can someone coast purely on name recognition?

Scott @sdmiddlecamp ·
Gil vs. TOR, 10/4: 2.2 innings, 2 ER, 4 hits. CH: 6 pitches, 0 whiffs, 1.000 BA. The FF and SL tunnel cleanly, but the CH sits on an island.
[image]
Scott @sdmiddlecamp ·
Statcast already quantifies part of the leak. Correlate arm angle with velo across an arsenal; if the correlation is high, hitters can sort hard from soft off the arm.
Weathers vs. TOR, 3/19:
High slot (FF/SL): 26 pitches, 23% whiff, .600 BIP
Low slot (CH/ST/SI): 48 pitches, 40% whiff, .714 BIP
[image]
Eli Ben-Porat 🇨🇦@EliBenPorat

If I were (much) younger, and keen on working in baseball, I would build a computer vision program to quantify how much information pitchers leak when they deliver their different pitch types. Batters are computational geniuses masquerading as athletes.

Scott @sdmiddlecamp ·
@EliBenPorat Kershaw is a great example: when you have the stuff, you can overcome it. You can see from the whiffs generated that Weathers' offspeed is nasty. But the point is this approach sorts entire arsenals into two readable buckets.
Eli Ben-Porat 🇨🇦 @EliBenPorat ·
@sdmiddlecamp Yes, arm angle is a good approach, but just a piece. IIRC Kershaw's curve comes from an extremely different arm angle, but he's been very successful with it.
Scott @sdmiddlecamp ·
@wearefromstars @tangotiger Nice graphic. Outside of leverage, I think it will depend more on individual umpire bias and batter handedness.
j @wearefromstars ·
@tangotiger Do you think this will be on a catcher's mind at all when deciding whether to challenge? Maybe more likely to challenge in counts where the zone is smaller or larger? Or maybe it's too noisy for catchers to make use of at all.
j @wearefromstars ·
One more upside of ABS: the strike zone size doesn't change depending on the count, unlike with umps. The difference in strike zone size between a 0-2 and a 3-0 count is almost 2 inches! Plotted here are the 50% ball/strike contours in each count (2023-2025). Cool trend!
[image]
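A toy version of the measurement behind that plot: simulate called pitches with a count-dependent edge, then recover the 50% called-strike contour per count by binning. The "true" half-widths and the logistic noise scale are invented to mimic the trend described (tighter zone on 0-2, wider on 3-0, roughly 2 inches apart); this is not the original analysis.

```python
import math
import random

random.seed(0)

# Fabricated half-widths (feet) of the called zone by count.
TRUE_EDGE = {"0-2": 0.80, "3-0": 0.95}

def simulate(count, n=20000):
    """Synthetic called pitches: (|plate_x|, was it called a strike)."""
    edge = TRUE_EDGE[count]
    out = []
    for _ in range(n):
        x = random.uniform(0.0, 1.4)
        p_strike = 1.0 / (1.0 + math.exp((x - edge) / 0.05))  # noisy edge
        out.append((x, random.random() < p_strike))
    return out

def edge_50pct(pitches, width=0.05):
    """Outermost horizontal bin whose called-strike rate still reaches 50%."""
    bins = {}
    for x, strike in pitches:
        b = int(x / width)
        strikes, total = bins.get(b, (0, 0))
        bins[b] = (strikes + strike, total + 1)
    ok = [b for b, (s, t) in bins.items() if s / t >= 0.5]
    return (max(ok) + 0.5) * width

for count in ("0-2", "3-0"):
    print(f"{count}: 50% contour ~ {edge_50pct(simulate(count)):.2f} ft from center")
```

With these invented widths the recovered contours differ by about 0.15 ft, i.e. close to the "almost 2 inches" in the tweet.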
Scott @sdmiddlecamp ·
Catchers should:
- Target low-confidence locations early in the season, before batters calibrate
- Exploit early counts where challenge EV is negative
- Be aware of how sequencing impacts edge perception
- Learn pitchers' shadow-zone distributions against their framing profiles
Scott @sdmiddlecamp ·
In 3-2 counts, framing effort is neutralized: expect any close call to be challenged. But hitters face a dilemma in early counts:
Don't challenge → catchers keep the strike (0.125 runs)
Challenge → only a 44% overturn rate
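The dilemma reduces to back-of-envelope EV, using the thread's numbers (0.125 runs for the stolen strike, 44% overturn rate). `CHALLENGE_COST`, the option value of holding a challenge, is a made-up placeholder chosen here only to show how the sign can flip negative in early counts; it is not a figure from the thread.

```python
STRIKE_VALUE = 0.125   # runs at stake on the call (from the thread)
P_OVERTURN = 0.44      # early-count overturn rate (from the thread)
CHALLENGE_COST = 0.10  # hypothetical run value of burning a challenge

def challenge_ev(strike_value, p_overturn, challenge_cost):
    """Expected runs gained by challenging instead of conceding the strike.
    A successful challenge is retained, so the cost applies only on failure."""
    return p_overturn * strike_value - (1 - p_overturn) * challenge_cost

ev = challenge_ev(STRIKE_VALUE, P_OVERTURN, CHALLENGE_COST)
print(f"EV of challenging: {ev:+.4f} runs")
```

At a 44% overturn rate, any plausible challenge cost above ~0.098 runs makes conceding the strike the better play, which is the "negative challenge EV in early counts" point above.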
Scott @sdmiddlecamp ·
The goal shouldn't be to make strikes balls. It should be to maximize uncontested strikes in early counts. RE288 tells you exactly which strikes shouldn't be challenged. Smart framers should want ABS.
[image]
The Athletic@TheAthletic

It’s no longer enough to know the strike zone. Big league catchers now have to learn every strike zone and better recognize the difference between, say, the top of 5-foot-6 Jose Altuve’s strike zone and the top of 6-foot-7 Aaron Judge’s.

Derek Holland @Dutch_Oven45 ·
For those that want the robotic strike zone: that pitch last night would definitely have been a strike. All the ball has to do is touch any part of the 3D cube. And with breaking balls, you could throw some that bounce (yes, literally bounce) and still clip the cube for a called strike. So, as a reminder: it's not where the catcher catches it, it's where the ball crosses the plate. I promise the robotic strike zone would have called that a strike, because it crossed, let alone clipped, the strike zone cube.
Scott retweeted
Robert Stock @RobertStock6 ·
Listening to John Smoltz be so confidently wrong for 3 hours straight is an experience
Scott @sdmiddlecamp ·
[image]
Andrej Karpathy @karpathy

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well-tuned project. This is a first for me, because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, and so on. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end to end, all by itself, as it worked through approximately 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are real: I didn't find them manually before, and they stack up and actually improved nanochat. Among the bigger things:

- It noticed an oversight that my parameterless QKNorm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

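The keep-or-revert loop described above can be sketched in a few lines: propose a change, run a cheap training proxy, and keep the "commit" only if validation loss improves. `run_training` and `propose_change` below are toy stand-ins (a quadratic objective with noise, and a one-hyperparameter jitter), not the nanochat code.

```python
import random

random.seed(42)

def run_training(config):
    """Toy proxy for a short training run: val loss is best near lr = 0.01."""
    return (config["lr"] - 0.01) ** 2 + random.gauss(0, 1e-5)

def propose_change(config):
    """Stand-in for the agent: jitter one hyperparameter."""
    new = dict(config)
    new["lr"] = max(1e-4, new["lr"] + random.gauss(0, 0.005))
    return new

config = {"lr": 0.05}          # deliberately mistuned starting point
best_loss = run_training(config)
kept = 0
for _ in range(200):
    candidate = propose_change(config)
    loss = run_training(candidate)
    if loss < best_loss:       # "commit" the change
        config, best_loss, kept = candidate, loss, kept + 1
    # otherwise revert: the candidate is simply dropped

print(f"kept {kept} of 200 changes; lr={config['lr']:.4f}, loss={best_loss:.6f}")
```

The real system layers an LLM agent, git branches, and multi-hour runs on top, but the control flow is this same greedy accept/revert search over experiment results.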
Chris Worsey @Chris_Worsey ·
I took the @karpathy autoresearch loop and pointed it at markets. 25 AI agents debate macro, rates, commodities, sectors, and single stocks daily. Every recommendation is scored against real outcomes. The worst agent by rolling Sharpe gets its prompt rewritten by the system: keep or revert. Same loop; the prompts are the weights, Sharpe is the loss function.

Trained the agents on 18 months of market data: 378 iterations, 54 prompt modifications, 16 survived. The system learned which agents to trust using Darwinian weights. Geopolitical, commodities, and the @BillAckman quality-compounder agents rose to the top. The agents even figured out their own portfolio manager was the weakest link before we did!

Deployed the trained agents: +22% in 173 days. Best pick: AVGO at $152, held for +128%. The final prompts are evolutionary products, shaped by market feedback rather than human intuition. Now running live with my own capital. github.com/chrisworsey55/… Part hedge fund, part research experiment :)
Andrej Karpathy @karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code. Then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely, without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)

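Since the market experiment scores agents by rolling Sharpe, here is a minimal sketch of that scoring, assuming daily returns, a zero risk-free rate, and the usual annualization by the square root of 252 trading days. The return series are fabricated; nothing here reproduces the actual system.

```python
import math
import random
import statistics

def rolling_sharpe(returns, window=20, periods_per_year=252):
    """Annualized Sharpe ratio over a trailing window of daily returns
    (risk-free rate assumed zero for simplicity)."""
    out = []
    for i in range(window, len(returns) + 1):
        chunk = returns[i - window:i]
        out.append(math.sqrt(periods_per_year)
                   * statistics.fmean(chunk) / statistics.stdev(chunk))
    return out

# Fabricated daily returns for two "agents": one with drift, one pure noise.
random.seed(7)
steady = [0.001 + random.gauss(0, 0.005) for _ in range(60)]
noisy = [random.gauss(0, 0.01) for _ in range(60)]

print("steady agent, latest rolling Sharpe:", round(rolling_sharpe(steady)[-1], 2))
print("noisy agent, latest rolling Sharpe:", round(rolling_sharpe(noisy)[-1], 2))
```

In the described loop, the lowest scorer by this metric is the one whose prompt gets rewritten, so the window length effectively sets how quickly agents can be condemned by bad luck rather than bad reasoning.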