Justin Waugh

439 posts

@JustinWaugh

Founder @ Approximate Labs

Joined July 2009
605 Following · 268 Followers
Justin Waugh
Justin Waugh@JustinWaugh·
@LLMJunky @mweinbach Dell via call for quote (the video above is of the Dell version, hence the "them") I saw the cdw thing as well, but not clear that that is real / actually available. I haven't heard back directly from any other supplier (there are 6 that were on the NVIDIA website)
English
0
0
0
20
Max Weinbach
Max Weinbach@mweinbach·
Does anyone have pricing on this? A couple months ago I was betting like $50-75K
Matthew Berman@MatthewBerman

.@nvidia hand delivered a pre-production unit of the @Dell Pro Max with GB300 to my house. 100lbs beast with 750GB+ of unified memory to power the best open-source models in the world. What should I test first?

English
14
2
65
16.8K
Justin Waugh
Justin Waugh@JustinWaugh·
@mweinbach (called them and got this number, precisely: 169,005.32 base, 183,029.39 with tax) i was seriously ready to buy at ~80k-100k price point, but was shocked to hear the price they are asking.
English
1
0
2
88
Justin Waugh
Justin Waugh@JustinWaugh·
@thsottiaux Codex stops early way too often, and i have no way to meta-prompt it to wake itself up on a schedule / keep iterating like i can with claude code
English
0
0
2
134
Tibo
Tibo@thsottiaux·
What are we consistently getting wrong with codex that you wish we would improve / fix?
English
1.2K
14
872
141.3K
NVIDIA AI Developer
NVIDIA AI Developer@NVIDIAAIDev·
NVIDIA DGX Station is now available to order from select OEMs🔥 Powered by the GB300 Grace Blackwell Ultra Desktop Superchip, DGX Station brings data-center-class AI performance to the desk, enabling developers to build and run autonomous AI agents locally. ⚡ 748GB of coherent memory ⚡ Up to 20 petaflops of AI compute performance ⚡ Run large open models up to one trillion parameters Together with NVIDIA NemoClaw, an open source stack that simplifies running OpenClaw always-on assistants, more safely, with a single command, we are delivering a full-stack platform for secure, long-running agentic AI. Learn more: blogs.nvidia.com/blog/gtc-2026-…
English
114
153
1.4K
234.4K
leon
leon@towheretobegin·
When I moved to new york, I found it hard to visualize what commute times actually looked like. The same dilemma occurs every time you move, or even book a hotel: what's actually accessible in 20 minutes of public transit? Deployment link below
English
99
136
5.2K
1.4M
Justin Waugh
Justin Waugh@JustinWaugh·
@_chenglou Very cool~ I wonder if it would generalize well across more puzzle varieties (eg. those in pencil puzzle bench ppbench.com )
English
0
0
0
569
Cheng Lou
Cheng Lou@_chenglou·
I’m very happy to present my toy research project: Sotaku! It's a neural net that automatically discovered the rules of sudoku and learned to solve them, achieving a new state-of-the-art score of 98.9% on one of the hardest sudoku datasets, while being agnostic to the game, and beating all other sudoku-optimized neural net architectures* Read more for fun motivations, plus some extremely unconventional discoveries, e.g. reverse curriculum consistently beating curriculum (!), emergent reasoning-like capabilities, and the future of traditional programming
English
22
96
1.2K
86.7K
Carter McKay
Carter McKay@carterwmckay·
@scaling01 Where are you finding the puzzles, all the sites I tried had some broken puzzles, 0 and 1 points directly at each other.
English
1
0
0
70
Justin Waugh
Justin Waugh@JustinWaugh·
@AndilesAnthony @scaling01 For this puzzle specifically here's the trace: ppbench.com/replay.html?mo… it took ~30 minutes thinking and then one-shot it, haha.
English
0
0
0
43
Justin Waugh
Justin Waugh@JustinWaugh·
I wrote the benchmark and ran gpt-5.4-xhigh test shown here. Paper is available on arxiv: arxiv.org/html/2603.0211… website has full traces of the model on all the puzzles ppbench.com/model/gpt-5.4@… (full details available on huggingface) For your actual questions: GPT-5.4@xhigh was run against these puzzles using the API and with this basic-agentic harness, defined here. github.com/approximatelab…
English
2
0
0
41
Justin Waugh
Justin Waugh@JustinWaugh·
@scaling01 That was a fun one for sure! Took me 17:05 (and I consider myself pretty good at these / have done many yajilin before) (also, thanks for posting this, prompted me to try it, led to learning the open-graph preview for share had a bug on X!) ppbench.com/share/QVBZEEhD…
English
1
0
4
429
Justin Waugh retweeted
Ethan Mollick
Ethan Mollick@emollick·
Exponential improvements* everywhere for those with the eyes to see them. This is a cool benchmark, and was impossible for early non-reasoner LLMs to do at all. * Okay, technically "logistic improvement" because the maximum score is bounded at 100 (and logistic has a lower AIC)
Justin Waugh@JustinWaugh

(1/N) Pencil Puzzle Bench is out! 51 LLMs tested on pencil puzzles (multi-step, logical reasoning, verifiable at each step) Dataset: 62k unique puzzles, 94 types. Evaluation: covers 300 puzzles across 20 types Best score: GPT 5.2@xhigh 56%, half the puzzles are still unsolved

English
20
22
261
57.2K
Justin Waugh
Justin Waugh@JustinWaugh·
@ChristosTzamos I recently released pencil-puzzle-bench. Awesome to see so many steps / decoding as a computer. Would be interested to see if it can adapt solutions for many puzzle types, not just sudoku as shown. ppbench.com
English
0
1
11
4.7K
Christos Tzamos
Christos Tzamos@ChristosTzamos·
1/4 LLMs solve research grade math problems but struggle with basic calculations. We bridge this gap by turning them to computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds solving even the hardest Sudokus with 100% accuracy
English
239
787
5.9K
1.6M
Justin Waugh
Justin Waugh@JustinWaugh·
@YafahEdelman pencil puzzle bench recently crossed 30%, but each LLM success is still expensive/slow ($5+, 10min+), especially compared to CPU based SAT solvers (custom to problem) that are ~7-8 orders cheaper and ~3 orders faster Huge efficiency gains still remain ppbench.com
English
0
0
1
253
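For a rough sense of scale, the orders-of-magnitude comparison in the post above can be turned into absolute numbers. This is only a back-of-envelope sketch: it assumes the quoted lower bounds ($5 and 10 minutes) are point values, and the variable and function names are made up for illustration.

```python
# Back-of-envelope conversion of the figures quoted in the post above.
# Assumption: the "$5+" and "10min+" lower bounds are used as point values.
llm_cost_usd = 5.0       # cost per LLM success, lower bound from the post
llm_time_s = 10 * 60     # time per LLM success in seconds, lower bound


def scale_down(value, orders):
    """Divide value by 10**orders (one 'order' is a factor of ten)."""
    return value / 10 ** orders


# "~7-8 orders cheaper": the implied solver cost falls between these two values.
solver_cost_range = (scale_down(llm_cost_usd, 8), scale_down(llm_cost_usd, 7))

# "~3 orders faster": the implied solver time per puzzle.
solver_time_s = scale_down(llm_time_s, 3)

print(f"Implied solver cost: ${solver_cost_range[0]:.0e} to ${solver_cost_range[1]:.0e} per puzzle")
print(f"Implied solver time: about {solver_time_s:.1f} s per puzzle")
```

Under those assumptions, a custom solver would land well under a millionth of a dollar and under a second per puzzle, which is the gap the post calls out.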
Yafah Edelman
Yafah Edelman@YafahEdelman·
Okay, what benchmarks are still under 30%?
English
32
3
60
32.6K
meta
meta@eigenform·
latest installment in "is this a factorio screenshot or old intel p-core layout"
English
40
1.1K
20.7K
290.2K
hardmaru
hardmaru@hardmaru·
In an alternate timeline we’d be using Evangelion GUI designs rather than CLIs
English
102
700
7.2K
1.6M
Jasper Dekoninck
Jasper Dekoninck@j_dekoninck·
I know about some other benchmarks where GPT-5.4-Pro seemingly does not outperform GPT-5.4 by all that much, but this clearly shows it's at least better in some areas :)
English
2
0
27
2.1K
Jasper Dekoninck
Jasper Dekoninck@j_dekoninck·
One more: We now added GPT-5.4-Pro to ArxivMath February and Apex. Extremely expensive, but it is SoTa on both by quite a significant margin
English
13
31
360
28.7K
Justin Waugh
Justin Waugh@JustinWaugh·
Just added GPT-5.4 to pencil-puzzle bench results! Large uplift (70% solved now). Longest single success for a puzzle took 95 minutes and $17.90 in inference alone See the full breakdown, play the puzzles, and look at the 5.4 traces here: ppbench.com
Justin Waugh@JustinWaugh

(1/N) Pencil Puzzle Bench is out! 51 LLMs tested on pencil puzzles (multi-step, logical reasoning, verifiable at each step) Dataset: 62k unique puzzles, 94 types. Evaluation: covers 300 puzzles across 20 types Best score: GPT 5.2@xhigh 56%, half the puzzles are still unsolved

English
1
0
4
318