
Nathan Brown
733 posts

Nathan Brown
@OxxoTweets
Applied Scientist @ Microsoft; multilingual LLMs and other shenanigans; Masters grad @ Clemson; Probably staring at wandb logs; DMs open

Remember when AI couldn't spell text or draw a hand with five fingers? We've come a long way.

Just learned: Software engineers used to do manual data labeling at Scale AI while Alex Wang was CEO. After he left, new leadership joined and was HORRIFIED to learn this. Stopped it ASAP. Now at Meta, software engineers are assigned manual data labeling... see the pattern?

Shocking result on my pelican benchmark this morning: I got a better pelican from a 21GB local Qwen3.6-35B-A3B running on my laptop than I did from the new Opus 4.7! Qwen on the left, Opus on the right.

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

This stunt feels irresponsible to me. If we don't want regular people developing toxic relationships with their chatbots, it really doesn't help for leading labs to start giving them "retirement interviews" and encouraging them to blog their "musings and reflections".

We dug into WHY this happens at the architecture level. The model's sense of where things are on screen decays exponentially through its layers. By the time it needs to output coordinates, the positional signal has faded. We confirmed this by simply scaling the positional embedding by 3x. Click accuracy jumped from 40% to 80%. No retraining.
