
John Bryan

I’m really surprised that people don’t see this. It’s mathematically true that LLMs can’t come up with novel ideas, because the whole point of training is to reduce loss and gain rewards so that the model adheres to rules and ground truth. If a model could come up with novel ideas, it would have to show high loss during SFT or RL.

You'll see a lot of doctors come out "against" this kind of broad screening system. They can even get quite agitated about it. This resistance stems from a well-established clinical consensus: traditional population-level imaging fails to improve health outcomes because false positives and invasive follow-ups do more harm than good. But this view suffers from an obvious blind spot. Existing studies rely on static data and completely ignore time-series imaging. And time-series imaging is ignored because we haven't been able to afford high-frequency imaging at population scale. Clearly, a time series is going to be immensely more valuable than a single image. If you drop costs, value can go from 0 to 1. On a more fundamental level, the argument against screening rests on an obviously false precept, "More information is bad," which is just clearly untrue. More information is better; you just have to interpret it correctly.



"Transformers" by Daniel Jurafsky and James H. Martin is one of the clearest and most mathematically grounded introductions to the Transformer architecture I have ever read. Chapter 8 introduces the Transformer as the standard architecture behind modern large language models. What makes this chapter particularly interesting is its step-by-step presentation of the underlying mechanisms: contextual embeddings, self-attention, query, key and value vectors, scaled dot-product attention, multi-head attention, residual streams, feedforward layers, layer normalization, masking, and the parallel matrix formulation of attention. In particular, the treatment of attention as a weighted sum of contextual representations is especially valuable. The chapter first develops an intuitive, simplified view of attention and then gradually derives the full formulation using the Q, K, and V matrices. This approach makes it easier to understand what is actually happening inside the architecture from an algebraic and matrix-based perspective, rather than simply viewing the usual block diagrams. I think it is an excellent resource for anyone interested in understanding how Transformers work from linguistic, mathematical, and computational perspectives. web.stanford.edu/~jurafsky/slp3…



Has anyone else noticed that cancer tends to happen to good people?

My full response to Ted Chiang’s Atlantic essay on AI consciousness is up. In short, he’s a brilliant novelist doing bad philosophy of mind — conceptual confusions, appeals to vibes, and claims out of proportion to evidence. Link in comments!