Ben Cohen
569 posts

@blc_16
spends too much time watching football and coding. Prev @Meta, @Microsoft



I am glad that I don't live in New York, not that I hate any part of it. On the contrary, the absence makes many parts of the city stir stronger memories when I revisit them... forgetting and then rediscovering things in such a great city is quite a unique experience.



My latest, on the continuing fiasco at the CPP investment fund, which has now spent more than $50 billion over twenty years to lose about $100 billion relative to what it might have earned, for an equal amount of risk, if it had just bought the relevant indexes — or flung darts at the stock listings. theglobeandmail.com/opinion/articl…





JUST IN: The Enhanced Games are set to debut this weekend in Las Vegas, with athletes allowed to use steroids, testosterone, HGH, & other banned substances.


OpenAI is offering $2M in tokens to every YC company in the spring and summer batches. We extended the summer deadline to May 25 so more founders can get in on it. ycombinator.com/apply

I don't understand how people are still coping about Mythos. Here's a few benchmarks:

SWE-bench Pro: Mythos -> 77.8%, GPT-5.5 -> 58.6%
HLE: Mythos -> 56.8%, GPT-5.5 -> 41.4%

UK AISI cyber ranges:
- "The Last Ones": Mythos -> 6/10, GPT-5.5 -> 3/10
- "Cooling Tower": Mythos -> 3/10, GPT-5.5 -> 0/10

ExploitBench:
- Mythos -> 18 Arbitrary Code Executions
- GPT-5.5 -> 0 Arbitrary Code Executions

ExploitGym:
- Mythos -> 157 exploits (289.3 LLM calls)
- GPT-5.5 -> 120 exploits (375.4 LLM calls)

XBOW same story. Mythos has much higher odds of finding vulnerabilities within smaller token budgets.

@levelsio @TermiusHQ So do you write code, or do you just tell Claude what you want? E.g. you want a new feature or a change: do you just prompt, or is Claude mainly the assistant here? My question is mostly how much trust you put in Claude while you are on prod.


Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.
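
To make the idea of a masked per-token loss concrete, here is a rough sketch. The post does not say what the two filtering steps are, so both steps below are assumptions chosen only for illustration: step 1 drops tokens where the student already matches the teacher (low KL), and step 2 keeps the top-k survivors by a relevance score. The names `token_kl`, `two_step_mask`, `masked_distillation_loss`, `kl_threshold`, `top_k`, and the `relevance` scores are all hypothetical and are not taken from RMSD itself.

```python
import math


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for a single token position."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def two_step_mask(kl_per_token, relevance, kl_threshold, top_k):
    """Build a 0/1 loss mask with two filters (both are illustrative assumptions)."""
    # Step 1 (assumed): discard tokens where the student already agrees with the teacher.
    candidates = [i for i, kl in enumerate(kl_per_token) if kl > kl_threshold]
    # Step 2 (assumed): among the survivors, keep the top_k by relevance score.
    ranked = sorted(candidates, key=lambda i: relevance[i], reverse=True)
    keep = set(ranked[:top_k])
    return [1.0 if i in keep else 0.0 for i in range(len(kl_per_token))]


def masked_distillation_loss(teacher, student, relevance, kl_threshold=0.01, top_k=2):
    """Average the per-token KL over the tokens that pass the mask."""
    kls = [token_kl(t, s) for t, s in zip(teacher, student)]
    mask = two_step_mask(kls, relevance, kl_threshold, top_k)
    kept = sum(mask)
    if kept == 0:
        return 0.0, mask
    loss = sum(kl * m for kl, m in zip(kls, mask)) / kept
    return loss, mask


# Three token positions over a two-word vocabulary (toy numbers).
teacher = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
relevance = [1.0, 0.2, 0.8]

loss, mask = masked_distillation_loss(teacher, student, relevance, top_k=1)
print(mask, round(loss, 4))
```

In this toy run the first token is filtered out in step 1 despite its high relevance, because teacher and student already agree there, so only the third token contributes to the loss. Unmasked self-distillation would instead average over every position, which is the noisy channel the post describes.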







