notebook enthusiast

5.7K posts

@enthusednotebk

prove the existence of discomfort

six feet under · Joined November 2022
581 Following · 297 Followers
Bayesian @Bayesian0_0
yeah! parsed 138-so-far benchmarks from the internet, then for each model string i use a bunch of cursed heuristics to match them to some known model, i store info about each benchmark and model like release date and benchmark score column name / model column name, then i can replicate a bunch of existing analyses that were made on fewer benchmarks, like Epoch's ECI or github.com/anadim/llm-ben…. or create / test new ones. it has given me plots that blow my mind like this one (92.6% of the variance in scores across 136 benchmarks & 6568 benchmark scores is accounted for by a single factor)
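A minimal sketch of how a "share of variance explained by a single factor" figure can be computed, assuming a complete models-by-benchmarks score matrix and using the first principal component as the factor. The tweet does not say which method was used, and the real data has missing entries, so this is an illustration only; the synthetic matrix and all names below are hypothetical.

```python
import numpy as np

def first_factor_share(scores: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component.

    `scores` is a (models x benchmarks) matrix with no missing values.
    """
    # Center each benchmark column so variance is measured around its mean.
    centered = scores - scores.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance along each component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[0] / variances.sum())

# Synthetic data (an assumption, not the tweet's dataset): one latent
# "capability" per model, a sensitivity per benchmark, plus small noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
loadings = rng.uniform(0.5, 1.5, size=(1, 12))
noise = 0.2 * rng.normal(size=(50, 12))
synthetic_scores = ability @ loadings + noise

share = first_factor_share(synthetic_scores)
print(f"first factor explains {share:.1%} of the variance")
```

With real benchmark tables, most models are missing from most benchmarks, so a plain SVD like this would need an imputation or a factor model that tolerates gaps.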
[image]
3
0
1
41
Bayesian @Bayesian0_0
i'm trying to do data analysis on benchmark results across a hundred+ datasets. and they all use different formats for models, and don't specify some version info about some models (eg. say "gpt-4o" when there are ~7 different versions of gpt-4o). it is extremely cursed.
2
0
5
311
Shuvom Sadhuka @shuvom_s
CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, it seems much of the discourse on AI evals focuses on average-case or even best-case (e.g., this LLM can solve IMO problems). Maybe one key to "reliability" is certifying the 1st+99th quantile of outputs too, not just the mean/median. Model A may beat model B on average, but model A can still lose to model B if judged by the min. over several tasks. I wrote a brief blog post on this (good time to announce I started a substack!).
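A small sketch of the mean-versus-minimum point, using made-up per-task scores for two hypothetical models. The numbers are invented for illustration and are not taken from the blog post.

```python
from statistics import mean

def summarize(scores):
    """Return the average-case and worst-case view of a list of task scores."""
    return {"mean": mean(scores), "min": min(scores)}

# Hypothetical per-task accuracies: model A is strong on most tasks but
# fails badly on one; model B is mediocre but consistent.
model_a = [0.95, 0.90, 0.92, 0.30]
model_b = [0.75, 0.72, 0.70, 0.71]

a, b = summarize(model_a), summarize(model_b)
a_wins_on_mean = a["mean"] > b["mean"]
b_wins_on_min = b["min"] > a["min"]

print(f"A: mean={a['mean']:.4f} min={a['min']:.2f}")
print(f"B: mean={b['mean']:.4f} min={b['min']:.2f}")
print("A wins on mean:", a_wins_on_mean, "| B wins on min:", b_wins_on_min)
```

With many samples per task, the same comparison could be made on a low quantile instead of the strict minimum.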
[image]
3
0
13
1.5K
notebook enthusiast retweeted
⋆☀︎。 ོ @S0L4RFL4RE
when you think so little of yourself it is hard to imagine you are capable of causing great harm bc u basically don’t even think u matter. so u commit cruel actions and never recognize them as such.
77
3.3K
29.2K
488.9K
notebook enthusiast retweeted
Epoch AI @EpochAIResearch
How much of the world's advanced chip packaging and high-bandwidth memory does AI consume? Almost all of it. We estimate the four largest AI chip designers consumed ~90% of global advanced packaging and HBM supply in 2025, suggesting these inputs were bottlenecks in 2025.
[image]
4
28
143
20.7K
John Vining @__vining
Introducing a new, stupid website to find a piece of classical music whose duration most closely matches that of your next trip. busundreu.com
[image]
59
1.6K
12.5K
454.6K
Luke Drago @luke_drago_
okay yeah we'll come out of stealth today
7
2
96
8.4K
Ben Calusinski @BCalusinski
met an absolutely cracked quant the other day, and what i’ve noticed talking to people like this (genuinely brilliant, deeply analytical, almost frighteningly intelligent) is that they’re usually incredibly lonely inside, because most people can’t be in a conversation with them.
their thoughts are too advanced and too technical; a normal conversation just doesn’t have the bandwidth for what’s actually going on in their head, so they’ve learned to compress it, hide it, or just stop sharing altogether.
but when you actually give them the space (genuine interest, real willingness to follow them wherever the thought goes), something shifts. you can see the relief, almost like they’ve been carrying this entire world inside them that’s never had anywhere to go, and suddenly there’s somewhere for it to go.
these have honestly been some of the most profound conversations i’ve had with people, not because i’m the smartest person in the room (i’m not), but because i can hold the space for it and follow the thread - fill in the pieces when they’re struggling to articulate something they’ve never had to put into words before.
the amount of stuff stuck inside people that never comes out, simply because nobody around them has the capacity to receive it: that’s the part that is so exciting for someone like me. it’s like a new discovery that’s contagious.
35
31
622
67.5K
notebook enthusiast retweeted
Psyho @FakePsyho
Radar graphs are among the worst ideas in data visualization. The whole point of them is to show the area and you can usually reorder the labels freely in order to create a desired dramatic effect.
Two versions of the same graph:
- left one tells the story that AI is rapidly replacing whole industries
- right one shows the "jaggedness" and reinforces the idea that humans will always have something that AI won't be able to replicate
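The reordering effect can be checked numerically. A rough sketch, assuming equally spaced axes and invented values: the enclosed area of a radar polygon depends only on products of neighboring values, so shuffling the axis order changes the area even though the data is identical.

```python
import math
from itertools import permutations

def radar_area(values):
    """Area of a radar polygon with len(values) equally spaced axes.

    Each pair of neighboring axes spans a triangle with the origin of area
    0.5 * r_i * r_j * sin(2*pi/n); the polygon is the sum of those wedges.
    """
    n = len(values)
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))

# Hypothetical scores on six axes (not taken from the report in the image).
values = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]

areas = [radar_area(order) for order in permutations(values)]
smallest, largest = min(areas), max(areas)
print(f"smallest area: {smallest:.3f}, largest area: {largest:.3f}")
print(f"ratio: {largest / smallest:.2f}x from reordering alone")
```

Alternating high and low values produces the spiky, small-area shape; grouping the high values next to each other produces the large-area shape.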
[2 images]
Andrew Curran @AndrewCurran_: Striking image from the new Anthropic labor market impact report.
220
886
10.8K
1.2M
Florian Brand @xeophon
Some personal news:
- Finished another trip around the sun today 🫡
- Decided to join @PrimeIntellect to work on evals!! There’s a lot to be built and done; couldn’t imagine a better team to do just that 🙌
- I will be in SF the next two weeks :) Just to look around, of course 👀
195
21
903
102.5K
notebook enthusiast retweeted
Epoch AI @EpochAIResearch
GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments.
[image]
30
110
901
120.6K
Tech Layoff Tracker @TechLayoffLover
Just got this DM from a follower: Hey dude, I need to vent this to someone who gets it. I've been at this Big Tech company (you know the one) for almost 6 years now—senior SWE, TC around $350k last year with RSUs still vesting. Thought I was bulletproof after surviving the 2023-2024 bloodbaths and then pivoting hard into the AI org. But fuck, the ground is shifting under my feet faster than I can keep up. Last week in our all-hands, leadership was bragging about how the team's "AI leverage ratio" hit 4.2x—meaning each engineer is now shipping what used to take a team of four. They showed the metrics: feature velocity up 180% YoY while headcount's down another 22% since Q4 '25. The slide literally had a photo of Cursor + Claude Sonnet 4 workflows replacing entire squads. Everyone clapped like trained seals, but I saw three faces go pale—they're the mid-level folks who just finished documenting their entire codebase for the "knowledge distillation" project. My direct report, this solid L5 who joined right after me, got put on a 30-day PIP after his productivity dashboard dipped below the new AI-augmented benchmark. The benchmark? It's literally what the offshore team in India hits using the exact prompts he used to write. He trained them on our internal style guide last quarter—now they're outperforming him at $28/hour all-in. He told me privately he's burning through savings and eyeing real estate licensing because "at least houses don't get refactored by agents overnight." The internal job board is a ghost town. Entry-level SWE roles? Frozen since mid-'25. What few postings go up are tagged "AI-native preferred" and get 2,000+ apps in hours, mostly from people already on H-1Bs or contractors. Meanwhile, they're quietly converting more mid-tier positions to "AI orchestration" contractors—$90-110/hour remote from LATAM or Eastern Europe, no benefits, 6-month contracts. 
My manager admitted in 1:1 that if the next Grok/Claude/Anthropic release closes the last 10-15% quality gap, we'll probably cut another layer. I'm hanging on because I'm one of the ones who owns the prompt libraries and fine-tuning pipelines now. They need humans to babysit the models until the self-improving loops actually work without constant human intervention. But I see the writing: every time we make the system more autonomous, we make our own roles more optional. The alumni Slack is full of 2024-2025 grads DMing for coffee chats because their referrals bounce—67% underemployed or gigging according to the last poll. One kid I mentored last year is back living with parents after burning through his signing bonus. I used to tell people "just upskill in AI, you'll be fine." Now I feel like a fraud saying it. If I lost this tomorrow, I'd be competing with the same offshore talent I've been helping scale, plus a flood of recently "managed out" seniors. My emergency fund is decent, but the mortgage isn't. Thinking about side hustles in trades or something offline—plumbing, electrical, anything that can't be prompted away. This feels like watching the industry eat itself from the inside while pretending it's evolution. You still feeling secure over there, or is it hitting your shop too? Need to hear I'm not going insane.
[4 images]
138
431
2.8K
294.3K
prof-g @prof_g
after every quiz i get emails from students who are stressed out about the fact that they are below the mean, etc. i asked claude to write an analysis of how much to trust each quiz score based on stats. claude did such a good job! let's see if it helps...
[3 images]
2
0
25
2.9K
notebook enthusiast @enthusednotebk
@whitfill_parker do you think algorithmic progress measured by nanogpt speedrun is representative of algorithmic progress as a whole?
1
0
5
738
Parker Whitfill @whitfill_parker
To measure algorithmic progress since 2019, I retrained GPT-2 using the modern nanogpt speedrun stack. Current nanogpt SOTA is 707x faster. We can decompose total speedup into > 15x faster FLOP per second (on fixed hardware) > 46x less FLOPs to reach the same val loss.
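A quick arithmetic check of the decomposition, taking the two quoted factors at face value as roughly 15x and 46x. Wall-clock time to reach a fixed loss equals FLOPs required divided by FLOP per second, so the two gains multiply. The variable names are illustrative, and the small gap to 707x is presumably rounding in the quoted factors, which is an assumption here.

```python
import math

# Rounded factors as quoted in the tweet.
throughput_gain = 15.0       # faster FLOP per second on fixed hardware
flop_efficiency_gain = 46.0  # fewer FLOPs to reach the same validation loss
reported_total = 707.0       # reported end-to-end speedup

# time = FLOPs / (FLOP per second), so the speedups compose multiplicatively.
implied_total = throughput_gain * flop_efficiency_gain
relative_gap = abs(implied_total - reported_total) / reported_total

# Share of the total attributable to each factor, measured in log space.
throughput_share = math.log(throughput_gain) / math.log(implied_total)
flop_share = 1.0 - throughput_share

print(f"implied total: {implied_total:.0f}x (reported {reported_total:.0f}x, gap {relative_gap:.1%})")
print(f"log-share: throughput {throughput_share:.0%}, FLOP efficiency {flop_share:.0%}")
```

In log terms, the reduction in FLOPs accounts for somewhat more of the total than the throughput improvement does.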
[image]
8
28
250
31.9K
notebook enthusiast retweeted
Donald J. Trump @realDonaldTrump
Remember that I predicted a long time ago that President Obama will attack Iran because of his inability to negotiate properly-not skilled!
7.5K
75.8K
157.7K
0