notebook enthusiast

5.7K posts

@enthusednotebk

prove the existence of discomfort

six feet under · Joined November 2022

581 Following · 297 Followers

notebook enthusiast@enthusednotebk·5h

@Bayesian0_0 @fleetingbits very cool, i assume you have seen BenchPress already right? i think those results directionally agree with yours :) x.com/DimitrisPapail…

Dimitris Papailiopoulos@DimitrisPapail

x.com/i/article/2026…

Bayesian@Bayesian0_0·7h

yeah! parsed 138-so-far benchmarks from the internet, then for each model string i use a bunch of cursed heuristics to match them to some known model, i store info about each benchmark and model like release date and benchmark score column name / model column name, then i can replicate a bunch of existing analyses that were made on fewer benchmarks, like Epoch's ECI or github.com/anadim/llm-ben…. or create / test new ones. it has given me plots that blow my mind like this one (92.6% of the variance in scores across 136 benchmarks & 6568 benchmark scores is accounted for by a single factor)
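A minimal sketch of what "variance accounted for by a single factor" can mean, assuming it refers to the share of total variance captured by the first principal component of a models-by-benchmarks score matrix. This is not the author's pipeline: the data below is synthetic, the names `first_factor_share`, `loadings`, and `rows` are hypothetical, and the matrix is complete, whereas real benchmark tables are sparse and would need missing-value handling.

```python
import random

def first_factor_share(rows, iters=200):
    """Share of total variance captured by the top principal component.

    rows: equal-length score lists, one row per model, one column per benchmark.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between benchmarks (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the top eigenvalue; the trace is the total variance.
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    return top / total

# Synthetic data (an assumption for illustration): each model has one latent
# "ability", and each benchmark score is a scaled copy of it plus noise.
random.seed(0)
loadings = [1.0, 0.8, 1.2, 0.9, 1.1]
rows = []
for _ in range(50):
    ability = random.gauss(0, 1)
    rows.append([w * ability + random.gauss(0, 0.2) for w in loadings])

share = first_factor_share(rows)
print(f"first factor explains {share:.1%} of the variance")
```

When scores are generated from one latent ability like this, the first component absorbs nearly all of the variance by construction, so a high share on real data is only informative relative to how the matrix was assembled.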

Bayesian@Bayesian0_0·11h

i'm trying to do data analysis on benchmark results across a hundred+ datasets. and they all use different formats for models, and don't specify some version info about some models (eg. say "gpt-4o" when there are ~7 different versions of gpt-4o). it is extremely cursed.

notebook enthusiast@enthusednotebk·5h

@Bayesian0_0 been there,,,,,,,,

notebook enthusiast retweeted

Joel Becker@joel_bkr·10h

new paper begging to be written: "Many Benchmarks Would Saturate Much Earlier If You Let The AIs Use Adequate Token Budgets" x.com/joel_bkr/statu…

Joel Becker@joel_bkr

there's a paper begging to be written called "most SWE-bench PRs would not be merged into main"

notebook enthusiast@enthusednotebk·7h

@shuvom_s nice post, makes a good point :) followed

Shuvom Sadhuka@shuvom_s·17h

CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, it seems much of the discourse on AI evals focuses on average-case or even best-case (e.g., this LLM can solve IMO problems). Maybe one key to "reliability" is certifying the 1st+99th quantile of outputs too, not just the mean/median. Model A may beat model B on average, but model A can still lose to model B if judged by the min. over several tasks. I wrote a brief blog post on this (good time to announce I started a substack!).
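The mean-versus-min point can be illustrated with a small sketch. The scores below are invented purely for illustration, and the names `scores`, `summarize`, `best_by_mean`, and `best_by_min` are hypothetical.

```python
from statistics import mean

# Hypothetical per-task scores, made up for this example.
scores = {
    "model_a": [0.95, 0.90, 0.92, 0.40],  # strong on average, one bad task
    "model_b": [0.80, 0.78, 0.82, 0.75],  # weaker on average, no bad task
}

def summarize(task_scores):
    """Return the average-case and worst-case score over tasks."""
    return {"mean": mean(task_scores), "min": min(task_scores)}

summary = {name: summarize(s) for name, s in scores.items()}

# Rank the two models under each criterion.
best_by_mean = max(summary, key=lambda m: summary[m]["mean"])
best_by_min = max(summary, key=lambda m: summary[m]["min"])

print("best by mean:", best_by_mean)
print("best by min: ", best_by_min)
```

With these numbers the two criteria pick different winners, which is the situation the post describes: a single weak task drags the worst-case score down without moving the mean much.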

notebook enthusiast@enthusednotebk·10h

@heeney_luke also curious about this :O

Luke Heeney@heeney_luke·11h

Say you're a PhD applicant who believes this is the future. Where do you apply? (Pls don't just tell me to get into MIT econ)

rishi@RishiBommasani

I am really enjoying the near-daily stream of interesting papers on the economics of frontier AI. The field building is working and now we have interesting work coming from senior economists as well as junior econ researchers and PhDs. From here, I would like to see computer scientists involved. Collaborations across CS and econ are still very rare even as this area grows. Relative to what we have now, I think the econ work can be sharpened to have more acuity in its study of frontier AI technology. More CS folks at NBER convenings; more economists at NeurIPS and ICML.

notebook enthusiast@enthusednotebk·22 Mar

chicago i am in you :D

notebook enthusiast retweeted

⋆☀︎｡ ོ@S0L4RFL4RE·13 Mar

when you think so little of yourself it is hard to imagine you are capable of causing great harm bc u basically don’t even think u matter. so u commit cruel actions and never recognize them as such.

notebook enthusiast retweeted

Epoch AI@EpochAIResearch·12 Mar

How much of the world's advanced chip packaging and high-bandwidth memory does AI consume? Almost all of it. We estimate the four largest AI chip designers consumed ~90% of global advanced packaging and HBM supply in 2025, suggesting these inputs were bottlenecks in 2025.

notebook enthusiast@enthusednotebk·12 Mar

@bfiafls and so it is!! and so it is...

notebook enthusiast@enthusednotebk·10 Mar

@bfiafls @__vining YESSSSS YES!! NEED

John Vining@__vining·10 Mar

Introducing a new, stupid website to find a piece of classical music whose duration most closely matches that of your next trip. busundreu.com

notebook enthusiast@enthusednotebk·9 Mar

@luke_drago_ eyyyy :D

Luke Drago@luke_drago_·9 Mar

okay yeah we'll come out of stealth today

notebook enthusiast@enthusednotebk·7 Mar

@BCalusinski @pangramlabs slop?

Ben Calusinski@BCalusinski·6 Mar

met an absolutely cracked quant the other day and what i’ve noticed talking to people like this (genuinely brilliant, deeply analytical, almost frighteningly intelligent) is that they’re usually incredibly lonely inside because most people can’t be in a conversation with them
their thoughts are too advanced and too technical
a normal conversation just doesn’t have the bandwidth for what’s actually going on in their head so they’ve learned to compress it, hide it, or just stop sharing altogether
but when you actually give them the space (genuine interest, real willingness to follow them wherever the thought goes) something shifts
you can see the relief
almost like they’ve been carrying this entire world inside them that’s never had anywhere to go and suddenly there’s somewhere for it to go
these have honestly been some of the most profound conversations i’ve had with people not because i’m the smartest person in the room (i’m not), but because i can hold the space for it and follow the thread - fill in the pieces when they’re struggling to articulate something they’ve never had to put into words before
the amount of stuff stuck inside people that never comes out simply because nobody around them has the capacity to receive it
that’s the part that is so exciting for someone like me, it’s like a new discovery that’s contagious

notebook enthusiast retweeted

Psyho@FakePsyho·6 Mar

Radar graphs are among the worst ideas in data visualization. The whole point of them is to show the area and you can usually reorder the labels freely in order to create a desired dramatic effect. Two versions of the same graph:
- left one tells the story that AI is rapidly replacing whole industries
- right one shows the "jaggedness" and reinforces the idea that humans will always have something that AI won't be able to replicate
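The reordering effect can be checked numerically. A rough sketch, assuming a standard radar chart with equally spaced axes, where the polygon area is the sum of triangles between neighbouring axes: (1/2)·sin(2π/n)·Σ rᵢ·rᵢ₊₁. The values and the function name `radar_area` are hypothetical and unrelated to the chart discussed in the post.

```python
import math

def radar_area(values):
    """Area of the radar polygon for values placed on equally spaced axes.

    Each pair of neighbouring axes spans a triangle with area
    0.5 * r_i * r_next * sin(2*pi/n); the polygon is the sum of those.
    Requires at least three axes.
    """
    n = len(values)
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))

# Same six made-up values, two different axis orderings.
grouped = [1.0, 1.0, 1.0, 0.2, 0.2, 0.2]      # high scores placed side by side
alternating = [1.0, 0.2, 1.0, 0.2, 1.0, 0.2]  # high and low scores interleaved

area_grouped = radar_area(grouped)
area_alternating = radar_area(alternating)

print(f"grouped:     {area_grouped:.3f}")
print(f"alternating: {area_alternating:.3f}")
```

Because the area depends on products of neighbouring values, placing large values next to each other inflates the shape, while interleaving them with small values produces a spiky, smaller polygon from identical data.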

Andrew Curran@AndrewCurran_

Striking image from the new Anthropic labor market impact report.

notebook enthusiast@enthusednotebk·6 Mar

@xeophon @PrimeIntellect MY GOAT!!!

Florian Brand@xeophon·5 Mar

Some personal news:
- Finished another trip around the sun today 🫡
- Decided to join @PrimeIntellect to work on evals!! There’s a lot to be build and do couldn’t imagine a better team to do just that 🙌
- I will be in SF the next two weeks :) Just to look around, of course 👀

notebook enthusiast retweeted

Epoch AI@EpochAIResearch·5 Mar

GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments.

notebook enthusiast@enthusednotebk·5 Mar

@max_spero_ @TechLayoffLover @pangramlabs haha i was looking for this comment

Max Spero@max_spero_·5 Mar

@TechLayoffLover @pangramlabs slop?

Tech Layoff Tracker@TechLayoffLover·4 Mar

Just got this DM from a follower: Hey dude, I need to vent this to someone who gets it. I've been at this Big Tech company (you know the one) for almost 6 years now—senior SWE, TC around $350k last year with RSUs still vesting. Thought I was bulletproof after surviving the 2023-2024 bloodbaths and then pivoting hard into the AI org. But fuck, the ground is shifting under my feet faster than I can keep up. Last week in our all-hands, leadership was bragging about how the team's "AI leverage ratio" hit 4.2x—meaning each engineer is now shipping what used to take a team of four. They showed the metrics: feature velocity up 180% YoY while headcount's down another 22% since Q4 '25. The slide literally had a photo of Cursor + Claude Sonnet 4 workflows replacing entire squads. Everyone clapped like trained seals, but I saw three faces go pale—they're the mid-level folks who just finished documenting their entire codebase for the "knowledge distillation" project. My direct report, this solid L5 who joined right after me, got put on a 30-day PIP after his productivity dashboard dipped below the new AI-augmented benchmark. The benchmark? It's literally what the offshore team in India hits using the exact prompts he used to write. He trained them on our internal style guide last quarter—now they're outperforming him at $28/hour all-in. He told me privately he's burning through savings and eyeing real estate licensing because "at least houses don't get refactored by agents overnight." The internal job board is a ghost town. Entry-level SWE roles? Frozen since mid-'25. What few postings go up are tagged "AI-native preferred" and get 2,000+ apps in hours, mostly from people already on H-1Bs or contractors. Meanwhile, they're quietly converting more mid-tier positions to "AI orchestration" contractors—$90-110/hour remote from LATAM or Eastern Europe, no benefits, 6-month contracts. 
My manager admitted in 1:1 that if the next Grok/Claude/Anthropic release closes the last 10-15% quality gap, we'll probably cut another layer. I'm hanging on because I'm one of the ones who owns the prompt libraries and fine-tuning pipelines now. They need humans to babysit the models until the self-improving loops actually work without constant human intervention. But I see the writing: every time we make the system more autonomous, we make our own roles more optional. The alumni Slack is full of 2024-2025 grads DMing for coffee chats because their referrals bounce—67% underemployed or gigging according to the last poll. One kid I mentored last year is back living with parents after burning through his signing bonus. I used to tell people "just upskill in AI, you'll be fine." Now I feel like a fraud saying it. If I lost this tomorrow, I'd be competing with the same offshore talent I've been helping scale, plus a flood of recently "managed out" seniors. My emergency fund is decent, but the mortgage isn't. Thinking about side hustles in trades or something offline—plumbing, electrical, anything that can't be prompted away. This feels like watching the industry eat itself from the inside while pretending it's evolution. You still feeling secure over there, or is it hitting your shop too? Need to hear I'm not going insane.

notebook enthusiast@enthusednotebk·3 Mar

@prof_g this is deeply kind of you

prof-g@prof_g·3 Mar

after every quiz i get emails from students who are stressed out about the fact that they are below the mean, etc. i asked claude to write an analysis of how much to trust each quiz score based on stats. claude did such a good job! let's see if it helps...

notebook enthusiast@enthusednotebk·2 Mar

@whitfill_parker do you think algorithmic progress measured by nanogpt speedrun is representative of algorithmic progress as a whole?

Parker Whitfill@whitfill_parker·2 Mar

To measure algorithmic progress since 2019, I retrained GPT-2 using the modern nanogpt speedrun stack. Current nanogpt SOTA is 707x faster. We can decompose total speedup into
> 15x faster FLOP per second (on fixed hardware)
> 46x less FLOPs to reach the same val loss.
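A quick arithmetic check of that decomposition, as a sketch. It assumes the two factors combine multiplicatively, which follows if time to a fixed validation loss equals FLOPs required divided by FLOP per second achieved. It uses only the rounded figures quoted in the post; the variable names are hypothetical.

```python
import math

# Rounded figures as quoted in the post.
total_speedup = 707        # overall wall-clock speedup
flops_per_sec_gain = 15    # faster FLOP per second on fixed hardware
flop_reduction = 46        # fewer FLOPs to reach the same val loss

# time = FLOPs needed / (FLOP per second), so the speedups multiply.
product = flops_per_sec_gain * flop_reduction
relative_gap = (total_speedup - product) / total_speedup

# On a log scale the two factors add, which gives each one's share of the total.
throughput_share = math.log(flops_per_sec_gain) / math.log(product)
efficiency_share = math.log(flop_reduction) / math.log(product)

print(f"15 x 46 = {product}, vs. reported {total_speedup}")
print(f"relative gap: {relative_gap:.1%}")
print(f"log-share from throughput: {throughput_share:.0%}")
print(f"log-share from FLOP efficiency: {efficiency_share:.0%}")
```

The product of the rounded factors lands within a few percent of 707. The small gap is plausibly rounding in the quoted factors, though the post does not say so.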