notebook enthusiast

5.7K posts

@enthusednotebk

prove the existence of discomfort

six feet under · Joined November 2022

581 Following · 296 Followers

notebook enthusiast@enthusednotebk·33m

@shuvom_s nice post, makes a good point :) followed

Shuvom Sadhuka@shuvom_s·10h

CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, it seems much of the discourse on AI evals focuses on average-case or even best-case (e.g., this LLM can solve IMO problems). Maybe one key to "reliability" is certifying the 1st+99th quantile of outputs too, not just the mean/median. Model A may beat model B on average, but model A can still lose to model B if judged by the min. over several tasks. I wrote a brief blog post on this (good time to announce I started a substack!).
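A minimal sketch of the point in the post above, using made-up per-task scores (the numbers, `model_a`, `model_b`, and `summarize` are illustrative assumptions, not real eval data): one model can win on the mean and still lose on the minimum and the 1st percentile.

```python
from statistics import mean, quantiles

# Hypothetical per-task scores in [0, 1]; not taken from any real benchmark.
model_a = [0.95, 0.92, 0.97, 0.20, 0.96]
model_b = [0.80, 0.78, 0.82, 0.75, 0.79]

def summarize(scores):
    # quantiles(..., n=100) returns 99 cut points:
    # index 0 is the 1st percentile, index 98 is the 99th.
    cuts = quantiles(scores, n=100, method="inclusive")
    return {
        "mean": mean(scores),
        "min": min(scores),
        "p01": cuts[0],
        "p99": cuts[98],
    }

summary_a = summarize(model_a)
summary_b = summarize(model_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {k: round(v, 3) for k, v in summary.items()})
```

With these toy numbers, model A has the higher mean while model B has the higher minimum and 1st percentile, so the ranking flips depending on which statistic is reported.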

notebook enthusiast@enthusednotebk·2h

@heeney_luke also curious about this :O

Luke Heeney@heeney_luke·4h

Say you're a PhD applicant who believes this is the future. Where do you apply? (Pls don't just tell me to get into MIT econ)

rishi@RishiBommasani

I am really enjoying the near-daily stream of interesting papers on the economics of frontier AI. The field building is working and now we have interesting work coming from senior economists as well as junior econ researchers and PhDs. From here, I would like to see computer scientists involved. Collaborations across CS and econ are still very rare even as this area grows. Relative to what we have now, I think the econ work can be sharpened to have more acuity in its study of frontier AI technology. More CS folks at NBER convenings; more economists at NeurIPS and ICML.

notebook enthusiast@enthusednotebk·22 Mar

chicago i am in you :D

notebook enthusiast reposted

⋆☀︎｡ ོ@S0L4RFL4RE·13 Mar

when you think so little of yourself it is hard to imagine you are capable of causing great harm bc u basically don’t even think u matter. so u commit cruel actions and never recognize them as such.

notebook enthusiast reposted

Epoch AI@EpochAIResearch·12 Mar

How much of the world's advanced chip packaging and high-bandwidth memory does AI consume? Almost all of it. We estimate the four largest AI chip designers consumed ~90% of global advanced packaging and HBM supply in 2025, suggesting these inputs were bottlenecks in 2025.

notebook enthusiast@enthusednotebk·12 Mar

@bfiafls and so it is!! and so it is...

notebook enthusiast@enthusednotebk·10 Mar

@bfiafls @__vining YESSSSS YES!! NEED

John Vining@__vining·10 Mar

Introducing a new, stupid website to find a piece of classical music whose duration most closely matches that of your next trip. busundreu.com

notebook enthusiast@enthusednotebk·9 Mar

@luke_drago_ eyyyy :D

Luke Drago@luke_drago_·9 Mar

okay yeah we'll come out of stealth today

notebook enthusiast@enthusednotebk·7 Mar

@BCalusinski @pangramlabs slop?

Ben Calusinski@BCalusinski·6 Mar

met an absolutely cracked quant the other day and what i’ve noticed talking to people like this (genuinely brilliant, deeply analytical, almost frighteningly intelligent) is that they’re usually incredibly lonely inside because most people can’t be in a conversation with them their thoughts are too advanced and too technical a normal conversation just doesn’t have the bandwidth for what’s actually going on in their head so they’ve learned to compress it, hide it, or just stop sharing altogether but when you actually give them the space (genuine interest, real willingness to follow them wherever the thought goes) something shifts you can see the relief almost like they’ve been carrying this entire world inside them that’s never had anywhere to go and suddenly there’s somewhere for it to go these have honestly been some of the most profound conversations i’ve had with people not because i’m the smartest person in the room (i’m not), but because i can hold the space for it and follow the thread - fill in the pieces when they’re struggling to articulate something they’ve never had to put into words before the amount of stuff stuck inside people that never comes out simply because nobody around them has the capacity to receive it that’s the part that is so exciting for someone like me, it’s like a new discovery that’s contagious

notebook enthusiast reposted

Psyho@FakePsyho·6 Mar

Radar graphs are among the worst ideas in data visualization. The whole point of them is to show the area and you can usually reorder the labels freely in order to create a desired dramatic effect. Two versions of the same graph:
- left one tells the story that AI is rapidly replacing whole industries
- right one shows the "jaggedness" and reinforces the idea that humans will always have something that AI won't be able to replicate

Andrew Curran@AndrewCurran_

Striking image from the new Anthropic labor market impact report.

notebook enthusiast@enthusednotebk·6 Mar

@xeophon @PrimeIntellect MY GOAT!!!

Xeophon@xeophon·5 Mar

Some personal news:
- Finished another trip around the sun today 🫡
- Decided to join @PrimeIntellect to work on evals!! There’s a lot to be built and I couldn’t imagine a better team to do just that 🙌
- I will be in SF the next two weeks :) Just to look around, of course 👀

notebook enthusiast reposted

Epoch AI@EpochAIResearch·5 Mar

GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%. See thread for commentary and additional experiments.

notebook enthusiast@enthusednotebk·5 Mar

@max_spero_ @TechLayoffLover @pangramlabs haha i was looking for this comment

Max Spero@max_spero_·5 Mar

@TechLayoffLover @pangramlabs slop?

Tech Layoff Tracker@TechLayoffLover·4 Mar

Just got this DM from a follower: Hey dude, I need to vent this to someone who gets it. I've been at this Big Tech company (you know the one) for almost 6 years now—senior SWE, TC around $350k last year with RSUs still vesting. Thought I was bulletproof after surviving the 2023-2024 bloodbaths and then pivoting hard into the AI org. But fuck, the ground is shifting under my feet faster than I can keep up. Last week in our all-hands, leadership was bragging about how the team's "AI leverage ratio" hit 4.2x—meaning each engineer is now shipping what used to take a team of four. They showed the metrics: feature velocity up 180% YoY while headcount's down another 22% since Q4 '25. The slide literally had a photo of Cursor + Claude Sonnet 4 workflows replacing entire squads. Everyone clapped like trained seals, but I saw three faces go pale—they're the mid-level folks who just finished documenting their entire codebase for the "knowledge distillation" project. My direct report, this solid L5 who joined right after me, got put on a 30-day PIP after his productivity dashboard dipped below the new AI-augmented benchmark. The benchmark? It's literally what the offshore team in India hits using the exact prompts he used to write. He trained them on our internal style guide last quarter—now they're outperforming him at $28/hour all-in. He told me privately he's burning through savings and eyeing real estate licensing because "at least houses don't get refactored by agents overnight." The internal job board is a ghost town. Entry-level SWE roles? Frozen since mid-'25. What few postings go up are tagged "AI-native preferred" and get 2,000+ apps in hours, mostly from people already on H-1Bs or contractors. Meanwhile, they're quietly converting more mid-tier positions to "AI orchestration" contractors—$90-110/hour remote from LATAM or Eastern Europe, no benefits, 6-month contracts. 
My manager admitted in 1:1 that if the next Grok/Claude/Anthropic release closes the last 10-15% quality gap, we'll probably cut another layer. I'm hanging on because I'm one of the ones who owns the prompt libraries and fine-tuning pipelines now. They need humans to babysit the models until the self-improving loops actually work without constant human intervention. But I see the writing: every time we make the system more autonomous, we make our own roles more optional. The alumni Slack is full of 2024-2025 grads DMing for coffee chats because their referrals bounce—67% underemployed or gigging according to the last poll. One kid I mentored last year is back living with parents after burning through his signing bonus. I used to tell people "just upskill in AI, you'll be fine." Now I feel like a fraud saying it. If I lost this tomorrow, I'd be competing with the same offshore talent I've been helping scale, plus a flood of recently "managed out" seniors. My emergency fund is decent, but the mortgage isn't. Thinking about side hustles in trades or something offline—plumbing, electrical, anything that can't be prompted away. This feels like watching the industry eat itself from the inside while pretending it's evolution. You still feeling secure over there, or is it hitting your shop too? Need to hear I'm not going insane.

notebook enthusiast@enthusednotebk·3 Mar

@prof_g this is deeply kind of you

prof-g@prof_g·3 Mar

after every quiz i get emails from students who are stressed out about the fact that they are below the mean, etc. i asked claude to write an analysis of how much to trust each quiz score based on stats. claude did such a good job! let's see if it helps...

notebook enthusiast@enthusednotebk·2 Mar

@whitfill_parker do you think algorithmic progress measured by nanogpt speedrun is representative of algorithmic progress as a whole?

Parker Whitfill@whitfill_parker·2 Mar

To measure algorithmic progress since 2019, I retrained GPT-2 using the modern nanogpt speedrun stack. Current nanogpt SOTA is 707x faster. We can decompose total speedup into
> 15x faster FLOP per second (on fixed hardware)
> 46x less FLOPs to reach the same val loss.
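A quick arithmetic check of the decomposition quoted above, under the assumption that the two factors combine multiplicatively (the variable names are illustrative; the three figures are the ones in the post):

```python
# Figures as quoted in the post above.
total_speedup = 707          # overall wall-clock speedup
flops_per_second_gain = 15   # faster FLOP per second on fixed hardware
flop_reduction = 46          # fewer FLOPs to reach the same val loss

# Assumption: total speedup = throughput gain * FLOP reduction.
product = flops_per_second_gain * flop_reduction
residual = total_speedup / product

print(product)             # 690
print(round(residual, 3))  # 1.025
```

The rounded factors multiply to 690, about 2.5% short of 707. That gap is small enough to be explained by rounding of the two factors, though the post itself does not say so.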

notebook enthusiast reposted

Donald J. Trump@realDonaldTrump·11 Nov

Remember that I predicted a long time ago that President Obama will attack Iran because of his inability to negotiate properly-not skilled!

notebook enthusiast@enthusednotebk·26 Feb

@holdmycovfefe12 this tweet was authored by an octopus

notebook enthusiast reposted

Standard Intelligence@si_pbc·23 Feb

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.

notebook enthusiast reposted

Celeste (in london dm to hang)@celestepoasts·22 Feb

consider the following:

[attached image]

notebook enthusiast@enthusednotebk·21 Feb

@scaling01 dunno if you've seen this, but might be useful to you - they're saying the ceiling is 95% at least x.com/jyangballin/st…

John Yang@jyangballin

Across all mini-SWE-agent + <model> runs, SWE-bench Verified's current "ceiling"?
- 87.4%
(0.874 - 0.8) * 500 = another *37* instances that aren't solved consistently.
If you recalculate this number across all official SWE-bench Verified submissions?
- 95% from SWE-bench site

Lisan al Gaib@scaling01·21 Feb

What's the upper bound on SWE-Bench-Verified? I yoinked a bunch of EpochAI data and analyzed it. However, I only got to download 21 files, now it seems im permanently blocked from downloading anything lol
44/500 = 8.8% of SWE-Bench-Verified questions are either unsolvable or still too hard for even the best models like Opus 4.6 Thinking
With that we can say that the upper-bound is at least 91.2% on SWE-Bench Verified.
The hardest subset of question seems to be pylint-dev and astropy.
There were 3 questions that were only solvable by Gemini 3.0 Pro, 2 questions only solvable by GPT-5.1 and 2 questions only solvable by Kimi-K2.5
You will see the GPT-5.1 rows and columns are very bright, meaning there is almost no overlap in failures between GPT-5.1 and the other models, which is why they reran it. Its pass rate is still pretty high at 65.9% but I guess it failed a lot of the easy tasks which is why the rows and columns are bright. The two tasks that were only solvable by GPT-5.1 might be buggy.
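The ceiling figures in this thread can be reproduced with a few lines of arithmetic. This is only a sketch using the numbers quoted in the two posts (500 tasks, 44 never solved, 87.4% and 95% ceilings, 80% baseline); the variable names are illustrative.

```python
TOTAL_TASKS = 500  # size of SWE-bench Verified, as used in both posts

# From the post above: 44 tasks not solved by any analyzed model.
never_solved = 44
never_solved_share = never_solved / TOTAL_TASKS            # 0.088
upper_bound = (TOTAL_TASKS - never_solved) / TOTAL_TASKS   # 0.912

# From the quoted post: 87.4% ceiling across mini-SWE-agent runs vs. an 80% mark.
mini_ceiling_tasks = round(0.874 * TOTAL_TASKS)            # 437
baseline_tasks = round(0.80 * TOTAL_TASKS)                 # 400
gap_tasks = mini_ceiling_tasks - baseline_tasks            # 37

# 95% ceiling across all official submissions, expressed as a task count.
official_ceiling_tasks = round(0.95 * TOTAL_TASKS)         # 475

print(never_solved_share, upper_bound, gap_tasks, official_ceiling_tasks)
```

Read this way, the three estimates correspond to 437, 456, and 475 solvable tasks out of 500. Each is a lower bound on the true ceiling for the set of runs it was computed from.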
