Tom M

Garry Kasparov@Kasparov63·8h

Indeed. The mantra to "follow the science" breaks down on the left as soon as it conflicts with their dogma.

Tom Elliott@tomselliott

NYT's @NickKristof to fellow progressives: "A black kid in Mississippi is 2.5 times as likely to be proficient in math & reading by 4th grade as a black kid in Calif. Do we need to look a little bit less at what the Trump Admin is doing ... & look a little more in the mirror?"

Tom M@TomM251385·4h

@JLARSONPDX @Sakshi26133806 @tomselliott @NickKristof NAEP does age-based sampling: if you hold back the worst performers one year, they are too old to take the test when they enter the 4th grade the next year, and they are excluded from the sample.

Jeff Larson@JLARSONPDX·7h

@TomM251385 @Sakshi26133806 @tomselliott @NickKristof Third grade until dropping out is the callous horseshit. So outcomes are better and the retention rate has dropped? What do you think you’re proving here? Are not California outcomes more than 5.5% worse?

Tom Elliott@tomselliott·10h

NYT's @NickKristof to fellow progressives: "A black kid in Mississippi is 2.5 times as likely to be proficient in math & reading by 4th grade as a black kid in Calif. Do we need to look a little bit less at what the Trump Admin is doing ... & look a little more in the mirror?"

Tom M@TomM251385·6h

@nwbvt @tomselliott @NickKristof No, it isn't apples to apples, since the goal is educating students. If a lot of students take an extra year or two to achieve the same target proficiency, it does not mean you are doing a good job educating. The claim was implying Mississippi is doing better at educating.

Nick Brown@nwbvt·6h

@TomM251385 @tomselliott @NickKristof No, it's already apples to apples. They are all at the same grade level. Giving struggling kids an extra year of schooling is one of the positive things they are doing.

Tom M@TomM251385·6h

@nwbvt @tomselliott @NickKristof I'm not arguing that they shouldn't be held back. I'm pointing out that Mississippi has a 6.5-9% 3rd grade retention rate and California has 1%. If you excluded the 5.5%-8% of the worst California 4th grade performers, then it would be apples to apples.

Nick Brown@nwbvt·7h

@TomM251385 @tomselliott @NickKristof Kids who haven't met learning milestones *should* be held back. Giving struggling kids an extra year of public education isn't "gaming the system", it's exactly how the system is supposed to be used.

Tom M@TomM251385·7h

@JLARSONPDX @Sakshi26133806 @tomselliott @NickKristof The held-back (retention) rate for 3rd grade students in Mississippi was 9% in 2019 and 6.5% in 2025. California has less than 1% held back in the third grade.

Jeff Larson@JLARSONPDX·7h

@TomM251385 @Sakshi26133806 @tomselliott @NickKristof Lol this is nonsense. It’s way too big of a sample size for that callous bullshit “analysis”

Tom M@TomM251385·7h

@loganrobinson @tomselliott @NickKristof The goal is presumably percentage of children who attain adult literacy and numeracy. If you hold back all poor performing kids in grade school till they drop out or you expel them you can really boost the test scores and high school graduation rate while being awful educators.

Logan Robinson (露雁呂敏尊)@loganrobinson·7h

@TomM251385 @tomselliott @NickKristof But shouldn't the purpose of the school be to graduate kids who actually learned the material? Which system is better: the one that "graduates" 18 year olds who can't read or the one who graduates 19 year olds who can? Mississippi is an American success story.

Tom M@TomM251385·7h

@Sakshi26133806 @tomselliott @NickKristof If you only test kids who are guaranteed to be able to pass the test, you can easily get a nearly 100% pass rate; it doesn't mean you are good at educating. Also, retention is automatic, so there is no guarantee they will ever take the test (they might be in 3rd grade till dropping out).

Sakshi@Sakshi26133806·7h

@TomM251385 @tomselliott @NickKristof So- they’re holding them back because they’re not good enough to make it through the next year- how is it bad? Or gaming the system? They take the test the next year?

Tom M@TomM251385·7h

@nardaz @tomselliott @NickKristof I'm saying that it is a meaningless comparison.

Lenny Thomas@nardaz·7h

@TomM251385 @tomselliott @NickKristof So it’s better to pass a kid who isn’t ready?

Tom M@TomM251385·1d

@PatrickC1995 The argument isn't about labor, but the network effects and infrastructure requirements and frameworks required for any business to succeed. Amazon could never have been created without huge tax expenditures on roads, a robust USPS, a massively subsidized car industry, etc.

Patrick Carroll@PatrickC1995·1d

One of the biggest mental blocks of the left is their inability to think about value creation in any way other than manual labor. A good counter-example is the professional athlete, or pop-star. Who did Taylor Swift exploit to become a billionaire?

𐌁𐌉Ᏽ 𐌕𐌉𐌌𐌉@OrevaZSN

No one “earns” a billion dollars. No one can work a billion times harder than anyone else. There is no good, ethical, or righteous billionaire. That kind of wealth can only be accumulated through the exploitation of the working class. No exceptions.

Tom M@TomM251385·1d

@HeMuyu0327 Try also just providing the original position information.

Muyu He@HeMuyu0327·2d

I am now highly skeptical of the claim that adding the token embedding to deeper layers improves the model by "preserving the original token information", and think that the reason it improves at all is much simpler.

How the hypothesis was made. It was proposed in the Value Residual Learning Paper based on the **fact** that if you add the first value vector v1 / token embedding x0 to deeper layers' value vector / residual stream with equal weight (0.5 * v1 + 0.5 * v), the model's validation loss improves significantly. And we later found that adding **any** linear transformation of x0 helps just as much.

Ablation setup. If the model truly improves because deeper layers have access to the original x0 information, then this ablation should not change the model performance: killing the gradient of x0 when it is added to subsequent layer in this extra path. Since x0 (and the embedding layer) will receive regular updates via the standard computation path, x0 will always be able to supply token information to deeper layers, and deep layers' attention module can learn to use it properly. Therefore, for both adding x0 to later x and adding a linear transformation of x0 to later v, we run an ablation that detaches/kills x0's gradient during the forward pass.

Experiment result. In both ablations, we find that most of the improvement is gone. Although the model has access to a perfectly valid x0 information in deeper layers and can update attn/MLP weights to utilize it, it never recovers most of the benefits we see in the baseline. This seems to suggest that "value residual learning" mostly (not all) works not because valuable x0 info is passed down, but because there is some benefit to **the embedding layer** by adding x0 to deeper layers.

There might be two ways the embedding layer can be benefitted. One is just pure gradient benefits: that value residual learning is some advanced form of residual connection that handles vanishing gradients better. Need to do some math to see if this holds. The other is that the forward pass set up in this way updates the embedding space in a meaningful way, so tokens can have a more optimal representation.

Next up will want to do ablations to test both hypotheses. And of course I might just have missed something simple.
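
The detach ablation described in the post can be illustrated with a toy scalar model. This is only a sketch, not the author's code: the one-weight "layer", the function names `mix_forward` and `grad_wrt_x0`, and the numbers are all hypothetical; only the 0.5/0.5 mix comes from the post. It shows that detaching x0 on the extra path leaves the forward value unchanged while removing that path's contribution to the gradient that reaches the embedding.

```python
def mix_forward(x0, w):
    """Toy 'deep layer' (hypothetical): h = w * x0, then mix x0 back in
    with equal weight, mirroring the 0.5 * v1 + 0.5 * v rule in the post.
    Detaching x0 on the extra path does not change this forward value."""
    h = w * x0
    return 0.5 * h + 0.5 * x0


def grad_wrt_x0(w, detach_skip):
    """d(out)/d(x0) for the toy model, derived by hand.

    The standard path contributes 0.5 * w. The extra path contributes 0.5,
    unless its gradient is detached, in which case it contributes nothing.
    """
    main_path = 0.5 * w
    skip_path = 0.0 if detach_skip else 0.5
    return main_path + skip_path


x0, w = 2.0, 0.1  # arbitrary illustrative values
out = mix_forward(x0, w)
g_full = grad_wrt_x0(w, detach_skip=False)
g_detached = grad_wrt_x0(w, detach_skip=True)
print(f"forward={out:.3f} grad_full={g_full:.3f} grad_detached={g_detached:.3f}")
```

In an autodiff framework the same switch is typically a single call on the extra path, for example `x0.detach()` in PyTorch or `jax.lax.stop_gradient(x0)` in JAX; which of these the author used is not stated.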

Tom M@TomM251385·2d

@AnthonyGSupreme They immediately began using it for explosives, grenades, incendiary arrows, etc.

Anthony@AnthonyGSupreme·2d

This debunks White people's claims of "Everyone would do what we did if they were in our position"... First thing China did when they discovered gunpowder was create fireworks... first thing whites did was create arms to kill each other and the rest of the world lol

David Shapiro (L/0)@DaveShapi

For context, China had the technology for gunpowder, printing, and blue ocean exploration hundreds of years before Europe. They failed to use it. That goes beyond a generational fumble. That is a self inflicted civilizational fumble that China still hasn't made up for.

Tom M@TomM251385·3d

@0xdoug Lots of corporate devs use the API with fast path. That is about $10k-20k a month for a high productivity developer (1.5-3x token usage and 6x cost per token). So covering $40B a year with just corporate devs on the fast API, at $120k-240k per dev per year, would only require about 200k devs, or 5% of total US developers.
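
The arithmetic in this reply can be checked with a few lines. This is a rough sketch that takes the reply's own figures at face value: $40B of annual revenue and $10k-20k of monthly spend per developer. The 4 million US developer count is borrowed from the thread being replied to, and the function name `devs_needed` is made up for illustration.

```python
REVENUE_PER_YEAR = 40e9   # dollars per year, the figure used in the reply
US_DEVS = 4_000_000       # "software workers" count quoted in the thread


def devs_needed(monthly_spend):
    """Number of developers needed to cover the revenue at a given monthly spend."""
    return REVENUE_PER_YEAR / (monthly_spend * 12)


for monthly in (10_000, 20_000):
    devs = devs_needed(monthly)
    share = devs / US_DEVS
    print(f"${monthly:,}/month -> {devs:,.0f} devs ({share:.1%} of US devs)")
```

Under these assumptions the range works out to roughly 170k to 330k developers, which brackets the "about 200k devs" estimate.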

Doug Colkitt@0xdoug·3d

A lot of people objected to, were even outraged by, restricting the revenue analysis to American software workers. But let’s think about how much Anthropic revenue could be international. Excluding China and Russia (Claude not available), the US makes up about 20% of global devs. But one thing to keep in mind is, to get to $44 billion ARR, the only thing that really matters is devs paying $1k/month. Even Max subscribers don’t move the needle at that scale. Realistically, devs outside America aren’t expensive enough for this to make financial sense. Median software engineer salary in India (largest international dev market outside US and China) is $30k. Is it reasonable to believe that Indian firms are casually accepting a 50% cost hike on their software engineers to give them API scale token budgets? Again, I believe many, if not most, Indian devs are using Claude. But my guess is the vast majority are (like myself) using the much more cost effective subscription and managing limits. Same story in most other countries. Median software engineer salary in Europe is $60k, in Japan $50k, in Brazil $30k, in Britain $65k. I’m not saying there is zero API usage in these countries, but I think it’s unrealistic to think these countries have anywhere near the casual API spend that US software engineers have. I think a conservative upper bound on international API revenue is maybe 50%. If we go back to US devs being 20% of the global market, then assuming they have 2X Claude spend intensity seems reasonable. So at most, the original analysis has a 2X on the denominator. Doesn’t really fundamentally change anything in the order of magnitudes. To the extent software work is driving the Anthropic ARR, either near majorities of devs must be spending thousands per month, or there must be significant super users who are blowing through tens or even hundreds of thousands per month.

Doug Colkitt@0xdoug

I’m really struggling to see how the back of the envelope math on this works out… There are generously 4 million characterized “software workers” in America. That’s pretty broad and includes a lot of people who aren’t really classical engineers and don’t produce that much code. That comes out to nearly $1k per month of average Claude spend across every dev in America. Yes, there’s some international usage, but it can’t be that much. Yes, there is some non software Cowork usage, but that doesn’t use that many tokens. Yes, some non engineers are using Claude to vibe code, but I really doubt many are spending hundreds per month on it. Even if we assume 50% of all software workers are using Claude, that comes out to $2k spend per month per Claude user. That’s 10X more than the highest tier Max subscription. So almost all of Anthropic’s revenue has to be API billing. So the only explanation is that something like 20%+ of software engineers are not only Claude users but on API billing and regularly spending thousands per month. At $5/M Opus tokens, that means the average API user has to be going through something like 25 million tokens per day. *OR* the other possibility is API revenue is heavily power law dominated. Maybe there’s just something like 100k super users who are making up the majority of the revenue. For that to work, the typical super user would have to be spending on the order of $50k/month and guzzling nearly 1 billion tokens per day.
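
The per-developer figures in this estimate can be reproduced with a short calculation. This is a back-of-the-envelope sketch only: the $44B ARR, 4 million software workers, and $5 per million tokens come from the thread, while the 30-day month, the flat token price, and the helper names `monthly_spend_per_user` and `tokens_per_day` are simplifying assumptions added here.

```python
ARR = 44e9                 # annual recurring revenue quoted in the thread, in dollars
US_DEVS = 4_000_000        # "software workers" in America, per the thread
PRICE_PER_MILLION = 5.0    # dollars per million tokens, per the thread


def monthly_spend_per_user(arr, users):
    """Average monthly spend if `users` people account for all of `arr`."""
    return arr / users / 12


def tokens_per_day(monthly_spend, price_per_million=PRICE_PER_MILLION, days=30):
    """Tokens per day that a monthly spend buys at a flat price (assumed 30-day month)."""
    return monthly_spend / price_per_million * 1_000_000 / days


for share in (1.0, 0.5, 0.2):
    spend = monthly_spend_per_user(ARR, US_DEVS * share)
    print(f"{share:.0%} of devs: ${spend:,.0f}/month, "
          f"{tokens_per_day(spend) / 1e6:,.1f}M tokens/day")
```

With these inputs the first two rows land near the "$1k" and "$2k" per month figures, and the 20% row gives a daily token count in the tens of millions, the same order of magnitude as the estimate in the post.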

Tom M@TomM251385·3d

@WellPaidGeek @0xdoug Claude fast is 6x more per token.

Well Paid Geek 🚀💻 JavaScript@WellPaidGeek·3d

@TomM251385 @0xdoug Mine is $1200 a month I don’t see how people are spending this much

Doug Colkitt@0xdoug·3d

I’m really struggling to see how the back of the envelope math on this works out… There are generously 4 million characterized “software workers” in America. That’s pretty broad and includes a lot of people who aren’t really classical engineers and don’t produce that much code. That comes out to nearly $1k per month of average Claude spend across every dev in America. Yes, there’s some international usage, but it can’t be that much. Yes, there is some non software Cowork usage, but that doesn’t use that many tokens. Yes, some non engineers are using Claude to vibe code, but I really doubt many are spending hundreds per month on it. Even if we assume 50% of all software workers are using Claude, that comes out to $2k spend per month per Claude user. That’s 10X more than the highest tier Max subscription. So almost all of Anthropic’s revenue has to be API billing. So the only explanation is that something like 20%+ of software engineers are not only Claude users but on API billing and regularly spending thousands per month. At $5/M Opus tokens, that means the average API user has to be going through something like 25 million tokens per day. *OR* the other possibility is API revenue is heavily power law dominated. Maybe there’s just something like 100k super users who are making up the majority of the revenue. For that to work, the typical super user would have to be spending on the order of $50k/month and guzzling nearly 1 billion tokens per day.

Tannor Manson@Futurenvesting

Anthropic is now showing off $44 BILLION in annual recurring revenue. This is up $14 billion (+46.6%) since last month! BULLISH for AI Infrastructure $NVDA $AMD

Tom M@TomM251385·4d

@Dan_Jeffries1 The reason AI didn't eliminate radiologists isn't because the models can't be readily trained to replace them, but because radiologists have the political power to prevent it.

Daniel Jeffries@Dan_Jeffries1·4d

Jensen is one of the smartest and most far-seeing folks in the world. "If an AI scientist warns people that AI is going to permeate across radiology and radiologists are going to get wiped out, it might seem helpful but it's hurtful. If we convince everybody not to be radiologists and we now need radiologists, that actually is hurtful to society. "It is hurtful to convince all the young college graduates not to study software engineering because we are going to need more software engineers than ever. That's hurtful." "Scaring people with nonsensical things, which are not going to happen, that this is an existential threat, there's a 20% chance that it is existential, that's ridiculous. "That it's going to wipe out 50% of college level jobs. "That it is going to completely destroy democracy. "These kinds of comments are not helpful. They are made by...CEOs. And you become a CEO, maybe you adopt a God complex and somehow you know everything." Brutal. And right.

Tom M@TomM251385·5d

@ThriceRewarded @AutonomousMann @MercuriusFilius 250, transfer the remaining 3/4 to 2,3,4; 333, transfer the remaining 2/3rds to 3,4; 500, transfer the remaining half to 4; 1000. -> 1000+500+333.33+250 = 2083.33
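
The transfer schedule in this reply can be simulated directly. The puzzle statement itself is not reproduced in the thread, so this sketch assumes four identical cars that start together, each with a full tank worth 1000 units of distance, fuel that can be moved between cars, and tanks that cannot hold more than a full load. The function name `convoy_range` is made up for illustration.

```python
def convoy_range(num_cars, tank_range):
    """Distance the last car reaches when cars drop out one at a time.

    Fuel is measured in distance units. With k cars left, all tanks full,
    the convoy drives tank_range / k, then one car gives its remaining fuel
    to the others in equal shares (which refills them) and is abandoned.
    """
    fuel = [float(tank_range)] * num_cars
    total = 0.0
    while len(fuel) > 1:
        leg = tank_range / len(fuel)      # 250, then 333.33, then 500 for 4 cars
        fuel = [f - leg for f in fuel]    # every car burns fuel on this leg
        donor = fuel.pop()                # this car is left behind
        share = donor / len(fuel)
        # Top up the remaining cars, never beyond a full tank.
        fuel = [min(float(tank_range), f + share) for f in fuel]
        total += leg
    return total + fuel[0]                # last car drives on alone


print(round(convoy_range(4, 1000), 2))
```

Under these assumptions the result equals 1000 × (1 + 1/2 + 1/3 + 1/4), matching the 2083.33 in the reply. The leg length of tank_range / k is what lets each abandoned car's leftover fuel exactly refill the others with nothing wasted.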

ThriceRewarded@ThriceRewarded·5d

@AutonomousMann @MercuriusFilius I don’t think that’s right. The best you can do is three transfers, each time after driving for half of what the other three cars have, abandoning them along the way. 500 + 250 + 125 + 1000 1875, right?

Mercurius@MercuriusFilius·5d

How would you answer this common Goldman Sachs interview question?

Tom M@TomM251385·6d

@warDaniel47 The correct answer is yes. Male mice are used for ectopic pregnancy models, there is no reason a blastocyst couldn't be implanted in a man.

War Correspondent@warDaniel47·6d

🚨 HOLY CRAP. This actually just happened on Capitol Hill. SEN. JOSH HAWLEY: "Can men get pregnant?" LIBERAL DR. VERMA: "I'm not sure what the goal of the question is." HAWLEY: "The goal is to establish a biological reality. Can men get pregnant?" VERMA: "I take care of people with many identities." HAWLEY: "Can men get pregnant?" VERMA: "Again, as I'm saying-" HAWLEY: "You said science and evidence should control. Can men get pregnant? You're a doctor, I think." VERMA: "Science and evidence should guide medicine." HAWLEY: "Do science and evidence tell us that men can get pregnant?" VERMA: "I think yes-no questions like this are a political tool." WOW. 🤯🤯🤯 @HawleyMO

Tom M@TomM251385·6d

@fosbix @TheAhmadOsman Turbo is polarized which drastically reduces quantization sensitivity. The author is talking about quantizing non polar kv.

fos@fosbix·Apr 29

@TheAhmadOsman This is bullshit, I have tested the Gemma 4 and Qwen 3.5/3.6 MoEs with K=turbo4 V=turbo2 and the performance at 128k/256k context is excellent

Ahmad@TheAhmadOsman·Apr 28

I keep seeing this advice to quantize the KV cache to 4-bit and save on memory. Please don’t do that. KV cache quantization beyond FP8 usually is asking for a nerfed and incoherent model.

Tom M@TomM251385·Apr 28

@avidseries Has to do with cultural views on the importance of trying your hardest on exams (civil service exams as a primary method of advancement in cultures like China and India). If you give extrinsic rewards, then American students put in similar effort and get similar results.
