Tom M

185 posts

@TomM251385

Joined November 2025
38 Following 20 Followers
Tom M
Tom M@TomM251385·
@Kasparov63 They do it by having 5-9% of Mississippi students repeat the 3rd grade, vs. 1% of California students. So it isn't a meaningful comparison. Compare the bottom 8% of 5th grade students in California to the bottom 8% of 4th grade students in Mississippi.
English
0
0
0
19
Garry Kasparov
Garry Kasparov@Kasparov63·
Indeed. The mantra to "follow the science" breaks down on the left as soon as it conflicts with their dogma.
Tom Elliott@tomselliott

NYT's @NickKristof to fellow progressives: "A black kid in Mississippi is 2.5 times as likely to be proficient in math & reading by 4th grade as a black kid in Calif. Do we need to look a little bit less at what the Trump Admin is doing ... & look a little more in the mirror?"

English
20
70
662
35.8K
Tom M
Tom M@TomM251385·
@JLARSONPDX @Sakshi26133806 @tomselliott @NickKristof NAEP does age-based sampling: if you hold back the worst performers one year, they are too old to take the test when they enter the 4th grade the next year, and are excluded from the sample.
English
1
0
0
26
Tom Elliott
Tom Elliott@tomselliott·
NYT's @NickKristof to fellow progressives: "A black kid in Mississippi is 2.5 times as likely to be proficient in math & reading by 4th grade as a black kid in Calif. Do we need to look a little bit less at what the Trump Admin is doing ... & look a little more in the mirror?"
English
262
2.1K
14.8K
938.9K
Tom M
Tom M@TomM251385·
@nwbvt @tomselliott @NickKristof No, it isn't apples to apples, since the goal is educating students. If a lot of students take an extra year or two to achieve the same target proficiency, it does not mean you are doing a good job educating. The claim was implying Mississippi is doing better at educating.
English
2
0
1
59
Nick Brown
Nick Brown@nwbvt·
@TomM251385 @tomselliott @NickKristof No, it's already apples to apples. They are all at the same grade level. Giving struggling kids an extra year of schooling is one of the positive things they are doing.
English
1
0
0
65
Tom M
Tom M@TomM251385·
@nwbvt @tomselliott @NickKristof I'm not arguing that they shouldn't be held back. I'm pointing out that Mississippi has a 6.5-9% 3rd grade retention rate and California has 1%. If you excluded the worst-performing 5.5-8% of California 4th graders, then it would be apples to apples.
English
2
0
1
121
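The adjustment described in the post above (dropping the lowest-scoring share of one state's test-takers before comparing) can be sketched numerically. This is only a rough illustration, not real NAEP data: the 20% starting proficiency rate is a made-up placeholder, `adjusted_rate` is a hypothetical helper, and the sketch assumes every excluded student would have scored below proficient.

```python
def adjusted_rate(proficient_share, excluded_share):
    """Proficiency rate after removing the lowest-scoring `excluded_share`
    of test-takers, assuming none of the removed students were proficient."""
    if not 0 <= excluded_share < 1:
        raise ValueError("excluded_share must be in [0, 1)")
    return proficient_share / (1 - excluded_share)


# Hypothetical starting point: 20% of all 4th graders proficient (placeholder, not NAEP data).
baseline = 0.20

# Gap between the two retention rates quoted in the thread: 5.5% to 8% of students.
for excluded in (0.055, 0.08):
    rate = adjusted_rate(baseline, excluded)
    print(f"exclude bottom {excluded:.1%}: {rate:.1%} proficient")
```

Under these assumptions the rate rises only because the denominator shrinks; the size of the shift depends entirely on the placeholder baseline and on how many excluded students were actually below the cutoff.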
Nick Brown
Nick Brown@nwbvt·
@TomM251385 @tomselliott @NickKristof Kids who haven't met learning milestones *should* be held back. Giving struggling kids an extra year of public education isn't "gaming the system", it's exactly how the system is supposed to be used.
English
1
0
2
126
Tom M
Tom M@TomM251385·
@loganrobinson @tomselliott @NickKristof The goal is presumably percentage of children who attain adult literacy and numeracy. If you hold back all poor performing kids in grade school till they drop out or you expel them you can really boost the test scores and high school graduation rate while being awful educators.
English
1
0
0
125
Logan Robinson (露雁 呂敏尊)
@TomM251385 @tomselliott @NickKristof But shouldn't the purpose of the school be to graduate kids who actually learned the material? Which system is better: the one that "graduates" 18 year olds who can't read or the one who graduates 19 year olds who can? Mississippi is an American success story.
English
1
0
3
163
Tom M
Tom M@TomM251385·
@Sakshi26133806 @tomselliott @NickKristof If you only test kids who are guaranteed to be able to pass the test, you can easily get a nearly 100% pass rate; it doesn't mean you are good at educating. Also, retention is automatic, so there is no guarantee they will ever take the test (they might stay in 3rd grade until dropping out).
English
2
0
0
132
Sakshi
Sakshi@Sakshi26133806·
@TomM251385 @tomselliott @NickKristof So- they’re holding them back because they’re not good enough to make it through the next year- how is it bad? Or gaming the system? They take the test the next year?
English
1
0
4
173
Tom M
Tom M@TomM251385·
@PatrickC1995 The argument isn't about labor, but the network effects and infrastructure requirements and frameworks required for any business to succeed. Amazon could never have been created without huge tax expenditures on roads, a robust USPS, a massively subsidized car industry, etc.
English
2
0
0
1.3K
Patrick Carroll
Patrick Carroll@PatrickC1995·
One of the biggest mental blocks of the left is their inability to think about value creation in any way other than manual labor. A good counter-example is the professional athlete, or pop-star. Who did Taylor Swift exploit to become a billionaire?
𐌁𐌉Ᏽ 𐌕𐌉𐌌𐌉@OrevaZSN

No one “earns” a billion dollars. No one can work a billion times harder than anyone else. There is no good, ethical, or righteous billionaire. That kind of wealth can only be accumulated through the exploitation of the working class. No exceptions.

English
295
335
6.2K
502.8K
Tom M
Tom M@TomM251385·
@HeMuyu0327 Try also just providing the original position information.
English
0
0
1
143
Muyu He
Muyu He@HeMuyu0327·
I am now highly skeptical of the claim that adding the token embedding to deeper layers improves the model by "preserving the original token information", and think that the reason it improves at all is much simpler.

How the hypothesis was made. It was proposed in the Value Residual Learning paper based on the **fact** that if you add the first value vector v1 / token embedding x0 to deeper layers' value vector / residual stream with equal weight (0.5 * v1 + 0.5 * v), the model's validation loss improves significantly. And we later found that adding **any** linear transformation of x0 helps just as much.

Ablation setup. If the model truly improves because deeper layers have access to the original x0 information, then this ablation should not change the model performance: killing the gradient of x0 when it is added to a subsequent layer in this extra path. Since x0 (and the embedding layer) will receive regular updates via the standard computation path, x0 will always be able to supply token information to deeper layers, and deep layers' attention modules can learn to use it properly. Therefore, for both adding x0 to later x and adding a linear transformation of x0 to later v, we run an ablation that detaches/kills x0's gradient during the forward pass.

Experiment result. In both ablations, we find that most of the improvement is gone. Although the model has access to perfectly valid x0 information in deeper layers and can update attn/MLP weights to utilize it, it never recovers most of the benefits we see in the baseline. This seems to suggest that "value residual learning" mostly (not entirely) works not because valuable x0 info is passed down, but because there is some benefit to **the embedding layer** from adding x0 to deeper layers.

There might be two ways the embedding layer can benefit. One is pure gradient benefits: value residual learning is some advanced form of residual connection that handles vanishing gradients better. Need to do some math to see if this holds. The other is that the forward pass set up in this way updates the embedding space in a meaningful way, so tokens can have a more optimal representation.

Next up will want to do ablations to test both hypotheses. And of course I might just have missed something simple.
English
9
12
146
9.2K
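The ablation described in the post above (same forward value, but no gradient through the extra x0 path) can be illustrated with a toy scalar model. This is a minimal sketch under loud assumptions, not the author's experiment: the "embedding" is a single number, the deep layer is a hypothetical linear map with weight `a = 0.8`, the helper names are invented, and the stop-gradient is emulated by freezing the skip path at its forward value while taking a finite-difference derivative.

```python
def deep_output(e, skip, a=0.8):
    """Toy deep layer: equal-weight mix of the normal path (a * e) and the extra skip path."""
    return 0.5 * (a * e) + 0.5 * skip


def grad_wrt_embedding(e, detach_skip, eps=1e-6):
    """Central finite-difference gradient of the output with respect to the embedding e.

    When detach_skip is True the skip path is frozen at its forward value, so it
    behaves as a constant under differentiation (which is what a stop-gradient does).
    """
    def f(x):
        skip = e if detach_skip else x
        return deep_output(x, skip)

    return (f(e + eps) - f(e - eps)) / (2 * eps)


e = 1.5  # hypothetical scalar "embedding"

# The forward value is identical with or without the detach: x0 is still available downstream.
forward_value = deep_output(e, e)

# Baseline: gradient reaches the embedding through both the normal path and the skip path.
grad_baseline = grad_wrt_embedding(e, detach_skip=False)

# Ablation: gradient reaches the embedding through the normal path only.
grad_ablation = grad_wrt_embedding(e, detach_skip=True)

print(forward_value, grad_baseline, grad_ablation)
```

In this toy the deeper layer sees the same x0 either way, and the only thing the detach changes is how much gradient the embedding receives, which is the distinction the ablation is designed to isolate.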
Tom M
Tom M@TomM251385·
@AnthonyGSupreme They immediately began using it for explosives, grenades, incendiary arrows, etc.
English
0
0
2
874
Anthony
Anthony@AnthonyGSupreme·
This debunks White people's claims of "Everyone would do what we did if they were in our position"...First thing China did when they discovered gunpowder was create fireworks...first thing whites did was create arms to kill each other and the rest of the world lol
David Shapiro (L/0)@DaveShapi

For context, China had the technology for gunpowder, printing, and blue ocean exploration hundreds of years before Europe. They failed to use it. That goes beyond a generational fumble. That is a self inflicted civilizational fumble that China still hasn't made up for.

English
93
3.5K
20.1K
290.8K
Tom M
Tom M@TomM251385·
@0xdoug Lots of corporate devs use the API with fast path. That is about $10k-20k a month for a high-productivity developer (1.5-3x token usage and 6x cost per token). So just corporate devs on the fast API, $40B/($120k-240k per year), would only require about 200k devs, or 5% of total US developers.
English
0
0
1
58
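The division in the post above can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted in the thread ($40B of annual revenue, $10k-20k per developer per month, roughly 4 million US software workers); `devs_needed` is a hypothetical helper name, and none of the inputs are independently verified.

```python
annual_revenue = 40e9              # $40B, the round figure used in the post
monthly_spend = (10_000, 20_000)   # $10k-$20k per developer per month, as quoted
us_developers = 4_000_000          # "generously 4 million" from the quoted thread


def devs_needed(revenue, spend_per_month):
    """Number of developers needed to generate `revenue` per year at a given monthly spend."""
    return revenue / (spend_per_month * 12)


# Higher spend per developer means fewer developers are needed, and vice versa.
fewest = devs_needed(annual_revenue, monthly_spend[1])
most = devs_needed(annual_revenue, monthly_spend[0])

print(f"{fewest:,.0f} to {most:,.0f} developers")
print(f"{fewest / us_developers:.1%} to {most / us_developers:.1%} of US developers")
```

The range brackets the post's round figure of about 200k developers, or 5% of the US total.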
Doug Colkitt
Doug Colkitt@0xdoug·
A lot of people objected to, and were even outraged by, restricting the revenue analysis to American software workers. But let’s think about how much Anthropic revenue could be international. Excluding China and Russia (Claude not available), the US makes up about 20% of global devs. But one thing to keep in mind is, to get to $44 billion ARR, the only thing that really matters is devs paying $1k/month. Even Max subscribers don’t move the needle at that scale. Realistically, devs outside America aren’t expensive enough for this to make financial sense. Median software engineer salary in India (largest international dev market outside US and China) is $30k. Is it reasonable to believe that Indian firms are casually accepting a 50% cost hike on their software engineers to give them API scale token budgets? Again, I believe many, if not most, Indian devs are using Claude. But (like myself) my guess is the vast majority are using the much more cost effective subscription and managing limits. Same story in most other countries. Median software engineer salary in Europe is $60k, in Japan $50k, in Brazil $30k, in Britain $65k. I’m not saying there is zero API usage in these countries, but I think it’s unrealistic to think these countries have anywhere near the casual API spend that US software engineers have. I think a conservative upper bound on international API revenue is maybe 50%. If we go back to US devs being 20% of the global market, then assuming they have 2X Claude spend intensity seems reasonable. So at most, the original analysis has a 2X on the denominator. Doesn’t really fundamentally change anything in the order of magnitudes. To the extent software work is driving the Anthropic ARR, either near majorities of devs must be spending thousands per month, or there must be significant super users who are blowing through tens or even hundreds of thousands per month.
Doug Colkitt@0xdoug

I’m really struggling to see how the back of the envelope math on this works out… There are generously 4 million characterized “software workers” in America. That’s pretty broad and includes a lot of people who aren’t really classical engineers and don’t produce that much code. That comes out to nearly $1k per month of average Claude spend across every dev in America. Yes, there’s some international usage, but it can’t be that much. Yes, there is some non-software Cowork usage, but that doesn’t use that many tokens. Yes, some non-engineers are using Claude to vibe code, but I really doubt many are spending hundreds per month on it. Even if we assume 50% of all software workers are using Claude, that comes out to $2k spend per month per Claude user. That’s 10X more than the highest tier Max subscription. So almost all of Anthropic’s revenue has to be API billing. So the only explanation is that something like 20%+ of software engineers are not only Claude users but on API billing and regularly spending thousands per month. At $5/m Opus tokens that means the average API user has to be going through something like 25 million tokens per day. *OR* the other possibility is API revenue is heavily power law dominated. Maybe there’s just something like 100k super users who are making up the majority of the revenue. For that to work the typical super user would have to be spending on the order of $50k/month and guzzling nearly 1 billion tokens per day.

English
22
0
27
9.4K
Doug Colkitt
Doug Colkitt@0xdoug·
I’m really struggling to see how the back of the envelope math on this works out… There are generously 4 million characterized “software workers” in America. That’s pretty broad and includes a lot of people who aren’t really classical engineers and don’t produce that much code. That comes out to nearly $1k per month of average Claude spend across every dev in America. Yes, there’s some international usage, but it can’t be that much. Yes, there is some non-software Cowork usage, but that doesn’t use that many tokens. Yes, some non-engineers are using Claude to vibe code, but I really doubt many are spending hundreds per month on it. Even if we assume 50% of all software workers are using Claude, that comes out to $2k spend per month per Claude user. That’s 10X more than the highest tier Max subscription. So almost all of Anthropic’s revenue has to be API billing. So the only explanation is that something like 20%+ of software engineers are not only Claude users but on API billing and regularly spending thousands per month. At $5/m Opus tokens that means the average API user has to be going through something like 25 million tokens per day. *OR* the other possibility is API revenue is heavily power law dominated. Maybe there’s just something like 100k super users who are making up the majority of the revenue. For that to work the typical super user would have to be spending on the order of $50k/month and guzzling nearly 1 billion tokens per day.
Tannor Manson@Futurenvesting

Anthropic is now showing off $44 BILLION in annual recurring revenue. This is up $14 billion (+46.6%) since last month! BULLISH for AI Infrastructure $NVDA $AMD

English
291
20
492
492.3K
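The back-of-the-envelope figures in the post above can be recomputed directly. This is a rough sketch built only from numbers quoted in the thread ($44B ARR, 4 million US software workers, $5 per million tokens); the flat token price, the 30-day month, and the helper names are simplifying assumptions, so the outputs are order-of-magnitude checks and need not match the post's rounded figures.

```python
ARR = 44e9                        # annual recurring revenue quoted in the thread
US_SOFTWARE_WORKERS = 4_000_000   # "generously 4 million"
PRICE_PER_MILLION_TOKENS = 5.0    # assumption: flat $5 per million tokens
DAYS_PER_MONTH = 30               # assumption: simple 30-day month


def monthly_spend_per_user(arr, users):
    """Average dollars per user per month needed to reach `arr` per year."""
    return arr / users / 12


def tokens_per_day(monthly_dollars):
    """Tokens per day bought by a monthly budget at the flat assumed price."""
    return monthly_dollars / PRICE_PER_MILLION_TOKENS * 1e6 / DAYS_PER_MONTH


every_dev = monthly_spend_per_user(ARR, US_SOFTWARE_WORKERS)
half_of_devs = monthly_spend_per_user(ARR, 0.5 * US_SOFTWARE_WORKERS)
api_fifth = monthly_spend_per_user(ARR, 0.2 * US_SOFTWARE_WORKERS)

print(f"all devs:    ${every_dev:,.0f}/month")
print(f"50% of devs: ${half_of_devs:,.0f}/month")
print(f"20% of devs: ${api_fifth:,.0f}/month, "
      f"{tokens_per_day(api_fifth) / 1e6:,.1f}M tokens/day")
print(f"$50k/month super user: {tokens_per_day(50_000) / 1e6:,.0f}M tokens/day")
```

The per-developer spend lines reproduce the post's "nearly $1k" and "$2k" figures; the token counts are more sensitive to the assumed price per token, so they are best read as rough magnitudes.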
Tom M
Tom M@TomM251385·
@Dan_Jeffries1 The reason AI didn't eliminate radiologists isn't because the models can't be readily trained to replace them, but because radiologists have the political power to prevent it.
English
0
0
0
894
Daniel Jeffries
Daniel Jeffries@Dan_Jeffries1·
Jensen is one of the smartest and most far-seeing folks in the world. "If an AI scientist warns people that AI is going to permeate across radiology and radiologists are going to get wiped out, it might seem helpful but it's hurtful. If we convince everybody not to be radiologists and we now need radiologists, that actually is hurtful to society. "It is hurtful to convince all the young college graduates not to study software engineering because we are going to need more software engineers than ever. That's hurtful." "Scaring people with nonsensical things, which are not going to happen, that this is an existential threat, there's a 20% chance that it is existential, that's ridiculous. "That it's going to wipe out 50% of college level jobs. "That it is going to completely destroy democracy. "These kinds of comments are not helpful. They are made by...CEOs. And you become a CEO, maybe you adopt a God complex and somehow you know everything." Brutal. And right.
English
249
821
5.3K
842K
ThriceRewarded
ThriceRewarded@ThriceRewarded·
@AutonomousMann @MercuriusFilius I don’t think that’s right. The best you can do is three transfers, each time after driving for half of what the other three cars have, abandoning them along the way. 500 + 250 + 125 + 1000 = 1875, right?
English
6
0
3
12.1K
Mercurius
Mercurius@MercuriusFilius·
How would you answer this common Goldman Sachs interview question?
Mercurius tweet media
English
442
11
369
998.1K
Tom M
Tom M@TomM251385·
@warDaniel47 The correct answer is yes. Male mice are used for ectopic pregnancy models, there is no reason a blastocyst couldn't be implanted in a man.
English
1
0
1
411
War Correspondent
War Correspondent@warDaniel47·
🚨 HOLY CRAP. This actually just happened on Capitol Hill. SEN. JOSH HAWLEY: "Can men get pregnant?" LIBERAL DR. VERMA: "I'm not sure what the goal of the question is." HAWLEY: "The goal is to establish a biological reality. Can men get pregnant?" VERMA: "I take care of people with many identities." HAWLEY: "Can men get pregnant?" VERMA: "Again, as I'm saying-" HAWLEY: "You said science and evidence should control. Can men get pregnant? You're a doctor, I think." VERMA: "Science and evidence should guide medicine." HAWLEY: "Do science and evidence tell us that men can get pregnant?" VERMA: "I think yes-no questions like this are a political tool." WOW. 🤯🤯🤯 @HawleyMO
English
2.6K
5.8K
27.9K
1.4M
Tom M
Tom M@TomM251385·
@fosbix @TheAhmadOsman Turbo is polarized which drastically reduces quantization sensitivity. The author is talking about quantizing non polar kv.
English
1
0
0
28
fos
fos@fosbix·
@TheAhmadOsman This is bullshit, I have tested the Gemma 4 and Qwen 3.5/3.6 MoEs with K=turbo4 V=turbo2 and the performance at 128k/256k context is excellent
English
1
0
3
415
Ahmad
Ahmad@TheAhmadOsman·
I keep seeing this advice to quantize the KV cache to 4-bit and save on memory. Please don’t do that. KV cache quantization beyond FP8 usually is asking for a nerfed and incoherent model.
English
34
11
267
23.3K
Tom M
Tom M@TomM251385·
@avidseries Has to do with cultural views on the importance of trying your hardest on exams (civil service exams as a primary method of advancement in cultures like China and India). If you give extrinsic rewards, then American students put in similar effort and get similar results.
English
0
0
1
1.7K
i/o
i/o@avidseries·
Is low-income recent-immigrant Chinese kids still in the process of learning English easily outscoring middle-income black kids who grew up speaking it — is that an example of diversity being our greatest strength or an example of something we're not supposed to talk about?
English
76
90
1.9K
52.4K