Sir Digby Chicken Caesar

15.4K posts

@timpac

Principled Conservative Michigan outdoorsy-dad.

Joined May 2007
1.8K Following · 445 Followers
Sir Digby Chicken Caesar
@AlexBerenson This particular problem apparently has been addressed. Nevertheless, this *type* of failure is one I would expect from AI basically forever, because it doesn't really think.
Sir Digby Chicken Caesar tweet media
Alex Berenson
Alex Berenson@AlexBerenson·
I’m starting to feel like the limitations and the strengths of AI are two sides of the same coin; AI is a great mimic and pattern recognizer, so it has no problem validating the person using it (and coding, which is a highly structured task). But the unexpected breaks it easily.
Nav Toor@heynavtoor

🚨SHOCKING: Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves. And the way they proved it is devastating.

Apple researchers took the most popular math benchmark in AI — GSM8K, a set of grade-school math problems — and made one change. They swapped the numbers. Same problem. Same logic. Same steps. Different numbers. Every model's performance dropped. Every single one. 25 state-of-the-art models tested.

But that wasn't the real experiment. The real experiment broke everything. They added one sentence to a math problem. One sentence that is completely irrelevant to the answer. It has nothing to do with the math. A human would read it and ignore it instantly.

Here's the actual example from the paper: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"

The correct answer is 190. The size of the kiwis has nothing to do with the count. A 10-year-old would ignore "five of them were a bit smaller" because it's obviously irrelevant. It doesn't change how many kiwis there are.

But o1-mini, OpenAI's reasoning model, subtracted 5. It got 185. Llama did the same thing. Subtracted 5. Got 185. They didn't reason through the problem. They saw the number 5, saw a sentence that sounded like it mattered, and blindly turned it into a subtraction. The models do not understand what subtraction means. They see a pattern that looks like subtraction and apply it. That is all.

Apple tested this across all models. They call the dataset "GSM-NoOp" — as in, the added clause is a no-operation. It does nothing. It changes nothing.

The results are catastrophic. Phi-3-mini dropped over 65%. More than half of its "math ability" vanished from one irrelevant sentence. GPT-4o dropped from 94.9% to 63.1%. o1-mini dropped from 94.5% to 66.0%. o1-preview, OpenAI's most advanced reasoning model at the time, dropped from 92.7% to 77.4%.

Even giving the models 8 examples of the exact same question beforehand, with the correct solution shown each time, barely helped. The models still fell for the irrelevant clause. This means it's not a prompting problem. It's not a context problem. It's structural.

The Apple researchers also found that models convert words into math operations without understanding what those words mean. They see the word "discount" and multiply. They see a number near the word "smaller" and subtract. Regardless of whether it makes any sense.

The paper's exact words: "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data." And: "LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts."

They also tested what happens when you increase the number of steps in a problem. Performance didn't just decrease. The rate of decrease accelerated. Adding two extra clauses to a problem dropped Gemma2-9b from 84.4% to 41.8%. Phi-3.5-mini from 87.6% to 44.8%. The more thinking required, the more the models collapse. A real reasoner would slow down and work through it. These models don't slow down. They pattern-match. And when the pattern becomes complex enough, they crash.

This paper was published at ICLR 2025, one of the most prestigious AI conferences in the world.
You are using AI to help you make financial decisions. To check legal documents. To solve problems at work. To help your children with homework. And Apple just proved that the AI is not thinking about any of it. It is pattern matching. And the moment something unexpected shows up in your question, it breaks. It does not tell you it broke. It just quietly gives you the wrong answer with full confidence.
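For concreteness, a minimal Python sketch of the kiwi arithmetic described in the quoted thread; the variable names and comments are illustrative and not taken from the Apple paper. The irrelevant "five were smaller" clause changes nothing, so the correct total stays 190, while the pattern-matched failure mode subtracts 5 and lands on 185.

```python
# Illustrative sketch of the GSM-NoOp kiwi example from the thread above
# (not code from the Apple paper). The "no-op" clause about five smaller
# kiwis is irrelevant: it does not change how many kiwis Oliver has.

friday = 44            # kiwis picked on Friday
saturday = 58          # kiwis picked on Saturday
sunday = 2 * friday    # Sunday: double the Friday count -> 88

correct_total = friday + saturday + sunday   # 44 + 58 + 88 = 190
assert correct_total == 190

# The failure mode the thread describes: a number sitting next to the word
# "smaller" gets blindly turned into a subtraction, giving 185 instead of 190.
pattern_matched_answer = correct_total - 5   # 185 (wrong)

print(correct_total, pattern_matched_answer)  # 190 185
```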

Lei Gong
Lei Gong@gonglei89·
See, not actual intelligence.
Nav Toor@heynavtoor


Nav Toor
Nav Toor@heynavtoor·
Nav Toor tweet media
Sir Digby Chicken Caesar
This seems to have been fixed.
Sir Digby Chicken Caesar tweet media
Nav Toor@heynavtoor


Sir Digby Chicken Caesar
@Smith_WessonInc Compare the number of people who die because they don't have time to rack the slide to the number of people who have an accidental discharge. Carrying one in the chamber is dumb.
Washingtons ghost
Washingtons ghost@washghost1·
This isn’t a burger. This is the stuff you clean off of the grill before you make a burger
Sir Digby Chicken Caesar
@owroot Nice job getting a retweet from Douthat. As a Michigander I considered you a local account; I'm impressed. 😂
O.W. Root
O.W. Root@owroot·
People fail to grasp how powerful and great this machine of America is. It's like they think it is some tiny country in Eastern Europe with a population of 2 million and a GDP less than that of Dallas. People don't understand what it is that they are even living in. They don't know how powerful these gears are, how long the game is, how big the purpose is, how great the ship is.
Van Ike
Van Ike@vanikehuman·
@timpac @xwanyex Yep those same arguments are in the footnotes of the main Bible used in the Catholic Church in America.
GorgeousGlimpse
GorgeousGlimpse@gorgeous0017·
@Variety She's not wrong, but context matters. Some jokes were retired because they were genuinely harmful, not just uncomfortable. The challenge is distinguishing between discomfort that creates insight and discomfort that just hurts people.
GorgeousGlimpse tweet media
Variety
Variety@Variety·
"Friends" star Lisa Kudrow says new sitcoms are “too afraid” to make jokes that make audiences “uncomfortable”: "But I’m not drawn to new sitcoms that are multi-camera in front of an audience because I’m not buying it. I don’t know if that’s just because I’ve seen too many single-camera sitcoms—I think we need to get back to being able to tell jokes. I feel like we’ve been too afraid to make jokes that might make people uncomfortable.” variety.com/2026/tv/news/l…
Variety tweet media
Sir Digby Chicken Caesar retweeted
James Lindsay, anti-Communist
James Lindsay, anti-Communist@ConceptualJames·
The Babylon Bee is almost always really good, but about once every two weeks or so it achieves perfection.
James Lindsay, anti-Communist tweet media
Sir Digby Chicken Caesar retweeted
Han Shawnity 🇺🇸
Han Shawnity 🇺🇸@HanShawnity·
You don't understand. Tucker Carlson thinks demons run America, we took Maduro to make Venezuela gay, we attacked Iran to rape their women, Russia is prettier than America, Qatar is better in virtually every aspect, shariah law is preferable to American law, people who chant "death to America" are the good guys, we were the bad guys in WWII, and terrorists aren't really terrorists, but he's really America First. You don't get it? Well neither do I, but who are we to question the guy who so bravely and singlehandedly fought off a demon in his sleep.
Erick Erickson
Erick Erickson@EWErickson·
I’m gonna need to call a moratorium on using the word “decimated” until people understand how to use it and stop using it as a substitute for “destroyed.”
J. Daniel Sawyer
J. Daniel Sawyer@dsawyer·
Depending on your criteria for "American Tolkien" you have a few choices:
ER Burroughs, for foundational influence and epic storytelling.
Frank Herbert, for thoroughness of worldbuilding and doing a political fantasy with American sensibilities (suspicion of authority, depth of irony, etc.).
Mark Twain or James Fenimore Cooper, for the first quintessentially American quest tales.
Robert A. Heinlein or Ray Bradbury, for translating the American epic (the Western) into fantasy settings in a definitive fashion.
HP Lovecraft, for giving a uniquely American voice to fantasy metaphysics.
George Lucas, for combining all of the above into a tale that influences the American storytelling consciousness for generations in the same way Tolkien influenced the English storytelling consciousness.
Brain Leakage@BrainLeakage03

My contention is that the “American Tolkien” isn’t someone like GRRM or Robert Jordan. They’re essentially telling European stories, dealing with matters of rightful kings and chosen ones. The American Tolkien is Burroughs, and his LOTR is the Mars series. A Confederate veteran travels to a savage world that resembles a fantastical version of the Arizona desert. He must learn the ways of the locals, master the wilderness, and carve out a prosperous life for himself through grit, bravery, and sheer exceptionalism. It doesn’t get more American than that.

Sir Digby Chicken Caesar retweeted
Fandom Pulse
Fandom Pulse@fandompulse·
HBO's Carnivàle creator and Blacklist EP and writer Daniel Knauf on Star Trek: Starfleet Academy: "They really have no idea why people watch Star Trek. If they did, this show never would have progressed past the pitch phase." Why was this show greenlit?
Fandom Pulse tweet media
Sir Digby Chicken Caesar
I can't think of a single instance where cheddar would be my preference. I don't mind it, but I prefer Swiss on sandwiches, blue or aged Gouda for snacking. It's good mixed into mac and cheese with Gruyère; I guess that's the only occasion where it would be hard to replace.
Carl@HistoryBoomer

Best cheese
