CricketInSand


New research just exposed the biggest lie in AI coding benchmarks. LLMs score 84-89% on standard coding tests. On real production code? 25-34%. That's not a gap. That's a different reality.

Here's what happened: researchers built a benchmark from actual open-source repositories: real classes with real dependencies, real type systems, real integration complexity. Then they tested the same models that dominate HumanEval leaderboards. The results were brutal.

The models weren't failing because the code was "harder." They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities. Different universe. Same leaderboard score.

But it gets worse. A separate study ran 600,000 debugging experiments across 9 LLMs. They found a bug in a program. The LLM found it too. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself. The LLM couldn't find the same bug anymore. 78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug. Function shuffling alone reduced debugging accuracy by 83%.

The models aren't reading code. They're pattern-matching against what code *looks like* in their training data.

A third study confirmed this from another angle: when researchers obfuscated real-world code, renaming symbols and restructuring it while keeping behavior identical, LLM pass rates dropped by up to 62.5%. The researchers call this the "Specialist in Familiarity" problem. LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.

Three papers. Three different methodologies. Same conclusion: the benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.
If you're shipping code generated by LLMs into production without review, these numbers should concern you. If you're building developer tools, the question isn't "what's your HumanEval score?" It's "what happens when the code doesn't look like the training data?"
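The perturbation experiments described above can be sketched in a few lines. This is a minimal illustration, not any of the papers' actual harnesses: `RenameVar` and `perturb` are hypothetical helpers built on Python's standard `ast` module (3.9+ for `ast.unparse`). The idea is simply to rename an identifier everywhere, then confirm both versions still compute the same thing, so any drop in a model's accuracy on the perturbed version is purely a familiarity effect.

```python
import ast

class RenameVar(ast.NodeTransformer):
    """Rename every occurrence of one identifier (hypothetical helper)."""
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

def perturb(source: str, old: str, new: str) -> str:
    """Apply the rename and emit source code again."""
    tree = RenameVar(old, new).visit(ast.parse(source))
    return ast.unparse(tree)

ORIGINAL = """
def total(prices):
    acc = 0
    for p in prices:
        acc += p
    return acc
"""

perturbed = perturb(ORIGINAL, "acc", "running_sum")

# Both versions must behave identically -- only the surface changed.
env_a, env_b = {}, {}
exec(ORIGINAL, env_a)
exec(perturbed, env_b)
assert env_a["total"]([1, 2, 3]) == env_b["total"]([1, 2, 3])
```

A real study would layer on more transformations of the same kind (comment insertion, function reordering) and diff the model's answers on original vs. perturbed code; the point of the sketch is that every transformation is provably behavior-preserving before it is used.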









AI productivity bubble: early adopters are already burning out. "There will be a wake up call and a reckoning for entire sectors who are adopting AI." — Natasha Bernal

There is still no productivity gain from AI in the US data.

Meanwhile, per the FT: the past year has seen an explosion in coding productivity.