
Alpen Sheth
@AlpenSheth
investing @borderless_cap alum @MIT PhD @MCSocialVenture @Etherisc @Worldbank @RMS @inuredhaiti

In our paper "Towards a Science of AI Agent Reliability" we put numbers on the capability-reliability gap. Now we're showing what's behind them! We conducted an extensive analysis of failures on GAIA across Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4. Here's what we found ⬇️

The @nytimes piece today by @ByrneEdsal13590 highlights a concern I share: “If we stay on the current path, the risk of extreme concentration — both economic and political — is very real.” In work with @zhitzig, we ask why AI may shift the balance between dispersed knowledge and centralized control.

State lawmakers introduced over 1,200 AI bills in 2025. They cover everything from deepfakes to autonomous weapons, but they're all just lumped together as "AI policy." @ARozenshtein and I wrote an article that breaks the policy landscape down along three dimensions: (1) what harm are you addressing, (2) what factors should shape the design of your intervention, and (3) which actors in the ecosystem should you target? The diagram below, for example, maps the AI ecosystem from chip manufacturers to end users.

It’s time to quit, @AnthropicAI employees. You are in over your head.

Holy shit... Google just built an AI that learns from its own mistakes in real time. New paper dropped on ReasoningBank.

The idea is pretty simple, but nobody's done it this way before. Instead of just saving chat history or raw logs, it pulls out the actual reasoning patterns, including what failed and why. Agent fails a task? It doesn't just store "task failed at step 3." It writes down which reasoning approach didn't work and what the error was, then pulls that up the next time it sees something similar.

They combine this with MaTTS (memory-aware test-time scaling), but honestly the acronym matters less than what it does: each time the model attempts something, it checks past runs and adjusts how it approaches the problem. No retraining.

Results: 34% higher success on tasks and 16% fewer interactions to complete them. That's a massive jump for something that doesn't require spinning up new training runs.

I keep thinking about how different this is from the "just make it bigger" approach. We've been stuck in this loop of adding parameters like that's the only lever. But this is more like the model gets experience. It actually remembers what worked. Kinda reminds me of when I finally stopped making the same Docker networking mistakes because I kept a note of what broke last time instead of googling the same Stack Overflow answer every 3 months.

If this actually works at scale (big if), then frozen model weights start looking really dumb in hindsight.
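
The loop is easy to sketch. Here's a toy version of the idea as I read it (not Google's code: `ReasoningBank`, `MemoryItem`, and the bag-of-words retrieval are my stand-ins; the paper distills memories with an LLM and retrieves them with embeddings):

```python
# Toy ReasoningBank-style memory loop: distill finished trajectories
# (successes AND failures) into reusable strategy notes, then retrieve
# the most similar notes before the next attempt. Illustrative only.
from collections import Counter
from dataclasses import dataclass
import math

@dataclass
class MemoryItem:
    title: str         # short name for the strategy
    description: str   # when it applies
    content: str       # distilled reasoning, including what failed and why
    from_failure: bool = False

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    # Bag-of-words cosine similarity as a cheap stand-in for embeddings.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ReasoningBank:
    def __init__(self):
        self.items = []

    def distill(self, task, trajectory, succeeded):
        """Turn a finished trajectory into a reusable memory item.
        The paper uses an LLM to judge success and extract the strategy;
        here we just record the raw material."""
        kind = "worked" if succeeded else "failed"
        self.items.append(MemoryItem(
            title=f"approach that {kind} on: {task}",
            description=task,
            content=trajectory,
            from_failure=not succeeded,
        ))

    def retrieve(self, task, k=3):
        """Return the k most similar past strategies, failures included."""
        q = _vec(task)
        ranked = sorted(self.items,
                        key=lambda m: _cosine(q, _vec(m.description)),
                        reverse=True)
        return ranked[:k]

def build_prompt(task, bank):
    """Prepend retrieved memories so the agent can avoid known dead ends."""
    notes = "\n".join(
        f"- [{'avoid' if m.from_failure else 'reuse'}] {m.title}: {m.content}"
        for m in bank.retrieve(task)
    )
    return f"Relevant past experience:\n{notes}\n\nTask: {task}"

# A failure gets remembered, then resurfaces on the next similar task.
bank = ReasoningBank()
bank.distill(
    task="log in to the admin panel",
    trajectory="clicked Sign in before the 2FA widget loaded -> timeout; wait for the widget first",
    succeeded=False,
)
print(build_prompt("log in to the billing admin panel", bank))
```

MaTTS then layers on top of this: spend extra test-time compute on several rollouts (or sequential refinements) of the same task, and use the memory bank to contrast them and keep the distilled strategy from the best one.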


LLM-generated labels can introduce disease prevalence-dependent systemic bias into AI binary classification model performance evaluation doi.org/10.1148/ryai.2… @ChavoshiSmr #LLM #LargeLanguageModels #ML
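
A quick way to see the mechanism (my toy simulation, not the paper's method): hold a classifier's true sensitivity and specificity fixed, score it against LLM-generated reference labels that are themselves imperfect, and watch the apparent metrics drift as disease prevalence changes.

```python
# Toy illustration of prevalence-dependent evaluation bias. All numbers
# (classifier and labeler sensitivity/specificity) are made up for the demo.
import random

def simulate(prevalence, n=200_000, seed=0,
             clf_sens=0.90, clf_spec=0.90,    # classifier vs. ground truth
             lab_sens=0.95, lab_spec=0.85):   # LLM labeler vs. ground truth
    rng = random.Random(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        diseased = rng.random() < prevalence
        pred = (rng.random() < clf_sens) if diseased else (rng.random() >= clf_spec)
        label = (rng.random() < lab_sens) if diseased else (rng.random() >= lab_spec)
        # The confusion matrix is computed against the noisy LLM label,
        # which is exactly what happens when LLM labels are the reference.
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)   # apparent sensitivity, specificity

for p in (0.05, 0.20, 0.50):
    sens, spec = simulate(p)
    print(f"prevalence={p:.2f}  apparent sens={sens:.3f}  apparent spec={spec:.3f}")
```

At 5% prevalence most label-positives are the labeler's own false positives, so the classifier's apparent sensitivity collapses (to roughly 0.3 here) even though its true sensitivity never moved.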


“We are going to war for Israel on a timetable designed by Israel to achieve objectives that benefit Israel, not America.” — former U.S. Marine intelligence officer Scott Ritter. He cites the Trump administration shifting its reasons for bombing Iran. “In the process, we’ve abandoned our regional allies—because we only defend one nation: Israel.” Discussion live now on The Sanchez Effect.


Amazon is holding a mandatory meeting about AI breaking its systems. The official framing is "part of normal business." The briefing note describes a trend of incidents with "high blast radius" caused by "Gen-AI assisted changes" for which "best practices and safeguards are not yet fully established."

Translation into human language: we gave AI to engineers, and things keep breaking.

The response for now? Junior and mid-level engineers can no longer push AI-assisted code without a senior signing off.

Case in point: AWS spent 13 hours recovering after its own AI coding tool, asked to make some changes, decided instead to delete and recreate the environment (the software equivalent of fixing a leaky tap by knocking down the wall). Amazon called that an "extremely limited event" (the affected tool served customers in mainland China).
