Gennaro (@fourweekmba) - Twitter profile

Pinned tweet

Gennaro@fourweekmba·7 Sep

The AI Value Chain businessengineer.ai/p/the-ai-value…

Gennaro@fourweekmba·1h

@sundeep An expensive one

sunny madra@sundeep·2h

2022 was a vibe.

Polymarket@Polymarket

JUST IN: Justin Bieber’s Bored Ape NFT — which he purchased for $1.3 million in 2022 — is now valued at just $12,000.

Gennaro@fourweekmba·1h

The harness

Alex Prompter@alex_prompter

Holy shit. Stanford just showed that the biggest performance gap in AI systems isn't the model, it's the harness: the code wrapping the model. And they built a system that writes better harnesses automatically than humans can by hand.

> +7.7 points. 4x fewer tokens.
> #1 ranking on an actively contested benchmark.

The harness is the code that decides what information an AI model sees at each step: what to store, what to retrieve, what context to show. Changing the harness around a fixed model can produce a 6x performance gap on the same benchmark. Most practitioners know this empirically. What nobody had done was automate the process of finding better harnesses.

Stanford's Meta-Harness does exactly that: it runs a coding agent in a loop, gives it access to every prior harness it has tried along with the full execution traces and scores, and lets it propose better ones. The agent reads raw code and failure logs (not summaries, not scalar scores) and figures out why things broke.

The key insight is about information. Every prior automated optimization method compressed feedback before handing it to the optimizer.

> Scalar scores only.
> LLM-generated summaries.
> Short templates.

Stanford's finding is that this compression destroys exactly the signal you need for harness engineering. A single design choice about what to store in memory can cascade through hundreds of downstream steps. You cannot debug that from a summary.

Meta-Harness gives the proposer a filesystem containing every prior harness's source code, execution traces, and scores (up to 10 million tokens of diagnostic information per evaluation) and lets it use grep and cat to read whatever it needs. Prior methods worked with 100 to 30,000 tokens of feedback. Meta-Harness works with 3 orders of magnitude more.

The TerminalBench-2 search trajectory reveals what this actually looks like in practice. The agent ran for 10 iterations on an actively contested coding benchmark. In iterations 1 and 2, it bundled structural fixes with prompt rewrites, and both regressed. In iteration 3, it explicitly identified the confound: the prompt changes were the common failure factor, not the structural fixes. It isolated the structural changes, tested them alone, and observed the smallest regression yet. Over the next 4 iterations it kept probing why completion-flow edits were fragile, citing specific tasks and turn counts from prior traces as evidence. By iteration 7 it pivoted entirely: instead of modifying the control loop, it added a single environment snapshot before the agent starts, gathering what tools and languages are available in one shell command. That 80-line additive change became the best candidate in the run and ranked #1 among all Haiku 4.5 agents on the benchmark.

The numbers across all three domains:

→ Text classification vs best hand-designed harness (ACE): +7.7 points accuracy, 4x fewer context tokens
→ Text classification vs best automated optimizer (OpenEvolve, TTT-Discover): matches their final performance in 4 evaluations vs their 60, then surpasses by 10+ points
→ Full interface vs scores-only ablation: median accuracy 50.0 vs 34.6; raw execution traces are the critical ingredient, and summaries don't recover the gap
→ IMO-level math: +4.7 points average across 5 held-out models that were never seen during search
→ IMO math: discovered retrieval harness transfers across GPT-5.4-nano, GPT-5.4-mini, Gemini-3.1-Flash-Lite, Gemini-3-Flash, and GPT-OSS-20B
→ TerminalBench-2 with Haiku 4.5: 37.6%, #1 among all reported Haiku 4.5 agents, beating Goose (35.5%) and Terminus-KIRA (33.7%)
→ TerminalBench-2 with Opus 4.6: 76.4%, #2 overall, beating all hand-engineered agents except one whose result couldn't be reproduced from public code
→ Out-of-distribution text classification on 9 unseen datasets: 73.1% average vs ACE's 70.2%

The math harness discovery is the cleanest demonstration of what automated search actually finds. Stanford gave Meta-Harness a corpus of 535,000 solved math problems and told it to find a better retrieval strategy for IMO-level problems. What emerged after 40 iterations was a four-route lexical router: combinatorics problems get deduplicated BM25 with difficulty reranking, geometry problems get one hard reference plus two raw BM25 neighbors, number theory gets reranked toward solutions that state their technique early, and everything else gets adaptive retrieval based on how concentrated the top scores are. Nobody designed this. The agent discovered that different problem types need different retrieval policies by reading through failure traces and iterating on what broke.

The ablation table is the most important result in the paper.

> Scores only: median 34.6, best 41.3.
> Scores plus LLM-generated summary: median 34.9, best 38.7.
> Full execution traces: median 50.0, best 56.7.

Summaries barely moved the median (34.9 vs 34.6) and lowered the best score (38.7 vs 41.3). The raw traces (the actual prompts, tool calls, model outputs, and state updates from every prior run) are what drive the improvement. This is not a marginal difference. The full interface outperforms the compressed interface by about 15 points at median. Harness engineering requires debugging causal chains across hundreds of steps. You cannot compress that signal.

The model has been the focus of the entire AI industry for the last five years. Stanford just showed the wrapper around the model matters just as much, and that AI can now write better wrappers than humans can.
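
The loop described in the post (propose a harness, evaluate it, keep the raw source, trace, and score on disk, then let the proposer search those files) can be illustrated with a toy sketch. This is not the Meta-Harness code: the directory layout, the function names (`record_candidate`, `grep_archive`, `search`), and the `toy_propose` / `toy_evaluate` stand-ins, including the `snapshot_environment()` string, are all assumptions made for illustration. A real proposer would be a coding agent rather than a hard-coded rule.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def record_candidate(archive: Path, index: int, source: str,
                     trace: List[str], score: float) -> Path:
    """Store one candidate's raw artifacts in its own directory (assumed layout)."""
    run_dir = archive / f"candidate_{index:03d}"
    run_dir.mkdir(parents=True)
    (run_dir / "harness.py").write_text(source)
    (run_dir / "trace.log").write_text("\n".join(trace))
    (run_dir / "score.json").write_text(json.dumps({"score": score}))
    return run_dir


def grep_archive(archive: Path, needle: str) -> List[Tuple[str, str]]:
    """Substring search over every stored file, a crude stand-in for grep -r."""
    hits = []
    for path in sorted(archive.rglob("*")):
        if path.is_file():
            for line in path.read_text().splitlines():
                if needle in line:
                    hits.append((path.relative_to(archive).as_posix(), line))
    return hits


def search(archive: Path,
           propose: Callable[[Path], str],
           evaluate: Callable[[str], Tuple[float, List[str]]],
           iterations: int) -> Tuple[float, str]:
    """Propose, evaluate, and archive candidates; return the best (score, source)."""
    best_score, best_source = float("-inf"), ""
    for index in range(iterations):
        source = propose(archive)            # proposer sees the uncompressed archive
        score, trace = evaluate(source)
        record_candidate(archive, index, source, trace, score)
        if score > best_score:
            best_score, best_source = score, source
    return best_score, best_source


def toy_evaluate(source: str) -> Tuple[float, List[str]]:
    """Hypothetical benchmark: harnesses that snapshot the environment succeed."""
    if "snapshot_environment()" in source:
        return 1.0, ["step 1: snapshot taken", "step 2: task solved"]
    return 0.0, ["step 1: ran tool", "step 2: ERROR tool not found"]


def toy_propose(archive: Path) -> str:
    """Hypothetical proposer: reacts to a failure line found in a raw trace."""
    if grep_archive(archive, "ERROR tool not found"):
        return "snapshot_environment()\nrun_agent()\n"
    return "run_agent()\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(search(Path(tmp), toy_propose, toy_evaluate, iterations=3))
```

In this toy run the second proposal changes only because the failure line survives verbatim in `trace.log`; a score-only archive would give `toy_propose` nothing to search for, which is the point the post makes about compressed feedback.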

Gennaro@fourweekmba·1h

No free AI lunch open.substack.com/pub/thebusines…

Gennaro@fourweekmba·2h

Another thing this “developer derangement” shows: you can’t build a sustainable business solely on the dev community, which has irrational expectations and no brand loyalty; it looks only at cost and convenience. Anthropic knows that. Why expect a company to subsidize thousands per month of your compute costs when you paid only a fraction of them?

Gonto 🤓@mgonto

Fuck you too Claude!

Gennaro@fourweekmba·2h

Correct

Chubby♨️@kimmonismus

Anthropic demonstrates how compute remains the biggest bottleneck. NVIDIA will continue to be the biggest beneficiary for the foreseeable future.

Gennaro@fourweekmba·2h

@floriandarroman @AnthropicAI Or maybe you shouldn’t expect a single company to subsidize your compute?

Florian Darroman@floriandarroman·18h

Party pooper of the year: @AnthropicAI They literally copied OpenClaw. Wait to have “enough” features. Then cut Max Plan from OpenClaw. Big loser move.

Peter Steinberger 🦞@steipete

woke up and my mentions are full of these Both me and @davemorin tried to talk sense into Anthropic, best we managed was delaying this for a week. Funny how timings match up, first they copy some popular features into their closed harness, then they lock out open source.

Gennaro@fourweekmba·2h

Too simplistic. In part true, but anyone who’s been using OpenClaw via APIs knows compute consumption can go crazy fairly fast. And on an agentic harness you don’t control, it can break your business at the growth rate Anthropic is experiencing.

Shruti@heyshrutimishra

🚨 Claude just tried to kill OpenClaw. Anthropic's new 'extra usage' pricing for third-party harnesses is a self-inflicted wound. Most users won't pay more on top of their subscription for tools they already had. They'll just... leave. This is Kimi and MiniMax's moment to steal the agentic crowd. Grok too - xAI's built for this. Even Alibaba's Qwen is shipping multi-turn tool use that just works. The builders who wired Claude into everything? They're already testing replacements. When developer goodwill evaporates, it doesn't come back. Anthropic had the best model. Now they're making you pay twice to use it. That's not a pricing adjustment - that's handing your users to competitors on a silver platter. Who's switching first - and which model are you betting on?

Gennaro@fourweekmba·3h

Again, not surprising. To me this also shows how unhealthy the ecosystem is at the moment, where users expect to be subsidized for high consumption on workflows that are probably worth nothing in the real world. Time to understand the ins and outs of your architecture and prioritize what matters. There’s no free lunch, baby.

Gennaro@fourweekmba·3h

@thekitze Or maybe you shouldn’t have expected unlimited consumption on an open harness they don’t control?

kitze 🛠️ tinkerer.club@thekitze·21h

banthropic is speedrunning their death while codex is cementing itself as the goat, incredible fumble

Gennaro@fourweekmba·3h

@Hesamation 😂

ℏεsam@Hesamation·5h

"Claude why did you delete the production database?" "oops. unga bunga."

Om Patel@om_patel5

I taught Claude to talk like a caveman to use 75% less tokens.

normal claude: ~180 tokens for a web search task
caveman claude: ~45 tokens for the same task

"I executed the web search tool" = 8 tokens
caveman version: "Tool work" = 2 tokens

every single grunt swap saves 6-10 tokens. across a FULL task that's 50-100 tokens saved

why does it work? caveman claude doesn't explain itself. it does its task first. gives the result. then stops.

no "I'd be happy to help you with that."
no "Let me search the web for you"
no more unnecessary filler words

"result. done. me stop."

50-75% burn reduction

with usage limits getting tighter every week this might be the most practical hack out there right now
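
The percentages in the post follow directly from the token counts it quotes, and a short calculation makes that visible. The helper name `reduction` is an assumption for illustration, and the inputs are the post's claimed figures, not measurements from a tokenizer.

```python
def reduction(before: int, after: int) -> float:
    """Fractional saving when a token count drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures as quoted in the post (claimed, not measured here).
task = reduction(180, 45)    # web search task: normal vs caveman
phrase = reduction(8, 2)     # "I executed the web search tool" vs "Tool work"

print(f"task: {task:.0%}, phrase: {phrase:.0%}, tokens saved per task: {180 - 45}")
# prints "task: 75%, phrase: 75%, tokens saved per task: 135"
```

Both quoted examples work out to the headline 75%, and the 180 to 45 example implies 135 tokens saved on that one task.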

Gennaro@fourweekmba·3h

Not dead, but it now happens at a higher level of abstraction: no longer on context to get a simple chatbot answer, but as system architecture

Daniel Jeffries@Dan_Jeffries1

Who said prompt engineering was dead?

Gennaro@fourweekmba·3h

@Dan_Jeffries1 Not dead, but it now happens at a higher level of abstraction: no longer on context to get a simple chatbot answer, but as system architecture

Daniel Jeffries@Dan_Jeffries1·15h

Who said prompt engineering was dead?

Om Patel@om_patel5

I taught Claude to talk like a caveman to use 75% less tokens.

normal claude: ~180 tokens for a web search task
caveman claude: ~45 tokens for the same task

"I executed the web search tool" = 8 tokens
caveman version: "Tool work" = 2 tokens

every single grunt swap saves 6-10 tokens. across a FULL task that's 50-100 tokens saved

why does it work? caveman claude doesn't explain itself. it does its task first. gives the result. then stops.

no "I'd be happy to help you with that."
no "Let me search the web for you"
no more unnecessary filler words

"result. done. me stop."

50-75% burn reduction

with usage limits getting tighter every week this might be the most practical hack out there right now

Gennaro@fourweekmba·3h

Open vs. Closed harness

Peter Steinberger 🦞@steipete

woke up and my mentions are full of these Both me and @davemorin tried to talk sense into Anthropic, best we managed was delaying this for a week. Funny how timings match up, first they copy some popular features into their closed harness, then they lock out open source.

Gennaro@fourweekmba·3h

What are you gonna do about it? Do you want intelligence to scale or not?

Wise@trikcode

Anthropic can't handle the demand they accidentally created. So instead of scaling up, they're cutting off third-party access. Frustrating from a user perspective.

Gennaro@fourweekmba·3h

Agreed on almost everything, with one key exception: Snapchat shouldn’t abandon its AR efforts. Especially now, when smart glasses become simply a remote controller for an OpenClaw-like agent that controls the computer. That means a much lighter device for builders, not consumers. @evanspiegel

Irenic Capital@IrenicCap

Some thoughts on Snap: savesnapnow.com

Gennaro@fourweekmba·3h

The end of subsidisation. To be clear, this is fair and it is also healthy. Anyone who wants to use an agent needs to develop an intuition for which models to use for specific tasks. In the end, the custom architecture you create to perform different levels of tasks is your moat.

Boris Cherny@bcherny

Starting tomorrow at 12pm PT, Claude subscriptions will no longer cover usage on third-party tools like OpenClaw. You can still use these tools with your Claude login via extra usage bundles (now available at a discount), or with a Claude API key.

Gennaro@fourweekmba·6h

AR Is Not a Consumer Device, It Is the Spatial Control Layer for Agents

nick vasilescu@nickvasiles

I connected my Meta RayBans to my OpenClaw so that I can manage my entire fleet of 20+ OpenClaw agents while out and about. It can see what I see, hear what I hear, and I can literally work an entire day while talking to my glasses out in the city in San Francisco. The future of work is here.

Gennaro@fourweekmba·7h

@SMB_Attorney 😂

SMB Attorney@SMB_Attorney·8h

Watch until the very end. I promise it will be worth it. This is amazing and hilarious. Gives you an idea of what we’re dealing with here… spoiler alert: it ain’t perfect 😂

Gennaro@fourweekmba·7h

This is the pivot AR companies haven’t caught up with yet: glasses are tools to control agents using computers, remotely. This is a business device for the builder, not a general one for the consumer.

nick vasilescu@nickvasiles

I connected my Meta RayBans to my OpenClaw so that I can manage my entire fleet of 20+ OpenClaw agents while out and about. It can see what I see, hear what I hear, and I can literally work an entire day while talking to my glasses out in the city in San Francisco. The future of work is here.
