mary

14.3K posts


@howdymary

data @marketmotionxyz | prev galaxy, brookings, schwarzman, columbia | sidequests: @jobanxiety

bicoastal · Joined February 2018
3K Following · 18.5K Followers
mary retweeted
Thomas Massie @RepThomasMassie
America First > MAGA
2.8K · 9.3K · 91.7K · 1.3M
mary retweeted
Wonderkid🦅 @Its_Faid
Speed got EMOTIONAL and blessed his taxi driver in China with $20K after realizing he sleeps in his car 😔💔 Look how he scans the car and realizes this is someone's real life… no laughs, no jokes, just pure empathy. God bless Speed ❤️
273 · 4.8K · 123.7K · 3.4M
mary retweeted
hottie thottie @scottiepoppin
math is so beautiful. these people predicted that the crew would come all the way from space and land in the water at exactly 8:07 PM EST and that's the exact same time - down to the minute - that they splashed down.
80 · 4.8K · 47K · 463.7K
mary retweeted
Brian Allen @allenanalysis
🚨 The Pentagon summoned the Pope’s ambassador to a closed-door meeting and threatened him. Named official. On record.

Undersecretary of Defense Elbridge Colby called Cardinal Christophe Pierre to the Pentagon and delivered this message: “The United States has the military power to do whatever it wants in the world. The Catholic Church had better take its side.”

Then a U.S. official invoked the Avignon Papacy — the 14th century moment when the French monarchy used military force to physically remove the Pope from Rome and bend the Church to its will. The Vatican understood the reference immediately.

The Pope’s planned visit to America for the 250th anniversary celebration was cancelled.

And then something remarkable happened. The Pope didn’t retreat. He pressed harder. He called the war unjust. He called Trump’s threats unacceptable. He told Americans to call Congress.
[3 images]
1.3K · 15.6K · 47.9K · 1.3M
mary retweeted
Rushi @rushicrypto
China’s SUVs all have TVs, refrigerators, freezers, chair massagers, and now they can charge in 9 min and have range extenders. All in the $20k range. We are being so played by the oil and auto industries. Bubbles about to be bursting all over this country.
162 · 2.7K · 18.6K · 201K
mary @howdymary
the only people who appreciate how advanced LLMs have gotten are:
- developers using parallel agent swarms
- marketers mass producing AI UGC slop
- CEOs that want to cut 70% of their workforce
Andrej Karpathy @karpathy

Judging by my tl there is a growing gap in understanding of AI capability. The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT somewhere last year and allowed it to inform their views on AI a little too much. This is a group of reactions laughing at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability in the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.

But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus comes along.

So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work.

It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions. TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems.

This part really works and has made dramatic strides because of 2 properties: 1) these domains offer explicit reward functions that are verifiable, meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.

2 · 2 · 24 · 2.2K
mary @howdymary
two of the claude skills i use the most are /prp-prd and /devfleet, so i built a version for codex

codex has codex exec for headless agents and is building native fanout (enable_fanout is listed as under development in codex features)

what's missing right now is a workflow layer, eg what to dispatch, how to split ownership, how agents allocate work to each other

$ prp-prd helps you write hypothesis driven product specs
$ devfleet splits work into packets with disjoint file ownership and dispatches parallel codex exec agents in worktrees, with a reviewer and tester verifying after

unlike claude /devfleet, this codex skill doesn't require an mcp server (it uses the native codex exec as the dispatch primitive)

enjoy!
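A minimal sketch of the dispatch pattern described above, assuming a hypothetical `Packet` shape: packets own disjoint sets of files, each packet gets its own git worktree, and one headless `codex exec` agent runs per worktree. The `Packet` fields and `.worktrees/` layout are illustrative assumptions, not the skill's actual format.

```python
import subprocess
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    name: str        # packet id, becomes the branch/worktree name
    spec: str        # prompt handed to the headless agent
    files: frozenset # files this packet owns exclusively

def check_disjoint(packets):
    """Disjoint file ownership: no file may appear in two packets."""
    seen = set()
    for p in packets:
        overlap = seen & p.files
        if overlap:
            raise ValueError(f"overlapping ownership: {overlap}")
        seen |= p.files

def dispatch(packets, run=subprocess.Popen):
    """One worktree and one headless `codex exec` agent per packet."""
    check_disjoint(packets)
    procs = []
    for p in packets:
        wt = f".worktrees/{p.name}"
        # separate branches/worktrees keep parallel edits from colliding
        subprocess.run(
            ["git", "worktree", "add", wt, "-b", f"devfleet/{p.name}"],
            check=True,
        )
        procs.append(run(["codex", "exec", p.spec], cwd=wt))
    for proc in procs:   # block until every agent finishes
        proc.wait()
```

The disjointness check is the load-bearing piece: it is what makes the parallel agents safe to merge afterwards without a coordination step.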
[image]
2 · 0 · 18 · 1.2K
mary retweeted
Peter Gao @PlanetaryGao
People are reacting emotionally to Artemis II for the same reason people react similarly to Mr. Rogers: it is a moment of pure good and hope and inspiration that is just missing from most of our lives these days
14 · 560 · 5.3K · 44.4K
mary retweeted
chl$ @chelsssseeeea
Quadruple NASA's budget immediately!
[2 images]
297 · 3.5K · 59.4K · 2.4M
mary retweeted
Teknium (e/λ) @Teknium
Very interesting!
mary @howdymary

TLDR on Meta Harnesses and a practical implementation that I built for Hermes

Hermes is an agent runtime (operating system) around a model (the brain). Meta harnesses are a way to improve the operating system, not the brain itself.

Rather than retraining the model, the meta harness continuously learns better ways to run the model by searching over runtime policy (prompt additions, tool ordering, stop heuristics, bootstrap steps, context management etc) to discover what makes the agent perform better on verifiable tasks. A lot of coding agent failure centers around the agent runtime wasting time and tokens discovering basics, or using the wrong tools / wrong context.

At the moment, Hermes does not have a research loop that treats the benchmark harness itself as something to optimize, which is the gap that this implementation addresses. This setup uses the meta harness as a research layer around benchmark harnesses, not the full product runtime.

It splits Hermes into two layers:
- hermes-agent owns the inner runtime (candidate protocol, benchmark integration, loop hooks, and archive writing)
- hermes-agent-metaharness owns the outer loop (candidate evaluation, archive analysis, baseline reuse, frontier tracking, and search)

This searches over code and policies that impact agent performance, such as:
- what bootstrap context to gather
- which tools to expose and in what order
- how many turns to allow
- which baseline to compare against
- how to rank candidate harnesses

Side note: you may have seen @Teknium previously release self-evolution; the distinction here is that self-evolution is intended to write better instructions for the agent, while the metaharness is intended to run the agent more efficiently on benchmarks.

Please try it out!

6 · 8 · 155 · 15.8K
mary retweeted
🤠 @heavensbvnny
I need more men to understand that two men crying and hugging in space after one of them announced they were naming a moon crater after the other one’s late wife is actually what peak masculinity looks like
122 · 6K · 74.9K · 468.1K
mary @howdymary
@kennytjay mine all have just 1 user (me) because i haven't marketed yet haha, soon!
1 · 0 · 1 · 232
Kenny Tjay @kennytjay
@howdymary also got three on market! 2 have about 100+ users, just shipped the third one last week. wbu?
1 · 0 · 2 · 237
mary @howdymary
TLDR on Meta Harnesses and a practical implementation that I built for Hermes

Hermes is an agent runtime (operating system) around a model (the brain). Meta harnesses are a way to improve the operating system, not the brain itself.

Rather than retraining the model, the meta harness continuously learns better ways to run the model by searching over runtime policy (prompt additions, tool ordering, stop heuristics, bootstrap steps, context management etc) to discover what makes the agent perform better on verifiable tasks. A lot of coding agent failure centers around the agent runtime wasting time and tokens discovering basics, or using the wrong tools / wrong context.

At the moment, Hermes does not have a research loop that treats the benchmark harness itself as something to optimize, which is the gap that this implementation addresses. This setup uses the meta harness as a research layer around benchmark harnesses, not the full product runtime.

It splits Hermes into two layers:
- hermes-agent owns the inner runtime (candidate protocol, benchmark integration, loop hooks, and archive writing)
- hermes-agent-metaharness owns the outer loop (candidate evaluation, archive analysis, baseline reuse, frontier tracking, and search)

This searches over code and policies that impact agent performance, such as:
- what bootstrap context to gather
- which tools to expose and in what order
- how many turns to allow
- which baseline to compare against
- how to rank candidate harnesses

Side note: you may have seen @Teknium previously release self-evolution; the distinction here is that self-evolution is intended to write better instructions for the agent, while the metaharness is intended to run the agent more efficiently on benchmarks.

Please try it out!
[image]
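The outer loop described above can be sketched as a small search over runtime-policy candidates: score each on a verifiable benchmark, archive every trial, and track the frontier. The `Candidate` knobs and `evaluate` signature here are illustrative assumptions, not the actual hermes-agent-metaharness API.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    max_turns: int   # how many turns to allow
    tools: tuple     # which tools to expose, and in what order
    bootstrap: bool  # whether to gather bootstrap context first

def mutate(c, rng):
    """Propose a neighbouring runtime policy by tweaking the knobs."""
    return Candidate(
        max_turns=max(1, c.max_turns + rng.choice([-2, 2])),
        tools=tuple(rng.sample(list(c.tools), len(c.tools))),
        bootstrap=(not c.bootstrap) if rng.random() < 0.3 else c.bootstrap,
    )

def search(evaluate, seed, steps=20, rng=None):
    """Hill-climb over policies: keep the frontier, archive every trial."""
    rng = rng or random.Random(0)
    best, best_score = seed, evaluate(seed)
    archive = [(seed, best_score)]   # archive writing: every candidate kept
    for _ in range(steps):
        cand = mutate(best, rng)
        score = evaluate(cand)       # verifiable benchmark score
        archive.append((cand, score))
        if score > best_score:       # frontier tracking
            best, best_score = cand, score
    return best, archive
```

In the real system `evaluate` would run the benchmark harness with the candidate policy applied; any scoring function with a comparable scalar output slots in here.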
Deedy @deedydas

Meta Harnesses is Autoresearch on steroids. Something I've been exploring recently is to get long running agents to hill climb on a verifiable task to continuously improve without my intervention. Karpathy's Autoresearch did this pretty well on specific tasks, but this weekend I tried Meta Harnesses, which moves one level of abstraction up.

What does Meta Harness do?
Autoresearch can be used in a harness like Claude Code / Codex to generate experiments to try, evaluate results, and continue looping. Meta Harness generates a harness itself that optimizes on a task or a set of tasks. Here, we define a harness as "a single-file Python program that modifies task-specific prompting, retrieval, memory, and orchestration logic". The idea is that LLMs are very powerful today, but to harness [pun intended] their power, you need to give them the right prompts and context. Meta Harnesses automates coming up with the right prompts and the right way to retrieve context to solve a problem.

Where did this idea come from?
This is from a paper from Stanford and the author of DSPy written last week. The paper shows fantastic performance on 3 tasks: text classification, math reasoning (IMO level problems) and coding (Terminal Bench 2.0), far outperforming traditional harnesses. The discovered harnesses are interesting: the math one, for example, splits the logic into different categories (Combinatorics, Geometry, Number Theory, Algebra) and prompts and retrieves context differently for each. The coding harness, amongst other things, pre-processes the tools available in the environment to save exploratory turns.

When should you use and not use it?
Meta Harnesses seem pretty useful for tackling a specific but wide set of problems where the result is verifiable. In contrast, when I tried it on a specific task like chess, it arbitrarily divides the problem into separate tasks - opening, mid game, end game - and creates different approaches for each. This "works" but isn't really clean, because we believe there should be one approach that does all three. It does far better on things like examinations (JEE, Gaokao) where it splits problems into categories and tackles each category with different strategies.

This paper covers a pretty light version of what a harness means. In the future, we can split tasks into harnesses that have access to specific kinds of data, specific toolchains, and various models to get even better results. Overall, a pretty cool applied AI approach to hill climb a verifiable task in a specific domain with variety within the problem space.
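A toy sketch of the kind of discovered harness described here: a single-file program that classifies a math problem into a category and routes it to a category-specific prompt before any model call. The keyword lists and prompt strings are illustrative assumptions, not the paper's actual discovered harness.

```python
# Category-specific prompting, in the spirit of the discovered math harness.
CATEGORY_PROMPTS = {
    "combinatorics": "Count carefully; consider bijections and symmetry.",
    "geometry": "Set up coordinates; check degenerate configurations.",
    "number theory": "Work modulo small primes first.",
    "algebra": "Look for factorizations and substitutions.",
}

# Crude keyword routing: a real discovered harness would learn this split.
KEYWORDS = {
    "combinatorics": ("count", "arrange", "choose", "permutation"),
    "geometry": ("triangle", "circle", "angle", "point"),
    "number theory": ("divisible", "prime", "integer", "modulo"),
    "algebra": ("polynomial", "equation", "root", "inequality"),
}

def classify(problem):
    """Pick the category whose keywords appear most often in the problem."""
    text = problem.lower()
    scores = {cat: sum(kw in text for kw in kws)
              for cat, kws in KEYWORDS.items()}
    return max(scores, key=scores.get)

def build_prompt(problem):
    """Route the problem to its category's strategy before the model call."""
    cat = classify(problem)
    return f"[{cat}] {CATEGORY_PROMPTS[cat]}\n\nProblem: {problem}"
```

The point of the sketch is the shape, not the classifier: the harness spends its logic deciding *how* to prompt per problem type, which is exactly the degree of freedom the meta-level search optimizes.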

16 · 27 · 408 · 45.8K
mary @howdymary
@MattVMacfarlane matt is perpetually five steps ahead of everyone 🙈
1 · 0 · 0 · 605