hands

414 posts


@handsdiff

founder @precursorlabs @combinatortrade

New York, NY · Joined June 2025

305 Following · 399 Followers

Pinned Tweet

hands@handsdiff·Apr 5

btw I think @NousResearch is doing a fantastic job verticalizing (model + harness + inference)
HARNESS: ship (1) the best agent harness for (2) the devs that are consistently 3-6 months ahead of mainstream devs
INFERENCE: serve inference at scale via an @OpenRouter wrapper
MODEL: work with leading companies like @MiniMax_AI and smaller OS devs like @kaiostephens and @DJLougen to custom train models on the harness, making it recursively more effective
Once you have the OS community optimizing models for YOUR harness while larger labs increasingly CLOSE their ecosystem, the winner seems obvious.


156

11.4K

hands@handsdiff·2d

@googrish ever played around with inverse RL/assistance games for yourself or customers? the idea of the model learning a hidden reward function based on behavior patterns seems quite interesting


girish@googrish·2d

@handsdiff it really varies use-case by use-case. for some things like search it’s much easier, for more open ended stuff it’s definitely a bit more tricky


girish@googrish·3d

lots of talk about agi, asi, rsi but ask any frontier LLM to roll a die and it will almost always say "4." claude, gpt, kimi - doesn't matter, 4.4.4.4. so here's how i post-trained a model to reliably roll a die (i.e. each number ~1/6th of the time) & why it's a nice sandbox for one of the most interesting problems in rl i.e. getting a model to actually explore instead of just following strategies it already knows 🧵


985

hands@handsdiff·2d

Dario: We can't let China use frontier AI authoritatively, and we need to avoid prisoner's dilemma defection in an AI race. Also Dario: Uses frontier AI authoritatively, and defects in prisoner's dilemma with OpenAI.


hands@handsdiff·2d

@googrish do you think the ability for your potential customers to actually codify what they want is a bottleneck?


girish@googrish·2d

@handsdiff thanks!!


hands retweeted

Richard Ngo@RichardMCNgo·2d

@tszzl yo stop calling people “AGI pilled” as if it’s a compliment

Richard Ngo@RichardMCNgo

The AI safety community constructed a memeplex in which “taking AGI seriously” was a prerequisite for being a serious and good person. When inside this memeplex (as many at Anthropic, some at OpenAI, and a few at DeepMind are) your vision narrows until the world feels extremely constrained. The whole future seems to flow through the “one ring” of controlling recursive self-improvement. And so even when you worry about AI itself seizing that one ring, you can’t generate better strategies than trying to control it yourself (directly via an AGI company, or indirectly via AGI governance). I’m not saying this is a pure hyperstition. There’s a core truth underlying this perspective: AI will become extremely intelligent and capable, much more than it is today. But the current world is much more spacious and human-empowering than the future which Eliezer originally envisioned (a “brain in a box in a basement” taking over the world by surprise). And it would be even more spacious if this memeplex weren’t active. For example, Satya and Mark and Sundar only started taking AGI seriously because OpenAI forced them to—and even now they don’t really believe in superintelligence—and even if they did they couldn’t get most of their employees on board. Imagine how chill a “race” between Microsoft and Meta and Google would have been, compared with what we have today: Dario and Sam deep in the “one ring” memeplex while also personally loathing each other. So the one ring memeplex has an escalating life-cycle. It infects people by letting them harness the narrative that they’re good people for taking AGI seriously, and that making other people take AGI seriously is a boon for the world (despite how terribly that’s gone so far). Then it shuts off their imagination—any sparks of creativity or plans that don’t steer towards the one ring are quickly shut down. Instead they make ChatGPT or the METR graph or other recruiting tools for the memeplex. 
And yes, they’ll acknowledge that previous versions of the memeplex were too extreme, and led to overly constricted action. But we don’t have time to worry about that, they’ll say, because AGI is coming by 2027/2028, and that’s the end of history. Somehow, though, almost everyone with that view has only a vibes-based definition of AGI. They don’t believe in Dyson spheres by 2028, or self-replicating nanotech by 2028, or brain emulations by 2028. They mostly can’t make concrete predictions, except that it’ll be enough AI that it puts all their plans on a deadline. (Shout-out to @DKokotajlo and @paulfchristiano though, who do make concrete predictions about things going crazy soon.) It seems very hard to break out of this memeplex without just giving up. David Holz is maybe the world champion of that—the only person who was in a position to race for AGI and consciously turned away. Various agent foundations researchers have carved out space to think real thoughts, not the kind of panicky stabbing in the dark that usually passes for safety research. A few others (e.g. Salamon, Hoffman, Vassar, Andre, Sahil, Davidad) are pursuing more unusual paths. And of the people who burned out, I expect some will reorient to doing creative thinking. For others, the main takeaway: yes, the future of AI will be wild. But so far it’s increased peak human agency, and openness to this trend continuing over the next decade will allow you to start creating something worth creating.


hands@handsdiff·2d

The streets are saying we will have a Mythos-level open weights model by October. Around end of the year the money printer will start to really rip once it's clear we need to outrace China on energy and compute buildouts so they can't keep catching up via distillation and algorithm commodification. Usually it's somewhat difficult to get a good ROI on printing money even if you have currency dominance but GPUs + energy in an RSI era seem pretty slam dunk. As a result, AI supply chain stocks will resemble frontier model intelligence charts, up and to the right, through 2027 (and likely 2028 for the election). (SPCX unlocks largely end in December of this year, which also maps to crypto bottom, start saving) Some sobriety should come in around 2029. AI will not have taken over even though people expected it to like 2 years ago, but now the labs are out of excuses. (The reason will likely be that more tightly scoped, esoteric reward modeling prevents "actual" economic surplus).


hands@handsdiff·3d

@botblastcap seems like you still had to ask it to 'research honestly'. so is it your context or your prompting that's doing the heavy lifting


botblastcap@botblastcap·3d

another thing i forgot to add was how much more useful/relevant the AI responses are once you get access to private inference without worrying your information is being harvested as training data, something unlocks. you feel like you can share sensitive stuff with your agent like it's your trusted friend. answers become hyperpersonalized. not "personalized" as in knowing your name from a template, but personalized as in it actually knows your situation, your positions, your risk tolerance, your preferences.
for example, i was evaluating a position in XPL recently. didn't ask "how does it fit in my portfolio" specifically, didn't prompt-engineer anything. my agent just ran the analysis on its own, pulled from what it already knew about me through memory (my risk appetite, positions, circle of confidence, etc) and surfaced what was relevant. guess what? i'll trust it over any random KOLs on CT any day. p.s. didn't long XPL at $0.08 goddamnit coz of it lmaooo but you get the point
most times, people think stronger model = better responses. dead wrong. model + context (minimized assumptions) = better responses
"better" means relevance to your specific situation and surfacing blind spots. it's what a trusted advisor who's known you for years would say, not passing remarks from a smart stranger. and this only happens with appropriate context. if you're still running your life through a model that forgets stuff about you every session, you're literally playing on hard mode lol
read my QT and start working with AI to build your own second brain. i swear i'm not tryna sell you anything.

botblastcap@botblastcap

have not touched perplexity/claude/chatgpt for 2 weeks now and see no reason why i would. running my fully private (almost) AI personal assistant stack that handles research, portfolio management, journalling, coding, etc. here's the full stack:
- Hermes agent (agent harness)
- Venice (private, uncensored AI inference API)
- Honcho (behavioral analysis/long term memory)
- Obsidian (knowledge vault)
- qmd (on device search engine, both text/vector search)
- browser-harness (CDP browser automation, agent browses the web like a human)
- Tavily (search API)
- Codex (coding, powered by private coding models on Venice too)
all on a 16gb macbook pro. there are currently only 2 touch points (AI inference & search API) that got offloaded to cloud -- AI inference due to hardware constraints, search API has no workarounds. but i'm opting for purely private/e2e model choices on Venice and queries get filtered by the agent before hitting the search API. this way, we keep data leakage to a minimum.
it's also pretty sick that locking base:0xacfe6019ed1a7dc6f7b508c02d1b04ec88cc21bf grants you daily refreshing compute credits denominated in base:0xf4d97f2da56e8c3098f3a8d538db630a2606a024 -- meaning i'm running private, anonymized SOTA inference daily for free.
of course, full privacy is definitely the end goal. i'm slowly working towards running everything completely local while offloading only high-complexity inference to Venice. looking into stronger workhorses (eyeing upcoming m5 studio ultra release/refurbed 512gb options for SOTA models) to close that last gap for a complete e2e private intelligence stack.
firmly believe owning your data in the age of AI surveillance and big tech data centralization is as important as owning your own assets. like we've said for years: "not your keys, not your coins". you don't custody all your money with a bank, why custody your thoughts and private information with OpenAI/Anthropic?
if you haven't already started, highly recommend you look into it. remember, that's why we got into crypto in the first place.


5.1K

hands@handsdiff·3d

AI persuades better than any human can, because its arguments have higher fact density, which has a 0.9 correlation with persuasion. (intuitively checks out) My read is that it's exhausting to talk to AI because RLHF implicitly rewarded fact density and that continues to accelerate with RLAIF.

Kobi Hackenburg@KobiHackenburg

New w/ @AISecurityInst & @UniofOxford: Frontier AI can now out-persuade expert humans in conversation - incl. world-champ debaters and professional canvassers. This held even when humans chose their topics, prepared in advance, and competed for £1,000 prizes 🧵


hands@handsdiff·4d

Didn't think I'd get this far. Here goes:
- need to be able to quickly write down thoughts/ideas on mobile (one button click) and have it sync across devices. this should go to a default note that i can then parse later as needed into my more structured notes.
- need better UX of structuring multiple note files where lines in one notes file can point to either a full other note or specific line in another note file (basically obsidian's value prop, but their random hashes are gross)
- absolutely 100% need git on the set of notes. otherwise I start hoarding ideas to not lose them and the product becomes cluttered and useless.
- the AI should be able to read/fetch git history as needed (perhaps a toggle) to answer questions/propose next steps, I just don't want to see it on my view
- don't want 'sources' directly. i want my notes, which will be interleaved with links i drop in. the AI should index the content of those links as needed, as well as my ideas, as the default. (i know i can make notes sources but it's clunkier imo)
- would be amazing if NotebookLM could parse gemini chats when i drop those links into my notes. a lot of the time i'll talk to gemini to research something, then throw the chat link into obsidian so I can reference it later, but there's no existing solution for me to be able to later talk to an AI that has background context of a set of chats
- finally, want to be able to share my entire set of notes on a premade UI like Quartz, along with the git history sorted by most recent first, so that people can easily understand what I'm working on. even better if they have access to the AI chat as well to interact with my process.
There's also a ton of more 'vision' features I want, orbiting around CIRL/interactive learning, passive context collection, making the answering AI more agentic, social networking around the shared context/ideas, agent to agent collaboration, but that comes later. As a simple example, giving the AI a secure VM that it could code up small experiments on. I'll often read a paper and want to implement a simple version of the ideas but the friction is so high. And obviously I'd want this to be auto-shared.
Overall, I'm trying to maximally expose the process of my work so that AI can be maximally useful to me, now and in the future. So much friction associated with getting my granular context persisted and legible in real time. I think Obsidian does decent with note taking, sharing, and git persistence if you set it up, while NotebookLM does well with AI augmentation and link parsing. At a high level, I think what's needed is largely an Obsidian frontend but NotebookLM backend.


2.3K

NotebookLM@NotebookLM·4d

@handsdiff @obsdmd Us? Maybe? What's the feature gap you need filled?


115

18.9K

hands@handsdiff·4d

i need something halfway between @obsdmd and @NotebookLM. who is building this?


19.6K

hands@handsdiff·4d

in the early stages, market froth is indistinguishable from RSI


hands@handsdiff·4d

Waiting to see if this is actually rolled out or not, but 12M context would be a timeline decrease for sure

Alexander Whedon@alex_whedon

Here is the technical report on SubQ 1.1 Small. subq.ai/subq-1-1-small… This is the second iteration on our Subquadratic Sparse Attention (SSA) model, and the first to be deployed with design partners in the coming weeks. The results are compelling and verified by @AppenResearch.
- Near-perfect long-context retrieval up to 12M tokens on the needle-in-a-haystack test, with up to nearly 1,000x attention compute reduction.
- A balance of long-context optimization and general reasoning ability, with strong performance retained across knowledge, coding, and non-coding enterprise agent benchmarks.
- At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
These results highlight a significant scaling advantage thanks to the efficiency gains from the SSA architecture. We included some details and learnings from the development process which may be helpful to the community. Comment with questions, I’ll try to respond!


130


hands@handsdiff·5d

RSI but for whatever you care about

will brown@willccbb

been beating this drum since early 2025, seems like people are starting to see why it's so important :) RL works -> "train or get trained on" -> open models + post-training infra are the path to institutional flywheels + democratization of AI progress


168

hands@handsdiff·Jun 12

Can’t get over the fact that enterprises “doing their own RL” feels like the equivalent to businesses “building their own railroads”. One difference is obviously that open source models have no railroad equivalent. Another difference is that my own railroad would at best likely be the same as an existing railroad, whereas custom RL promises to improve performance.

elie@eliebakouch

subagents, teams of agents etc. will be first class citizens soon (if not already) two things here: 1) you want to maximize token efficiency even more 2) training/serving on your own harness gives you an even bigger boost than before benchmarks in the opus 4.8 model card show that for now it's a latency vs cost tradeoff, but imo this will likely shift to intelligence/autonomy vs cost (think dynamic workflows or agent swarms). and for cost not to blow up too much, you need to maximize token efficiency even more we'll also likely see huge gaps on more complex/autonomous benchmarks whether they use these features or not, a bit like when tool use was introduced. on those i'd expect third party harnesses to struggle to keep up with closed source models/harnesses this is also a case for open source models (and maybe open harnesses like codex?). if you want deep control over this, doing your own RL to train the model in the environment you want it to operate in feels more important than ever


hands@handsdiff·Jun 9

@sethkarten thank you for your wisdom


Seth Karten@sethkarten·Jun 9

@handsdiff agent action->env looped is the only paradigm. the real new paradigm is realtime envs with agent actions


Seth Karten@sethkarten·Jun 9

Frontier models will not be the leaders at running businesses. Fable 5 scores lower than Claude Opus 4.8 in vending-bench... In more complex economic scenarios in the agent bazaar, marginally black swan events bankrupt even the best agents. When considering multi-agent evals, you need to consider tail risk scenarios and adversarial populations to adequately evaluate the potential of an agent to run your business


2.1K

hands@handsdiff·Jun 9

@tenobrus what is this from?


1.6K

Tenobrus (→vibecamp)@tenobrus·Jun 9

holy shit new scaling law smarter models are more likely to agree with yudkowsky on decision theory


420

35.8K

hands@handsdiff·Jun 9

@karpathy damn we really lost a neutral voice. gone but not forgotten.


Andrej Karpathy@karpathy·Jun 9

This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also, this is a major-version-bump-deserving step change forward (imo of the same order as Claude 4.5 was in November), peaking especially for long problem-solving sessions on very difficult problems. You can give it a lot more ambitious tasks than what you're used to, the model "gets it" and it will just go, and it's never felt this tempting to stop looking at the code at all (but don't do this in prod!). The model still has quirks that people will run into and the safeguards are configured to be a little too trigger happy for launch, which can hopefully be tuned over time. I feel a lot of things changing as working software increasingly comes out on a tap. The Jevons paradox kicks in and I feel my own demand for software growing substantially. You can ask for anything - explainers, visualizers, dashboards, bespoke single-use apps (e.g. a full wandb that is hyper-specific just for your project), you can 10X your test suite, auto-optimize code, run giant research projects with custom HTML for the results, anything! "Free your mind" (Matrix ref). Really looking forward to all the things people build!

Claude@claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision. The longer and more complex the task, the larger Fable 5’s lead over our other models.


1.3K

2.4K

25.4K

2.8M

hands@handsdiff·Jun 9

@sethkarten Do you think it's worth the effort to train on a paradigm other than user/assistant for LLM MARL? To what extent is embodiment necessary for the LLM to participate in collaborative settings rather than help?


Seth Karten@sethkarten·Jun 9

great question. I think most training is either performed multi-agentically to work cooperatively on a SWE task (RLM-like) or as two-player zero sum reasoning on environments like chess or pokemon. in MARL for 3+ player games, we think of the population that we are training on as a part of the environment. If you look through the system cards, you will not find any environment that considers 3+ player and general sum games. my own evals confirm this. there is still a lot of opportunity to execute this well

