henry
@h14hdotcom
183 posts
Joined May 2025
80 Following · 10 Followers
henry
henry@h14hdotcom·
@AnthropicAI is making users pay for their own inability to design SDKs/APIs. Labeling all Agent SDK use as "programmatic" instead of adding detection heuristics and basic throttling is offensively lazy. Burned reputation with engineers faster than engineers burn Claude tokens.
0 replies · 0 reposts · 0 likes · 1 view
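For what "basic throttling" could mean in practice, here is a minimal client-side token-bucket sketch. It is a generic illustration of the technique, not Anthropic's (or anyone's) actual rate-limiting logic, and every name in it is invented:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: allow bursts up to `capacity`,
    refill at `rate` tokens per second. Hypothetical sketch only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate (tokens/sec)
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((cost - self.tokens) / self.rate)

# Usage: cap agent-driven requests to ~2/sec with bursts of 10.
bucket = TokenBucket(rate=2.0, capacity=10.0)
# for request in agent_requests: bucket.acquire(); send(request)
```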
henry
henry@h14hdotcom·
@theo Using Pi because of how easily and deeply it can be customized. I love how simple it is to do granular context management like in my "effect mode" package: pi.dev/packages/effec… No GUI wrapper for pi works quite right with the commands I find essential like /tree
0 replies · 0 reposts · 2 likes · 326 views
henry
henry@h14hdotcom·
@CFNCotton711 @ArtificialAnlys Hence the "hardware-normalized" caveat. What I want is to understand the *relative* performance of open weight models compared to one another on equivalent hardware. E.g. pick a GPU on modal.com and run open models thru a benchmark on it with performance profiling.
0 replies · 0 reposts · 0 likes · 6 views
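A minimal sketch of what that hardware-normalized setup could look like, using Modal to pin a single GPU type and time generation identically for each model. The model IDs, prompt, and token count below are placeholders for illustration, not a measured benchmark:

```python
import time
import modal

# Pin one environment + one GPU type so results are comparable.
image = modal.Image.debian_slim().pip_install("torch", "transformers")
app = modal.App("open-model-bench")

@app.function(gpu="A100", image=image, timeout=1800)
def bench(model_id: str, n_tokens: int = 256) -> dict:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt").to("cuda")

    # Greedy decoding with a fixed budget, timed on the same hardware.
    start = time.monotonic()
    out = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.monotonic() - start

    generated = out.shape[-1] - inputs["input_ids"].shape[-1]
    return {"model": model_id, "tokens_per_sec": round(generated / elapsed, 1)}

@app.local_entrypoint()
def main():
    # Placeholder model IDs; swap in the open weights models under test.
    for model_id in ["org-a/model-a", "org-b/model-b"]:
        print(bench.remote(model_id))
```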
Chris Cotton
Chris Cotton@CFNCotton711·
@h14hdotcom @ArtificialAnlys The speed the tokens come out at really is nuanced. If you're talking about the API, it's just a balance between price and performance on their side. If you want faster you can use something like Cerebras; you can get 2,000 tokens a second. Locally it depends on your hardware.
1 reply · 0 reposts · 1 like · 51 views
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
OpenBMB, a Tsinghua University / ModelBest open weights collaboration, has released MiniCPM-V 4.6 1.3B Instruct, a tiny, non-reasoning model that scores 13 on the Artificial Analysis Intelligence Index.

This model sits 3 points ahead of Qwen3.5 0.8B (Non-reasoning, 10) and 2 points behind Qwen3.5 2B (Non-reasoning, 15) on the Intelligence Index, establishing a new Pareto-optimal point on our Intelligence vs. Total Parameters chart. Tiny models are useful for efficient inference and on-device use cases.

MiniCPM-V 4.6 1.3B Instruct is a vision-language model that supports text, image, and video input with text output. @OpenBMB is a China-based lab jointly founded in 2022 by Tsinghua University’s NLP Lab and ModelBest Inc. The model’s weights have been released under an Apache 2.0 license on Hugging Face.

Key results:
➤ At 1.3B parameters, MiniCPM-V 4.6 1.3B Instruct scores 13 on the Artificial Analysis Intelligence Index, the highest for any open weights model under 2B parameters. The next-most-intelligent open weights model at comparable scale, Qwen3.5 0.8B (Reasoning, 11), used 43x as many tokens to run the Intelligence Index; Qwen3.5 2B, which scores 16 (Reasoning) and 15 (Non-reasoning), requires 1.7x as many parameters (2.27B). MiniCPM-V 4.6 1.3B Instruct also tops sub-2B open weights models on MMMU-Pro, scoring 38%.
➤ MiniCPM-V 4.6 1.3B Instruct extends the open weights Pareto frontier for Intelligence vs. Total Parameters. Because the model is dense, total and active parameter counts are both 1.3B, so it pushes both frontiers. The next-most-intelligent sub-2B model (Qwen3.5 0.8B (Reasoning), 11) lands 2 points behind, despite also using a reasoning mode.
➤ MiniCPM-V 4.6 1.3B Instruct is highly token efficient: it used just 5.4M output tokens to run the Intelligence Index, ~19x fewer than Qwen3.5 0.8B (Non-reasoning, 101M) and ~43x fewer than Qwen3.5 0.8B (Reasoning, 233M). This is the lowest output token count measured for any open weights model under 4B total parameters scoring 10 or above on the Index (next-lowest is Ministral 3 3B at 15.5M).
➤ MiniCPM-V 4.6 1.3B Instruct supports native multimodal input, including text, image, and video, and scores 38% on MMMU-Pro. This is the highest visual reasoning score measured for any open weights model under 2B parameters, ahead of LFM2.5-VL-1.6B (27%) and Qwen3.5 0.8B (Non-reasoning, 26%). Video input at this parameter scale is uncommon.
➤ Knowledge recall is low, in line with other sub-2B models. AA-Omniscience is -85, in the typical range for sub-2B non-reasoning models (Qwen3.5 0.8B (Non-reasoning) at -89, Exaone 4.0 1.2B (Non-reasoning) at -83), and 2 points behind Qwen3.5 2B (Non-reasoning) at -83 (1.7x the parameter count).

Additional model details:
➤ Size: 1.3B total parameters (dense)
➤ Context window: 262K
➤ Precision: BF16
➤ License: Apache 2.0
➤ Providers: No confirmed providers on release
Artificial Analysis tweet media
10 replies · 24 reposts · 230 likes · 283.3K views
henry
henry@h14hdotcom·
@ArtificialAnlys Open models that can fit comfortably on 32-64GB RAM would also be lovely to see. Really curious how the local coding agent experience stacks up to the frontier/cloud experience.
0 replies · 0 reposts · 0 likes · 3 views
henry
henry@h14hdotcom·
@ArtificialAnlys Would love to see alt (& non-)reasoning levels for the frontier models. Lower reasoning often feels better, and I'm curious whether the measurements affirm this. Would also love to see models that are on the Pareto frontier for IQ/cost, like V4 Flash, M2.7, V2.5-Pro & Grok 4.3
1 reply · 0 reposts · 0 likes · 211 views
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform across 3 leading benchmarks, plus token usage, cost and more.

When developers use AI to code they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance. The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA: 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2: 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning. 5 tasks were filtered due to environment incompatibility
➤ SWE-Atlas-QnA: 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers

Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open weights result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: it scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models this was driven by high token usage, and in GPT-5.5’s case also by a relatively higher per-token cost.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: they range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model, given cached inputs are typically <50% of the API price of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. This is driven by differences in average turns per task, token usage and API serving speed. Opus 4.7 needed materially fewer turns per task than any other model, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open weights results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showing they have made substantial post-training gains.

This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
Artificial Analysis tweet media
123 replies · 168 reposts · 1.5K likes · 1.7M views
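The cache-pricing point is easy to make concrete. A back-of-the-envelope sketch; the prices, token split, and hit rates below are invented for illustration, since the thread doesn't publish per-combination splits:

```python
def cost_per_task(
    input_tokens: float,
    output_tokens: float,
    cache_hit_rate: float,
    input_price: float,    # $ per 1M uncached input tokens
    cached_price: float,   # $ per 1M cached input tokens
    output_price: float,   # $ per 1M output tokens
) -> float:
    """Blend cached and uncached input pricing, then add output cost."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (cached * cached_price + uncached * input_price + output_tokens * output_price) / 1e6

# Illustrative only: 3M input tokens/task, 100K output tokens/task,
# $3/M input, $0.30/M cached, $15/M output.
hi = cost_per_task(3e6, 100_000, 0.90, 3.0, 0.30, 15.0)
lo = cost_per_task(3e6, 100_000, 0.80, 3.0, 0.30, 15.0)
print(f"90% hits: ${hi:.2f}/task, 80% hits: ${lo:.2f}/task")
# -> 90% hits: $3.21/task, 80% hits: $4.02/task
```

Even a 10-point swing in cache hit rate moves the per-task cost by roughly 25% under these assumed prices, which is why harness prompt structure and provider routing matter to the economics.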
henry
henry@h14hdotcom·
@badlogicgames @vboykis It's an interesting correlation that a big focus of both the Opus 4.7 and GPT-5.5 releases was efficiency in the form of "IQ per token". It does make me wonder whether this signals a transition away from providing the "biggest and best" towards the most efficient "good enough".
0 replies · 0 reposts · 1 like · 58 views
Mario Zechner
Mario Zechner@badlogicgames·
@vboykis fwiw, the HN crowd's been surprisingly ai-skeptical. so not sure this is a good measure for a vibe shift. the big labs and AI adjacent corps are definitely not going in that direction with their marketing, e.g. on here.
3 replies · 0 reposts · 39 likes · 2.1K views
vicki
vicki@vboykis·
Legitimately feels like an unquantifiable vibe shift the last few weeks where the pendulum is swinging back to reasonable takes and people experimenting with model choice 🙏
vicki tweet media
17 replies · 21 reposts · 216 likes · 18K views
henry
henry@h14hdotcom·
Also: Take feedback gracefully. If you react to something like this defensively, you're putting too much energy into ego-protection and not enough into self-improvement.
Theo - t3.gg@theo

@josefbender_ Make better content.

0 replies · 0 reposts · 0 likes · 8 views
henry
henry@h14hdotcom·
@ArtificialAnlys I really want to see Pi.dev from @badlogicgames. IMO it's the best open-source agent harness, but objectively, it's got the least "secret sauce" of any viable daily-use harness. I think this makes it an excellent baseline comparison point for other harnesses.
1 reply · 0 reposts · 20 likes · 1.2K views
trash
trash@trashh_dev·
what’s the best linux distro to rice like i have never talked to the opposite sex
134 replies · 3 reposts · 528 likes · 48.3K views
kache
kache@yacineMTB·
is anyone vibecoding making actual cool stuff or is it still all mostly slop
799 replies · 53 reposts · 4.1K likes · 546.6K views
henry
henry@h14hdotcom·
FYI: Caveman mode is great for chat.
henry@h14hdotcom

@badlogicgames The only use-case I like it for is chat. Not to save tokens, just to get a consistent "personality" and less fluff to read. I find that Gemini 3.1 Pro is much better to talk to as a caveman than by default. Quick example attached.

0 replies · 0 reposts · 0 likes · 25 views
henry
henry@h14hdotcom·
@badlogicgames The only use-case I like it for is chat. Not to save tokens, just to get a consistent "personality" and less fluff to read. I find that Gemini 3.1 Pro is much better to talk to as a caveman than by default. Quick example attached.
henry tweet media
henry tweet media
0 replies · 0 reposts · 0 likes · 87 views
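For the curious, the general pattern behind a "caveman mode" is just a persona instruction pinned at the top of the conversation. A minimal sketch against an OpenAI-compatible chat API; the instruction text and model name are invented placeholders, and this is not pi's actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint + API key

# Invented instruction; pi's real caveman prompt is not reproduced here.
CAVEMAN = (
    "Speak like caveman. Short words. No fluff, no hedging, no lists "
    "unless asked. Answer first, detail only if asked."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": CAVEMAN},
        {"role": "user", "content": "Why is my Postgres query slow?"},
    ],
)
print(resp.choices[0].message.content)
```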
Mario Zechner
Mario Zechner@badlogicgames·
TIL about "caveman mode" to "save tokens". how many tokens in a session are actually model output? i think i'll become a gardener.
49 replies · 3 reposts · 273 likes · 32.1K views
henry
henry@h14hdotcom·
@badlogicgames We're probably not far from a new "Agentic Manifesto" that outlines what pragmatic software work looks like in this new era. Neither Waterfall nor Agile seem like the right patterns, but I'd bet elements of each will be carried forward in some capacity.
0 replies · 0 reposts · 0 likes · 14 views
henry
henry@h14hdotcom·
@badlogicgames Agile was a pragmatic response to Waterfall because it was realistic about human capabilities when working on complex projects. Now that human capabilities have increased dramatically, it's not surprising that older "unrealistic" working patterns are suddenly a lot more viable.
1 reply · 0 reposts · 0 likes · 151 views
Mario Zechner
Mario Zechner@badlogicgames·
it is absolutely crazy to me that our entire industry has succumbed to hyper waterfall. because that's what y'all are doing with your massive plans and beads and dark factories. have you learned nothing?
Matt Pocock@mattpocockuk

The more I replace plans with prototypes, the better the outputs

Who'd have thought that low fidelity prototypes were better than walls of spec

Oh yeah, the entire industry for 20 years

Stop going against decades of knowledge because someone in SF shipped it as a 'mode'

45 replies · 41 reposts · 894 likes · 102.2K views
henry
henry@h14hdotcom·
@cnakazawa They aren't docs tho. They're tuning instructions. LLMs are big balls of knowledge, and loading a skill at the top of a convo lets you deliberately prioritize a subset of that knowledge. Docs are for adding new knowledge, skills are for filtering out existing knowledge.
0 replies · 0 reposts · 0 likes · 51 views
Christoph Nakazawa
Christoph Nakazawa@cnakazawa·
I really don't get the hype about skills. They are just docs. Just write docs and ship them inside your packages.
67 replies · 15 reposts · 522 likes · 68K views
henry
henry@h14hdotcom·
@expo @convex Throw in @clerk and I genuinely struggle to think how building real, money-making cross-platform apps could get any easier.
0 replies · 0 reposts · 1 like · 32 views
DHH
DHH@dhh·
I've been driving GPT-5.5 on low reasoning for the last week+ and it's very good, very efficient. Haven't been tempted to reach for Opus at all. And it's more succinct than Kimi too. Huge leap forward for @OpenAI 👌
157 replies · 138 reposts · 4K likes · 275.2K views