henry
@h14hdotcom
183 posts
Joined May 2025
80 Following · 10 Followers
henry
henry@h14hdotcom·
@AnthropicAI is making users pay for their own inability to design SDKs/APIs. Labeling all Agent SDK use as "programmatic" instead of adding detection heuristics and basic throttling is offensively lazy. Burned reputation with engineers faster than engineers burn Claude tokens.
0 replies · 0 reposts · 0 likes · 1 view
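For what "basic throttling" could mean in practice, here is a minimal client-side token-bucket sketch. It is a generic illustration of the technique, not Anthropic's (or anyone's) actual rate-limiting logic, and every name in it is invented:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: allow bursts up to `capacity`,
    refill at `rate` tokens per second. Hypothetical sketch only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate (tokens/sec)
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((cost - self.tokens) / self.rate)

# Usage: cap agent-driven requests to ~2/sec with bursts of 10.
bucket = TokenBucket(rate=2.0, capacity=10.0)
# for request in agent_requests: bucket.acquire(); send(request)
```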
henry
henry@h14hdotcom·
@theo Using Pi because of how easily and deeply it can be customized. I love how simple it is to do granular context management like in my "effect mode" package: pi.dev/packages/effec… No GUI wrapper for pi works quite right with the commands I find essential like /tree
0 replies · 0 reposts · 2 likes · 326 views
henry
henry@h14hdotcom·
@CFNCotton711 @ArtificialAnlys Hence the "hardware-normalized" caveat. What I want is to understand the *relative* performance of open weight models compared to one another on equivalent hardware. E.g. pick a GPU on modal.com and run open models thru a benchmark on it with performance profiling.
0 replies · 0 reposts · 0 likes · 6 views
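A minimal sketch of what that hardware-normalized setup could look like, using Modal to pin a single GPU type and time generation identically for each model. The model IDs, prompt, and token count below are placeholders for illustration, not a measured benchmark:

```python
import time
import modal

# Pin one environment + one GPU type so results are comparable.
image = modal.Image.debian_slim().pip_install("torch", "transformers")
app = modal.App("open-model-bench")

@app.function(gpu="A100", image=image, timeout=1800)
def bench(model_id: str, n_tokens: int = 256) -> dict:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt").to("cuda")

    # Greedy decoding with a fixed budget, timed on the same hardware.
    start = time.monotonic()
    out = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.monotonic() - start

    generated = out.shape[-1] - inputs["input_ids"].shape[-1]
    return {"model": model_id, "tokens_per_sec": round(generated / elapsed, 1)}

@app.local_entrypoint()
def main():
    # Placeholder model IDs; swap in the open weights models under test.
    for model_id in ["org-a/model-a", "org-b/model-b"]:
        print(bench.remote(model_id))
```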
Chris Cotton
Chris Cotton@CFNCotton711·
@h14hdotcom @ArtificialAnlys The speed the tokens come out at really is nuanced. If you're talking about the API, it's just a balance between price and performance on their side. If you want faster you can use something like Cerebras; you can get 2,000 tokens a second. Locally it depends on your hardware.
1 reply · 0 reposts · 1 like · 51 views
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
OpenBMB, a Tsinghua University / ModelBest open weights collaboration, has released MiniCPM-V 4.6 1.3B Instruct, a tiny, non-reasoning model that scores 13 on the Artificial Analysis Intelligence Index.

This model sits 3 points ahead of Qwen3.5 0.8B (Non-reasoning, 10) and 2 points behind Qwen3.5 2B (Non-reasoning, 15) on the Intelligence Index, establishing a new Pareto-optimal point on our Intelligence vs. Total Parameters chart. Tiny models are useful for efficient inference and on-device use cases.

MiniCPM-V 4.6 1.3B Instruct is a vision-language model that supports text, image, and video input with text output. @OpenBMB is a China-based lab jointly founded in 2022 by Tsinghua University’s NLP Lab and ModelBest Inc. The model’s weights have been released under an Apache 2.0 license on Hugging Face.

Key results:
➤ At 1.3B parameters, MiniCPM-V 4.6 1.3B Instruct scores 13 on the Artificial Analysis Intelligence Index, the highest for any open weights model under 2B parameters. The next-most-intelligent open weights model at comparable scale, Qwen3.5 0.8B (Reasoning, 11), used 43x as many tokens to run the Intelligence Index; Qwen3.5 2B, which scores 16 (Reasoning) and 15 (Non-reasoning), requires 1.7x as many parameters (2.27B). MiniCPM-V 4.6 1.3B Instruct also tops sub-2B open weights models on MMMU-Pro, scoring 38%.
➤ MiniCPM-V 4.6 1.3B Instruct extends the open weights Pareto frontier for Intelligence vs. Total Parameters. Because the model is dense, total and active parameter counts are both 1.3B, so it pushes both frontiers. The next-most-intelligent sub-2B model (Qwen3.5 0.8B (Reasoning), 11) lands 2 points behind, despite also using a reasoning mode.
➤ MiniCPM-V 4.6 1.3B Instruct is highly token efficient: it used just 5.4M output tokens to run the Intelligence Index, ~19x fewer than Qwen3.5 0.8B (Non-reasoning, 101M) and ~43x fewer than Qwen3.5 0.8B (Reasoning, 233M). This is the lowest output token count measured for any open weights model under 4B total parameters scoring 10 or above on the Index (next-lowest is Ministral 3 3B at 15.5M).
➤ MiniCPM-V 4.6 1.3B Instruct supports native multimodal input, including text, image, and video, and scores 38% on MMMU-Pro. This is the highest visual reasoning score measured for any open weights model under 2B parameters, ahead of LFM2.5-VL-1.6B (27%) and Qwen3.5 0.8B (Non-reasoning, 26%). Video input at this parameter scale is uncommon.
➤ Knowledge recall is low, in line with other sub-2B models. AA-Omniscience is -85, in the typical range for sub-2B non-reasoning models (Qwen3.5 0.8B (Non-reasoning) at -89, Exaone 4.0 1.2B (Non-reasoning) at -83), and 2 points behind Qwen3.5 2B (Non-reasoning) at -83 (1.7x the parameter count).

Additional model details:
➤ Size: 1.3B total parameters (dense)
➤ Context window: 262K
➤ Precision: BF16
➤ License: Apache 2.0
➤ Providers: No confirmed providers on release
Artificial Analysis tweet media
10 replies · 24 reposts · 230 likes · 283.3K views
henry
henry@h14hdotcom·
@ArtificialAnlys Open models that can fit comfortably on 32-64GB RAM would also be lovely to see. Really curious how the local coding agent experience stacks up to the frontier/cloud experience.
0 replies · 0 reposts · 0 likes · 3 views
henry
henry@h14hdotcom·
@ArtificialAnlys Would love to see alt (& non-)reasoning levels for the frontier models. Lower reasoning often feels better, and I'm curious whether the measurements affirm this. Would also love to see models that are on the Pareto frontier for IQ/cost, like V4 Flash, M2.7, V2.5-Pro & Grok 4.3
1 reply · 0 reposts · 0 likes · 211 views
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform across 3 leading benchmarks, plus token usage, cost and more.

When developers use AI to code they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance. The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA: 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2: 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning. 5 tasks were filtered due to environment incompatibility
➤ SWE-Atlas-QnA: 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers

Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open weights result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: it scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models this was driven by high token usage, and in GPT-5.5’s case also by a relatively higher per-token cost.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: they range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model, given cached inputs are typically <50% of the API price of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. This is driven by differences in average turns per task, token usage and API serving speed. Opus 4.7 needed materially fewer turns per task than any other model, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open weights results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showing they have made substantial post-training gains.

This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
Artificial Analysis tweet media
123 replies · 168 reposts · 1.5K likes · 1.7M views
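The cache-pricing point is easy to make concrete. A back-of-the-envelope sketch; the prices, token split, and hit rates below are invented for illustration, since the thread doesn't publish per-combination splits:

```python
def cost_per_task(
    input_tokens: float,
    output_tokens: float,
    cache_hit_rate: float,
    input_price: float,    # $ per 1M uncached input tokens
    cached_price: float,   # $ per 1M cached input tokens
    output_price: float,   # $ per 1M output tokens
) -> float:
    """Blend cached and uncached input pricing, then add output cost."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (cached * cached_price + uncached * input_price + output_tokens * output_price) / 1e6

# Illustrative only: 3M input tokens/task, 100K output tokens/task,
# $3/M input, $0.30/M cached, $15/M output.
hi = cost_per_task(3e6, 100_000, 0.90, 3.0, 0.30, 15.0)
lo = cost_per_task(3e6, 100_000, 0.80, 3.0, 0.30, 15.0)
print(f"90% hits: ${hi:.2f}/task, 80% hits: ${lo:.2f}/task")
# -> 90% hits: $3.21/task, 80% hits: $4.02/task
```

Even a 10-point swing in cache hit rate moves the per-task cost by roughly 25% under these assumed prices, which is why harness prompt structure and provider routing matter to the economics.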
henry
henry@h14hdotcom·
@badlogicgames @vboykis It's an interesting correlation that a big focus of both the Opus 4.7 and GPT-5.5 releases was efficiency in the form of "IQ per token". It does make me wonder whether this signals a transition away from providing the "biggest and best" towards the most efficient "good enough".
0 replies · 0 reposts · 1 like · 58 views
Mario Zechner
Mario Zechner@badlogicgames·
@vboykis fwiw, the HN crowd's been surprisingly ai-skeptical. so not sure this is a good measure for a vibe shift. the big labs and AI adjacent corps are definitely not going in that direction with their marketing, e.g. on here.
3 replies · 0 reposts · 39 likes · 2.1K views
vicki
vicki@vboykis·
Legitimately feels like an unquantifiable vibe shift the last few weeks where the pendulum is swinging back to reasonable takes and people experimenting with model choice 🙏
vicki tweet media
17 replies · 21 reposts · 216 likes · 18K views
henry
henry@h14hdotcom·
Also: Take feedback gracefully. If you react to something like this defensively, you're putting too much energy into ego-protection and not enough into self-improvement.
Theo - t3.gg@theo

@josefbender_ Make better content.

0 replies · 0 reposts · 0 likes · 8 views
henry
henry@h14hdotcom·
@ArtificialAnlys I really want to see Pi.dev from @badlogicgames. IMO it's the best open-source agent harness, but objectively, it's got the least "secret sauce" of any viable daily-use harness. I think this makes it an excellent baseline comparison point for other harnesses.
1 reply · 0 reposts · 20 likes · 1.2K views
trash
trash@trashh_dev·
what’s the best linux distro to rice like i have never talked to the opposite sex
134 replies · 3 reposts · 528 likes · 48.3K views
kache
kache@yacineMTB·
is anyone vibecoding making actual cool stuff or is it still all mostly slop
799 replies · 53 reposts · 4.1K likes · 546.6K views
henry
henry@h14hdotcom·
FYI: Caveman mode is great for chat.
henry@h14hdotcom

@badlogicgames The only use-case I like it for is chat. Not to save tokens, just to get a consistent "personality" and less fluff to read. I find that Gemini 3.1 Pro is much better to talk to as a caveman than by default. Quick example attached.

0 replies · 0 reposts · 0 likes · 25 views
henry
henry@h14hdotcom·
@badlogicgames The only use-case I like it for is chat. Not to save tokens, just to get a consistent "personality" and less fluff to read. I find that Gemini 3.1 Pro is much better to talk to as a caveman than by default. Quick example attached.
henry tweet media
henry tweet media
0 replies · 0 reposts · 0 likes · 87 views
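For the curious, the general pattern behind a "caveman mode" is just a persona instruction pinned at the top of the conversation. A minimal sketch against an OpenAI-compatible chat API; the instruction text and model name are invented placeholders, and this is not pi's actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint + API key

# Invented instruction; pi's real caveman prompt is not reproduced here.
CAVEMAN = (
    "Speak like caveman. Short words. No fluff, no hedging, no lists "
    "unless asked. Answer first, detail only if asked."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": CAVEMAN},
        {"role": "user", "content": "Why is my Postgres query slow?"},
    ],
)
print(resp.choices[0].message.content)
```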
Mario Zechner
Mario Zechner@badlogicgames·
TIL about "caveman mode" to "save tokens". how many tokens in a session are actually model output? i think i'll become a gardener.
49 replies · 3 reposts · 273 likes · 32.1K views
henry
henry@h14hdotcom·
@badlogicgames We're probably not far from a new "Agentic Manifesto" that outlines what pragmatic software work looks like in this new era. Neither Waterfall nor Agile seem like the right patterns, but I'd bet elements of each will be carried forward in some capacity.
0 replies · 0 reposts · 0 likes · 14 views
henry
henry@h14hdotcom·
@badlogicgames Agile was a pragmatic response to Waterfall because it was realistic about human capabilities when working on complex projects. Now that human capabilities have increased dramatically, it's not surprising that older "unrealistic" working patterns are suddenly a lot more viable.
1 reply · 0 reposts · 0 likes · 151 views
Mario Zechner
Mario Zechner@badlogicgames·
it is absolutely crazy to me that our entire industry has succumbed to hyper waterfall. because that's what y'all are doing with your massive plans and beads and dark factories. have you learned nothing?
Matt Pocock@mattpocockuk

The more I replace plans with prototypes, the better the outputs

Who'd have thought that low fidelity prototypes were better than walls of spec

Oh yeah, the entire industry for 20 years

Stop going against decades of knowledge because someone in SF shipped it as a 'mode'

45 replies · 41 reposts · 894 likes · 102.2K views
henry
henry@h14hdotcom·
@cnakazawa They aren't docs tho. They're tuning instructions. LLMs are big balls of knowledge, and loading a skill at the top of a convo lets you deliberately prioritize a subset of that knowledge. Docs are for adding new knowledge, skills are for filtering out existing knowledge.
0 replies · 0 reposts · 0 likes · 51 views
Christoph Nakazawa
Christoph Nakazawa@cnakazawa·
I really don't get the hype about skills. They are just docs. Just write docs and ship them inside your packages.
67 replies · 15 reposts · 522 likes · 68K views
henry
henry@h14hdotcom·
@expo @convex Throw in @clerk and I genuinely struggle to think how building real, money-making cross-platform apps could get any easier.
0 replies · 0 reposts · 1 like · 32 views
DHH
DHH@dhh·
I've been driving GPT-5.5 on low reasoning for the last week+ and it's very good, very efficient. Haven't been tempted to reach for Opus at all. And it's more succinct than Kimi too. Huge leap forward for @OpenAI 👌
157 replies · 138 reposts · 4K likes · 275.2K views