Jeremy Mack

1.1K posts

@mutewinter

Building something with a better name than @quests_dev.

Joined July 2008
337 Following · 1.3K Followers
Theo - t3.gg@theo·
Since OpenAI dropped gpt-oss-120b, Mistral has released 4 models that are worse than gpt-oss-120b
Artificial Analysis@ArtificialAnlys

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index.

@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement over Small 3.2 (15), and is now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).

Key takeaways:

➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning, rather than the separate reasoning variants Mistral has previously released with its Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, it scores 19, a 4-point improvement over its predecessor Mistral Small 3.2 (15).

➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index than reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens.

➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B supports image input. All models support text output only.

➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset, which tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130).

➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. It scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42).

Key model details:

➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.60 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license
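The self-hosting note in the post is simple arithmetic: FP8 stores one byte per parameter, so the weight footprint is roughly the parameter count in bytes (ignoring KV cache and activation memory). A quick back-of-envelope check:

```python
# Weight-memory estimate for a 119B-parameter model at native FP8
# (8 bits = 1 byte per parameter), per the availability note above.
params = 119e9        # total parameters
bytes_per_param = 1   # FP8
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB")  # ~119 GB, more than one H100's 80 GB of HBM3
```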

Jeremy Mack@mutewinter·
E.g. “AI SDK v6.102 dropped, it fixed an issue you had a workaround for in llm-request.ts”
Jeremy Mack@mutewinter·
Has anyone built an agent that watches dependencies and suggests/PRs value-added changes?
Jeremy Mack@mutewinter·
Cursor + pnpm user? Add this to your agent context:

## WARNING: PNPM Install in Sandbox Terminals

**NEVER run `pnpm install` (or any `pnpm add`/`pnpm remove`) from within a sandbox terminal (i.e. Shell tool calls with sandboxing enabled).** Doing so causes a divergence between the sandbox's PNPM store and the main PNPM store, leaving the lockfile and node_modules in an inconsistent state.

If a package installation is needed, either:

1. Ask the user to run the install command themselves, OR
2. Run the command using `required_permissions: ["all"]` to disable sandboxing
Jeremy Mack@mutewinter·
@richiemcilroy $10 of tokens to say “this kiss brought to you by jeep wrangler, for all your wrangling needs”
Richie - oss/acc@richiemcilroy·
weddings are expensive if anyone wants me to put ads in my vows I'll take bids in the comments
Jeremy Mack@mutewinter·
@thdxr lgtm in agent PRs → microplastics in the ocean
dax@thdxr·
you know how everything is made out of plastic and feels like crap but still technically works? we've been headed this way with software for a while, but at least we used to be embarrassed by it. now people are proud of how much they don't care
Rhys@RhysSullivan·
aesthetic is an ugly spelling for what it's meant to be describing
Jeremy Mack@mutewinter·
@yetone I’ve been calling it productivity porn, but “agent porn” is better. It’s so easy to make, since you can just claim improvements, measure nothing, and share
Jeremy Mack@mutewinter·
@mattapperson I was just reflecting on this too after building a script to turn chats into markdown files to debug outcomes with a second agent. Harnesses all the way down
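The tweet above mentions a script that turns chats into markdown files for a second agent to debug, but doesn't share it. A minimal sketch of the idea, assuming a simple list-of-messages JSON transcript (the function name and message shape are hypothetical):

```python
import json

def chat_to_markdown(chat_json: str) -> str:
    """Render a chat transcript (a JSON list of {role, content} dicts) as
    markdown, so a second agent can read the session and debug its outcome."""
    messages = json.loads(chat_json)
    sections = []
    for msg in messages:
        # One heading per turn, followed by the turn's content.
        sections.append(f"## {msg['role']}\n\n{msg['content']}\n")
    return "\n".join(sections)

demo = json.dumps([
    {"role": "user", "content": "Fix the failing test"},
    {"role": "assistant", "content": "Patched the request handler"},
])
print(chat_to_markdown(demo))
```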
Matt Apperson@mattapperson·
What will be the core role of a “software engineer” in the future? RL Environment building. Creating the evaluations and rewards for improving models and agent harness alike.
Jeremy Mack@mutewinter·
A fast and high density front end for NPM just dropped. I’m smitten
[attached image]
Jeremy Mack@mutewinter·
@steveruizok Fixed tile precision or variable? Did something like this in canvas for a raster render and needed to solve it at various zoom levels for full perf unlock
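Neither tweet in this thread shares code; a minimal sketch of the dirty-tile approach being discussed, assuming a fixed tile size at a single zoom level (the `TileGrid` class and its methods are hypothetical illustrations, not from either project):

```python
class TileGrid:
    """Track which fixed-size tiles of a canvas need re-rendering.
    Each frame, only tiles marked dirty are redrawn; clean tiles
    reuse their cached raster, which is where the speedup comes from."""

    def __init__(self, width: int, height: int, tile: int = 256):
        self.tile = tile
        self.cols = -(-width // tile)   # ceil division
        self.rows = -(-height // tile)
        self.dirty: set[tuple[int, int]] = set()

    def invalidate(self, x: int, y: int, w: int, h: int) -> None:
        """Mark every tile overlapping the rect (x, y, w, h) as dirty."""
        for cx in range(x // self.tile, (x + w - 1) // self.tile + 1):
            for cy in range(y // self.tile, (y + h - 1) // self.tile + 1):
                self.dirty.add((cx, cy))

    def flush(self) -> set[tuple[int, int]]:
        """Return the tiles to redraw this frame and mark them clean."""
        todo, self.dirty = self.dirty, set()
        return todo

grid = TileGrid(1024, 768)
grid.invalidate(250, 10, 20, 20)   # a small edit straddling two tiles in x
print(sorted(grid.flush()))        # → [(0, 0), (1, 0)]
```

Supporting variable zoom, as the reply asks about, typically means keeping one such grid per zoom level (a tile pyramid) rather than a single fixed-precision grid.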
Steve Ruiz@steveruizok·
we're rendering with dirty tiles at 600fps
Jeremy Mack@mutewinter·
@thdxr I'm sure a common reaction to this is going to be "skill issue!" Not that simple when you've got a stochastic squad of idiot geniuses (agents) touching hundreds of files an hour