Jeremy Mack

1.1K posts

@mutewinter

Building something with a better name than @quests_dev.

Joined July 2008
337 Following · 1.3K Followers
Theo - t3.gg@theo·
Since OpenAI dropped gpt-oss-120b, Mistral has released 4 models that are worse than gpt-oss-120b
Artificial Analysis@ArtificialAnlys

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index.

@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement over Small 3.2 (15), and is now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).

Key takeaways:

➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning, rather than the separate reasoning variants Mistral has previously released with its Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, it scores 19, a 4-point improvement over its predecessor Mistral Small 3.2 (15).

➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index than reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens.

➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B supports image input. All models support text output only.

➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset, which tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130).

➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. It scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42).

Key model details:

➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.60 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license
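The self-hosting note in the post is simple arithmetic: FP8 stores one byte per parameter, so the weight footprint is roughly the parameter count in bytes (ignoring KV cache and activation memory). A quick back-of-envelope check:

```python
# Weight-memory estimate for a 119B-parameter model at native FP8
# (8 bits = 1 byte per parameter), per the availability note above.
params = 119e9        # total parameters
bytes_per_param = 1   # FP8
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB")  # ~119 GB, more than one H100's 80 GB of HBM3
```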

Jeremy Mack@mutewinter·
E.g. “AI SDK v6.102 dropped, it fixed an issue you had a workaround for in llm-request.ts”
Jeremy Mack@mutewinter·
Has anyone built an agent that watches dependencies and suggests/PRs value-added changes?
Jeremy Mack@mutewinter·
Cursor + pnpm user? Add this to your agent context:

## WARNING: PNPM Install in Sandbox Terminals

**NEVER run `pnpm install` (or any `pnpm add`/`pnpm remove`) from within a sandbox terminal (i.e. Shell tool calls with sandboxing enabled).** Doing so causes a divergence between the sandbox's PNPM store and the main PNPM store, leaving the lockfile and node_modules in an inconsistent state.

If a package installation is needed, either:

1. Ask the user to run the install command themselves, OR
2. Run the command using `required_permissions: ["all"]` to disable sandboxing
Jeremy Mack@mutewinter·
@richiemcilroy $10 of tokens to say “this kiss brought to you by jeep wrangler, for all your wrangling needs”
Richie - oss/acc@richiemcilroy·
weddings are expensive if anyone wants me to put ads in my vows I'll take bids in the comments
Jeremy Mack@mutewinter·
@thdxr lgtm in agent PRs → microplastics in the ocean
dax@thdxr·
you know how everything is made out of plastic and feels like crap but still technically works? we've been headed this way with software for a while, but at least we used to be embarrassed by it. now people are proud of how much they don't care
Rhys@RhysSullivan·
aesthetic is an ugly spelling for what it's meant to be describing
Jeremy Mack@mutewinter·
@yetone I’ve been calling it productivity porn, but “agent porn” is better. It’s so easy to make, since you can just claim improvements, measure nothing, and share
Jeremy Mack@mutewinter·
@mattapperson I was just reflecting on this too after building a script to turn chats into markdown files to debug outcomes with a second agent. Harnesses all the way down
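The tweet above mentions a script that turns chats into markdown files for a second agent to debug, but doesn't share it. A minimal sketch of the idea, assuming a simple list-of-messages JSON transcript (the function name and message shape are hypothetical):

```python
import json

def chat_to_markdown(chat_json: str) -> str:
    """Render a chat transcript (a JSON list of {role, content} dicts) as
    markdown, so a second agent can read the session and debug its outcome."""
    messages = json.loads(chat_json)
    sections = []
    for msg in messages:
        # One heading per turn, followed by the turn's content.
        sections.append(f"## {msg['role']}\n\n{msg['content']}\n")
    return "\n".join(sections)

demo = json.dumps([
    {"role": "user", "content": "Fix the failing test"},
    {"role": "assistant", "content": "Patched the request handler"},
])
print(chat_to_markdown(demo))
```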
Matt Apperson@mattapperson·
What will be the core role of a “software engineer” in the future? RL Environment building. Creating the evaluations and rewards for improving models and agent harness alike.
Jeremy Mack@mutewinter·
A fast and high density front end for NPM just dropped. I’m smitten
[attached image]
Jeremy Mack@mutewinter·
@steveruizok Fixed tile precision or variable? Did something like this in canvas for a raster render and needed to solve it at various zoom levels for full perf unlock
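Neither tweet in this thread shares code; a minimal sketch of the dirty-tile approach being discussed, assuming a fixed tile size at a single zoom level (the `TileGrid` class and its methods are hypothetical illustrations, not from either project):

```python
class TileGrid:
    """Track which fixed-size tiles of a canvas need re-rendering.
    Each frame, only tiles marked dirty are redrawn; clean tiles
    reuse their cached raster, which is where the speedup comes from."""

    def __init__(self, width: int, height: int, tile: int = 256):
        self.tile = tile
        self.cols = -(-width // tile)   # ceil division
        self.rows = -(-height // tile)
        self.dirty: set[tuple[int, int]] = set()

    def invalidate(self, x: int, y: int, w: int, h: int) -> None:
        """Mark every tile overlapping the rect (x, y, w, h) as dirty."""
        for cx in range(x // self.tile, (x + w - 1) // self.tile + 1):
            for cy in range(y // self.tile, (y + h - 1) // self.tile + 1):
                self.dirty.add((cx, cy))

    def flush(self) -> set[tuple[int, int]]:
        """Return the tiles to redraw this frame and mark them clean."""
        todo, self.dirty = self.dirty, set()
        return todo

grid = TileGrid(1024, 768)
grid.invalidate(250, 10, 20, 20)   # a small edit straddling two tiles in x
print(sorted(grid.flush()))        # → [(0, 0), (1, 0)]
```

Supporting variable zoom, as the reply asks about, typically means keeping one such grid per zoom level (a tile pyramid) rather than a single fixed-precision grid.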
Steve Ruiz@steveruizok·
we're rendering with dirty tiles at 600fps
Jeremy Mack@mutewinter·
@thdxr I'm sure a common reaction to this is going to be "skill issue!" Not that simple when you've got a stochastic squad of idiot geniuses (agents) touching hundreds of files an hour