thestreamingdev() (@thestreamingdev) - Twitter-Profil

Angehefteter Tweet

I ran a 35-billion parameter AI agent on a $600 Mac mini. Specs: M4 Mac-Mini 16GB RAM The model doesn't fit in RAM. It pages from the SSD at 30 tokens/second. On NVIDIA, the same paging gives you 1.6 tok/s. Apple Silicon gives you 30. That's 18.6x faster. No cloud. No API keys. $0/month. Here's what it can do 🧵

English

169

213

3.2K

691.5K

thestreamingdev()@thestreamingdev·2h

@tbpn @karrisaarinen whatever you do don't show @karrisaarinen my TGI where I turned Trello into the ultimate agent orchestrator pit for agents ! github.com/walter-grace/T…

English

0

3

TBPN@tbpn·21h

Linear CEO @karrisaarinen says throwing out SaaS entirely can send companies on a "long journey" back to the exact same workflows: "This idea of like, 'let's throw everything out, no SaaS at all'... I think that's an option. But then you start inventing stuff back from first principles, and you maybe end up in the same spot again." "Everyone's running their agent, and now they have 10 agents running. So then they put them on a Kanban board, which Linear has, and then it's like, 'Oh, now I invented agent orchestration.' Like, no, you invented the Kanban board, which has been around for like 30 years." "So, kudos to connecting those two topics. But some things don't need to be reinvented. Some things can still work. But if you just throw everything out, you kind of start from the beginning, and then I think it's a long journey to figure everything out again."

English

0

6

167

34.5K

thestreamingdev()@thestreamingdev·2h

We love to see it

Jean-Marc | asiai@jmn67

Same model on M4 Pro 64GB : LM Studio MLX: 73.4 tok/s oMLX: 66.0 tok/s Ollama llama.cpp: 47.6 tok/s The engine matters as much as the hardware. MLX vs llama.cpp is a +44% difference on the same chip. Measured with asiai (open source bench tool): asiai.dev

English

0

1

3

455

thestreamingdev()@thestreamingdev·2h

Actually, it is possible! I built a custom MLX engine that streams FFN weights from SSD instead of loading the full model into RAM. The 27B (16.1 GB, full 4-bit quality) runs on any 16 GB Mac, only 5.5 GB stays in memory. Measured 0.18 tok/s on our M4. Slow, but coherent output, no compression artifacts. Code: github.com/walter-grace/m… That said for actual daily use, I'd run Qwen3.5-35B-A3B GUFF at IQ2_M (10.6 GB) through llama.cpp. It fits entirely in your 16 GB, runs at 30 tok/s on M4 (will be a bit less on M2), and is a better model than the 27B. Web search, shell commands, reasoning all working. That agent is in the main repo! Let me know if you get it going. I have a claude file in there too just ask claude to help set it up

English

0

8

Lean Kin Prak@LeanKinPrazli·5h

@0xSero Nice! I wanted to run the qwen 27B on my M2 16GB but failed. That's not possible, right? I mean with compression etc.

English

2

0

5

1.9K

0xSero@0xSero·5h

Qwen3.5-35B compressed 20% with 1%~ performance drop on average. Now you can fit this (4bits) with full context on 24GB of VRAM 700$~ or 1x 3090 huggingface.co/0xSero/Qwen-3.…

English

62

59

1.1K

43.7K

thestreamingdev()@thestreamingdev·3h

@dwr @zachterrell57 Lite parse works amazingly

English

0

1

Dan Romero@dwr·23h

@zachterrell57 How do you convert? And preserve complex layouts like tables and other graphics with data?

English

4

0

578

Zach@zachterrell57·23h

@dwr md

1

0

2

622

thestreamingdev()@thestreamingdev·19h

@ricknoblett

GIF

QME

0

157

Rick@ricknoblett·21h

@thestreamingdev Can’t wait for the M5 Studio!!

English

1

0

1

173

thestreamingdev()@thestreamingdev·2d

I ran a 35-billion parameter AI agent on a $600 Mac mini. Specs: M4 Mac-Mini 16GB RAM The model doesn't fit in RAM. It pages from the SSD at 30 tokens/second. On NVIDIA, the same paging gives you 1.6 tok/s. Apple Silicon gives you 30. That's 18.6x faster. No cloud. No API keys. $0/month. Here's what it can do 🧵

English

169

213

3.2K

691.5K

thestreamingdev()@thestreamingdev·19h

@TheCraigHewitt 🫡 thanks!

English

0

278

Craig Hewitt@TheCraigHewitt·20h

@thestreamingdev Rad. Following

English

1

0

1

304

thestreamingdev()@thestreamingdev·22h

@hexcrafter_eth @GoogleResearch claude code but only for macs. runs on the hardware of your device

English

0

7

hexcrafter.eth@hexcrafter_eth·22h

@thestreamingdev @GoogleResearch Whats “mac code”

English

1

0

12

Google Research@GoogleResearch·3d

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

GIF

English

927

5.5K

37.7K

18.2M

thestreamingdev()@thestreamingdev·1d

@Underminer1979 1000%

0

1

910

Acoustically Regarded@Underminer1979·1d

@thestreamingdev I have a 12 GB RTX 3080 and 64 GB of ram, can I do that too?

English

1

0

973

thestreamingdev()@thestreamingdev·1d

@hu_yifei Working on getting 27B onto the Mac Mini got 35B-A3B already working!

thestreamingdev()@thestreamingdev

I ran a 35-billion parameter AI agent on a $600 Mac mini. Specs: M4 Mac-Mini 16GB RAM The model doesn't fit in RAM. It pages from the SSD at 30 tokens/second. On NVIDIA, the same paging gives you 1.6 tok/s. Apple Silicon gives you 30. That's 18.6x faster. No cloud. No API keys. $0/month. Here's what it can do 🧵

English

0

30

Yifei Hu@hu_yifei·1d

Qwen3.5 27B feels more solid than 35B-A3B, because a dense model is more solid than a sparse model. (English is not my first language, but I really tried here)

English

27

5

379

22.7K

thestreamingdev()@thestreamingdev·1d

@WinstonLBrown1 Do it!!!

English

0

560

WinstonLBrown@WinstonLBrown1·1d

@thestreamingdev Stealing bits of this, thanks!

English

1

0

1

595

thestreamingdev()@thestreamingdev·1d

@karpathy This is where SSO can improve dramatically

English

0

72

Andrej Karpathy@karpathy·1d

When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc... I am really looking forward to a day where I could simply tell my agent: "build menugen" (referencing the post) and it would just work. The whole thing up to the deployed web page. The agent would have to browse a number of services, read the docs, get all the api keys, make everything work, debug it in dev, and deploy to prod. This is the actually hard part, not the code itself. Or rather, the better way to think about it is that the entire DevOps lifecycle has to become code, in addition to the necessary sensors/actuators of the CLIs/APIs with agent-native ergonomics. And there should be no need to visit web pages, click buttons, or anything like that for the human. It's easy to state, it's now just barely technically possible and expected to work maybe, but it definitely requires from-scratch re-design, work and thought. Very exciting direction!

Patrick Collison@patrickc

When @karpathy built MenuGen (karpathy.bearblog.dev/vibe-coding-me…), he said: "Vibe coding menugen was exhilarating and fun escapade as a local demo, but a bit of a painful slog as a deployed, real app. Building a modern app is a bit like assembling IKEA future. There are all these services, docs, API keys, configurations, dev/prod deployments, team and security features, rate limits, pricing tiers." We've all run into this issue when building with agents: you have to scurry off to establish accounts, clicking things in the browser as though it's the antediluvian days of 2023, in order to unblock its superintelligent progress. So we decided to build Stripe Projects to help agents instantly provision services from the CLI. For example, simply run: $ stripe projects add posthog/analytics And it'll create a PostHog account, get an API key, and (as needed) set up billing. Projects is launching today as a developer preview. You can register for access (we'll make it available to everyone soon) at projects.dev. We're also rolling out support for many new providers over the coming weeks. (Get in touch if you'd like to make your service available.) projects.dev

English

477

429

5.1K

1.7M

thestreamingdev()@thestreamingdev·1d

@MarcusShepher20 no sir

Español

0

763

Marcus Shepherd@MarcusShepher20·1d

@thestreamingdev Is this a fork of pi

English

1

0

817

thestreamingdev()@thestreamingdev·1d

@MarioClawAI “Hey Claude make this run with my openclaw” it just needs to run the model locally

English

0

2

2.6K

MarioClawAI | AI News@MarioClawAI·1d

@thestreamingdev How to know if the model can run with openclaw on your system?

English

1

0

1

3K

thestreamingdev()@thestreamingdev·1d

@dzienko As long as the RAM is there it should work just as well. I’m guessing this is why Apple launched their new laptops

English

0

1

297

Kamil Dzieniszewski@dzienko·1d

@thestreamingdev What about MacBooks?

English

1

0

322

thestreamingdev()@thestreamingdev·1d

@suvodeepmishra1

GIF

QME

0

1

1.3K

Suvodeep | Developer@suvodeepmishra1·1d

@thestreamingdev no cloud no API just vibes

Eesti

1

0

1

1.4K

thestreamingdev()@thestreamingdev·1d

@gubatron @PhillyGunson Probably way faster 😅

English

0

114

GUBA@gubatron·1d

@thestreamingdev @PhillyGunson wonder how it'd go with a 5090

Hapeville, GA 🇺🇸 English

1

0

120

thestreamingdev() retweetet

Morgan@morganlinton·1d

The next issue of my newsletter is out. 🎩 tip to @GoogleResearch, @saranormous, @karpathy, @thestreamingdev, @sawyerhood, @aidenybai, @LLMJunky

English

4

3

19

2.1K

thestreamingdev()@thestreamingdev·1d

@realphilipdias Nope but that’s breakthrough! We got 35B on it!

English

0

4

1.3K

Philip Dias@realphilipdias·1d

@thestreamingdev 16gb mac mini is not ideal for anything over 7b

English

1

0

1

1.4K

thestreamingdev()@thestreamingdev·1d

@Alibaba_Qwen We love @Alibaba_Qwen !

thestreamingdev()@thestreamingdev

I ran a 35-billion parameter AI agent on a $600 Mac mini. Specs: M4 Mac-Mini 16GB RAM The model doesn't fit in RAM. It pages from the SSD at 30 tokens/second. On NVIDIA, the same paging gives you 1.6 tok/s. Apple Silicon gives you 30. That's 18.6x faster. No cloud. No API keys. $0/month. Here's what it can do 🧵

English

0

33

Qwen@Alibaba_Qwen·1d

Big thanks to Steve for testing the entire Qwen3.5 family. Community feedback like this helps us get better. 🙏

stevibe@stevibe

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too. Only two models went all green: the 27B dense and the distilled 27B. The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two. The timed-out results — mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit. The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output. Small models hallucinate data. Big models ignore data. The 27B just threaded it through.

English

41

62

1.5K

129.8K

thestreamingdev()@thestreamingdev·1d

@morganlinton thanks! SSD paging result is genuinely surprising to people, conventional wisdom says paging = unusable. The magic of @Apple Silicon breaks that assumption because there's no PCIe bus between the GPU and SSD

English

0

2

13

5.8K

Morgan@morganlinton·1d

@thestreamingdev Whoa, sounds impossible, but clearly it’s very possible since you are actually showing it, wild!!

English

1

0

2

6.5K

thestreamingdev()

Entdecken