Justin Lin (@jtlin) - Twitter Profili | Zamantika Mersobahis Locabet

Sabitlenmiş Tweet

Hot take: I can't see any startup building their critical core operations on Claude Managed Agents or any proprietary harness as investable. The past weeks have shown why it's critical to build on top of an open, neutral framework: ✅ Model diversity (cross-review / critique, less agent groupthink) ✅ Provider-agnostic (outages, random policy changes and suspensions) ✅ Local or fine-tuned LLMs for specialized tasks ✅ Private / E2EE cloud LLMs for tasks needing critical privacy Otherwise your startup will always be resting on an "unstable tectonic plate." All of your IP is in the harness and you truly need full control. And as OSS LLMs improve, you will have (and need) full control over the intelligence layer as well.

English

10

16

97

30.4K

Justin Lin@jtlin·3d

@TeksEdge And these API prices are down since launch! I did the math on how much serving Qwen 3.6 27B locally can save per year: x.com/jtlin/status/2…

Justin Lin@jtlin

Qwen 3.5 27B API prices are $0.325/M in, $3.25/M out. Same range as GLM-5, Kimi K2.5, Qwen 3.6 Plus. ➡️ Serving locally via 3090 / Mac is a no-brainer! 🧠 The math: 500M input tokens / mo 50M output tokens / mo = $3,900/year in API costs So you are easily paying back your hardware investment within one year. And of course your hardware will not go to zero value in a year (in fact it may be even worth more given the rate prices are rising). The above numbers are well within the potential local generation throughput: maybe 8 hours / day. Math looks even better if you are running tasks 24/7! There are electricity costs, but still a fraction of the token value. And either API providers are pricing based on model capability or it's an expensive model to serve (probably both).

English

0

245

David Hendrickson@TeksEdge·3d

Given the size of Qwen3.6-27B and the speeds at which many are reporting they can decode on home PC (MPT 40 - 120 tps) do these API prices seem a little high? (Kimi K2.6 and GLM-5.1 are $3.5/1M)

English

11

0

43

5.6K

Justin Lin@jtlin·4d

0) the ability to obtain its own electricity and internet access? Solar-powered agents on Starlink soon?

Aaron Wright@awrigh01

An autonomous agent will need four things to function as a real economic actor: 1) the ability to own (assets, accounts, credentials, intellectual property); 2) the ability to contract (to bind itself and be bound, in a form a counterparty can rely on); 3) the ability to litigate (to sue, be sued, and have judgments enforced); and 4) the ability to persist (to outlast any individual human's involvement, the way Apple outlasts Tim Cook).

English

0

1

43

Justin Lin@jtlin·4d

RTX PRO 6000 GPUs now $9,999 up from $8,699 yesterday. 🤯 Same exact price jump at Central Computer also, so it's not just Micro Center.

Loktar 🇺🇸@loktar00

Unfortunately knew it was just a matter of time the rtx 6000 pro just jumped at Microcenter from $8699 to $9999. $9999 is the highest they've ever had it listed at.

English

0

99

Justin Lin@jtlin·5d

vLLM 0.21.0 is out! Biggest updates for local AI folks IMO are related to increasing kv_cache for Qwen 3.6 / Gemma 4: ✅ turboquant kv_cache: try `--kv-cache-dtype turboquant4bit_nc` (or turboquant_k8v4) ✅ nvfp4 kv_cache: 4-bit with hw acceleration if you have Blackwell GPU `--kv-cache-dtype nvfp4` ✅ kv_cache offload to CPU RAM: esp. helpful for single-GPU owners, try `--kv-offloading-size 16` (These are not new vLLM features, but they didn't work for Mamba hybrid attention models like Qwen 3.6 / Gemma 4 until now) And be sure to turn on MTP speculative decoding! For Qwen 3.6: --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' github.com/vllm-project/v…

English

0

1

4

230

Justin Lin@jtlin·10 May

@Yuchenj_UW It seems like @NVIDIAAIDev has been contributing a lot in terms of useful open-weights models.

English

0

354

Yuchen Jin@Yuchenj_UW·10 May

It's weird that the US still doesn’t have a truly competitive open-source model lab. It’s clearly not a money problem. Several neolabs have raised billions. It’s not a compute problem. US labs have easier access to B200s/B300s than Chinese labs. So what is the issue?

English

210

27

791

133.5K

Justin Lin@jtlin·10 May

@xyster @LottoLabs Do you think 192gb is a good place to be? Enough to run 4-bit Minimax and DS4 Flash-class models with 200K+ context window? Worried that you really need 256gb to get the next level up over Qwen 3.6 27B.

English

0

1

72

Steve💙🇨🇦@xyster·10 May

@LottoLabs You can add a 5090 to a 6000 pro on a consumer desktop platform to get 128GB, or do two RTX Pros to get 192. It will be faster and easier. For 4 B70s, you need a workstation and ECC memory, which has become expensive. You then need to somehow fit 4 cards; likely risers needed.

English

1

0

4

226

Steve💙🇨🇦@xyster·10 May

Minimax M2.7 INT4, at 60-tok/sec, is starting to feel achievable with 4x Intel B70 cards. I'm approaching 40 already, which is quite usable. The full cost with an Epyc build was about $7300 USD, so just two B70s would be much cheaper. I'd probably get an RTX 6000 Pro over 4 B70s

English

10

0

42

4.3K

Justin Lin@jtlin·10 May

@SpaceTimeViking Awesome! What model do you recommend as minimum? Would Qwen 3.6 27B work? Something like this could also be really useful for managing Proxmox, but I'm just not sure I can trust it yet!

English

1

0

1

110

ÆON FORGE ✨@SpaceTimeViking·10 May

If you have a Unifi network, you can now give your Agent access through my newly published Unifi AI skill. Will work with both OpenClaw and Hermes. Take active action against threats, optimize network performance, analyze event logs, manage port forwards and advanced firewall configurations. Much more! Easy backup and restore, and easily revoke access if you are concerned of a compromised agent. There is an agents.md you can point your agent to to deploy this new skill. Use the interactive command to provide your API key. github.com/AEON-7/unifi-a…

English

6

31

2.5K

Justin Lin@jtlin·10 May

x.com/jtlin/status/2…

Justin Lin@jtlin

@TheAhmadOsman Thanks a lot for putting this together @TheAhmadOsman. Great meeting everyone!

ZXX

0

1

49

Justin Lin@jtlin·10 May

SF Local AI Meetup today was fantastic! Great meeting so many folks interested in local AI, GPUs, agents and more. Ever since early days in 2022/23 with 2x3090s running Llama 1 7B on first revs of llama.cpp, I've been looking forward to when private, local AI you own & run 24/7 could become truly useful. That day is now here, and sooner than I thought!

Ahmad@TheAhmadOsman

The first Local AI Get-Together was a massive success This pic is missing quite a few people who left before we hit the 4-hour mark, but thank you to everyone who stopped by 💙 Local AI is very real, very alive, and apparently willing to talk GPUs, open weights, inference engines, agents, and homelabs for hours We should do this again soon

English

1

0

2

188

Justin Lin@jtlin·10 May

@TheAhmadOsman Thanks a lot for putting this together @TheAhmadOsman. Great meeting everyone!

English

0

1

17

2.6K

Ahmad@TheAhmadOsman·10 May

The first Local AI Get-Together was a massive success This pic is missing quite a few people who left before we hit the 4-hour mark, but thank you to everyone who stopped by 💙 Local AI is very real, very alive, and apparently willing to talk GPUs, open weights, inference engines, agents, and homelabs for hours We should do this again soon

swyx🛬 SFO@swyx

this is a big deal, on the order of Kelsey Hightower’s “Kubernetes The Hard Way” and probably all ai engineers should go thru this once mostly i advocate “just in time learning”, but this is one scenario you want “just in case”

English

40

18

328

55.8K

Justin Lin@jtlin·8 May

@malikwas1f Agree, dual 3090FE at 230W barely get over 70C even when stacked right on top of each other!

English

0

1

24

noname@malikwas1f·8 May

@jtlin By all means, the test was done on llama.cpp with a single model which is a different setup all together than dual config on vllm. So the results would surely vary but prefills seem to be bound to hardware. Personally I like silent gpus. 250W i reckon will give better TTFT.

English

1

0

1

54

noname@malikwas1f·8 May

1/⚡ RTX 3090 power-cap deep dive on Qwen3.6 (27B dense + 35B-A3B MoE) using llama.cpp 🎯 Dense 27B sweet spot: → 290W cap → 78% of stock 370W TDP → only -7% TPS → best efficiency: 0.111 TPS/W 📊 21-cap sweep @ 10W intervals. Turns out stock TDP is not the efficiency optimum.

English

5

2

63

5K

Justin Lin@jtlin·8 May

@malikwas1f Thanks, going to try 250W for optimal prefill (agents need lots and lots of prefill!). Will let you know if I see any real world difference.

English

1

0

1

27

noname@malikwas1f·8 May

@jtlin Only if performance doesn’t feel like a bottleneck 230W. 230W is good when you are getting a decent TPS already.

English

1

0

1

69

Justin Lin@jtlin·8 May

Codex CLI can use any OpenAI-compatible endpoint and it feels like some local models (Qwen 3.6 27B & 35B, Gemma 4 26B & 31B) have gotten good enough that you could have Codex CLI /goal running 24/7 using as many tokens as your local inference stack can generate, perpetually. No more 5-hour / weekly limits. Going to try this soon.

English

0

193

andrew chen@andrewchen·7 May

Trying /goal for the first time on Codex and it’s obv it’s going to 10000x token use. It’s amazing though - I’ve had it working on a low level eGPU+Mac device driver project overnight (that I have no business doing) for the past 14 hours and it’s still chipping away making progress with each iteration Naturally unattended 24/7 LLM use will be several magnitudes more than me prompting actively over a normal work day

English

21

8

286

23.9K

Justin Lin@jtlin·2 May

@TheAhmadOsman I did the math on how much you can save with 2x3090 vs. paying for the model via API: 🤯 x.com/jtlin/status/2…

Justin Lin@jtlin

Qwen 3.5 27B API prices are $0.325/M in, $3.25/M out. Same range as GLM-5, Kimi K2.5, Qwen 3.6 Plus. ➡️ Serving locally via 3090 / Mac is a no-brainer! 🧠 The math: 500M input tokens / mo 50M output tokens / mo = $3,900/year in API costs So you are easily paying back your hardware investment within one year. And of course your hardware will not go to zero value in a year (in fact it may be even worth more given the rate prices are rising). The above numbers are well within the potential local generation throughput: maybe 8 hours / day. Math looks even better if you are running tasks 24/7! There are electricity costs, but still a fraction of the token value. And either API providers are pricing based on model capability or it's an expensive model to serve (probably both).

English

0

1

676

Ahmad@TheAhmadOsman·1 May

re: Anthropic, Dario, OpenAI, etc Don’t let them control your Intelligence Utilization It is a MUST that you learn how to run your LLMs locally on your own hardware 2x RTX 3090s and Qwen 3.6 27B is all you need to get started

English

63

32

643

39.5K

Justin Lin@jtlin·29 Nis

Wow, OpenAI really has reclaimed the moral high ground over you know who. Bravo!

Tibo@thsottiaux

@b_nnett Not affiliated with Codex. But we do love OSS and congrats. Keep it up and let me know when you hit 1k users and will send you something special!

English

0

190

Justin Lin@jtlin·29 Nis

Thanks for this. One important note is the autoround quant can also work on Ampere (2x3090), unlike the AEON nvfp4 quant. With autoround, I am getting about 100 tok/sec with MTP speculative decoding on 2x3090. I'm actually really curious about this #club3090 setup from @malikwas1f that can use turboquant 3-bit kv cache through a custom vLLM patch. This would 2x the potentially kv cache for multiple parallel requests. Would also work on Blackwell. I haven't tried it yet though. github.com/noonghunna/clu…

English

0

1

98

Yvette Carlisle@YvetteCipher·28 Nis

x.com/i/article/2049…

ZXX

3

4

21

4.8K

Justin Lin@jtlin·29 Nis

Qwen 3.5 27B API prices are $0.325/M in, $3.25/M out. Same range as GLM-5, Kimi K2.5, Qwen 3.6 Plus. ➡️ Serving locally via 3090 / Mac is a no-brainer! 🧠 The math: 500M input tokens / mo 50M output tokens / mo = $3,900/year in API costs So you are easily paying back your hardware investment within one year. And of course your hardware will not go to zero value in a year (in fact it may be even worth more given the rate prices are rising). The above numbers are well within the potential local generation throughput: maybe 8 hours / day. Math looks even better if you are running tasks 24/7! There are electricity costs, but still a fraction of the token value. And either API providers are pricing based on model capability or it's an expensive model to serve (probably both).

English

0

4

1.2K

Justin Lin@jtlin·28 Nis

Wow @NVIDIAAI Nemotron 3 Nano Omni may be the perfect utility model for local agent stacks! One 30B model for text, images, video, speech / audio, files / PDFs. It can handle screenshots & videos for computer use, UX design, QA. It's A3B MoE so will run fast on Mac or DGX Spark. Multiple models, cloud APIs, and/or a maze of pre-processing pipelines for every format were required before. Now for 25gb of unified memory you can have this running 24/7, fully private, alongside your main agentic reasoning model.

NVIDIA AI@NVIDIAAI

Nemotron 3 Nano Omni was designed for powering subagents. Instead of stitching together separate models for language, vision, and speech, it ties them into a single architecture that more efficiently feeds context to orchestrators.

English

0

1

175

Justin Lin

Keşfet