Justin Lin

1.3K posts

Justin Lin banner
Justin Lin

Justin Lin

@jtlin

serial startup founder • early adopter • open source AI, local LLMs, personified autonomous agents

SF Bay Area Katılım Ekim 2008
420 Takip Edilen790 Takipçiler
Sabitlenmiş Tweet
Justin Lin
Justin Lin@jtlin·
Hot take: I can't see any startup building their critical core operations on Claude Managed Agents or any proprietary harness as investable. The past weeks have shown why it's critical to build on top of an open, neutral framework: ✅ Model diversity (cross-review / critique, less agent groupthink) ✅ Provider-agnostic (outages, random policy changes and suspensions) ✅ Local or fine-tuned LLMs for specialized tasks ✅ Private / E2EE cloud LLMs for tasks needing critical privacy Otherwise your startup will always be resting on an "unstable tectonic plate." All of your IP is in the harness and you truly need full control. And as OSS LLMs improve, you will have (and need) full control over the intelligence layer as well.
English
10
16
97
30.4K
David Hendrickson
David Hendrickson@TeksEdge·
Given the size of Qwen3.6-27B and the speeds at which many are reporting they can decode on home PC (MPT 40 - 120 tps) do these API prices seem a little high? (Kimi K2.6 and GLM-5.1 are $3.5/1M)
David Hendrickson tweet media
English
11
0
43
5.6K
Justin Lin
Justin Lin@jtlin·
vLLM 0.21.0 is out! Biggest updates for local AI folks IMO are related to increasing kv_cache for Qwen 3.6 / Gemma 4: ✅ turboquant kv_cache: try `--kv-cache-dtype turboquant4bit_nc` (or turboquant_k8v4) ✅ nvfp4 kv_cache: 4-bit with hw acceleration if you have Blackwell GPU `--kv-cache-dtype nvfp4` ✅ kv_cache offload to CPU RAM: esp. helpful for single-GPU owners, try `--kv-offloading-size 16` (These are not new vLLM features, but they didn't work for Mamba hybrid attention models like Qwen 3.6 / Gemma 4 until now) And be sure to turn on MTP speculative decoding! For Qwen 3.6: --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' github.com/vllm-project/v…
English
0
1
4
230
Yuchen Jin
Yuchen Jin@Yuchenj_UW·
It's weird that the US still doesn’t have a truly competitive open-source model lab. It’s clearly not a money problem. Several neolabs have raised billions. It’s not a compute problem. US labs have easier access to B200s/B300s than Chinese labs. So what is the issue?
English
210
27
791
133.5K
Justin Lin
Justin Lin@jtlin·
@xyster @LottoLabs Do you think 192gb is a good place to be? Enough to run 4-bit Minimax and DS4 Flash-class models with 200K+ context window? Worried that you really need 256gb to get the next level up over Qwen 3.6 27B.
English
0
0
1
72
Steve💙🇨🇦
Steve💙🇨🇦@xyster·
@LottoLabs You can add a 5090 to a 6000 pro on a consumer desktop platform to get 128GB, or do two RTX Pros to get 192. It will be faster and easier. For 4 B70s, you need a workstation and ECC memory, which has become expensive. You then need to somehow fit 4 cards; likely risers needed.
English
1
0
4
226
Steve💙🇨🇦
Steve💙🇨🇦@xyster·
Minimax M2.7 INT4, at 60-tok/sec, is starting to feel achievable with 4x Intel B70 cards. I'm approaching 40 already, which is quite usable. The full cost with an Epyc build was about $7300 USD, so just two B70s would be much cheaper. I'd probably get an RTX 6000 Pro over 4 B70s
English
10
0
42
4.3K
Justin Lin
Justin Lin@jtlin·
@SpaceTimeViking Awesome! What model do you recommend as minimum? Would Qwen 3.6 27B work? Something like this could also be really useful for managing Proxmox, but I'm just not sure I can trust it yet!
English
1
0
1
110
ÆON FORGE ✨
ÆON FORGE ✨@SpaceTimeViking·
If you have a Unifi network, you can now give your Agent access through my newly published Unifi AI skill. Will work with both OpenClaw and Hermes. Take active action against threats, optimize network performance, analyze event logs, manage port forwards and advanced firewall configurations. Much more! Easy backup and restore, and easily revoke access if you are concerned of a compromised agent. There is an agents.md you can point your agent to to deploy this new skill. Use the interactive command to provide your API key. github.com/AEON-7/unifi-a…
English
6
6
31
2.5K
Justin Lin
Justin Lin@jtlin·
SF Local AI Meetup today was fantastic! Great meeting so many folks interested in local AI, GPUs, agents and more. Ever since early days in 2022/23 with 2x3090s running Llama 1 7B on first revs of llama.cpp, I've been looking forward to when private, local AI you own & run 24/7 could become truly useful. That day is now here, and sooner than I thought!
Ahmad@TheAhmadOsman

The first Local AI Get-Together was a massive success This pic is missing quite a few people who left before we hit the 4-hour mark, but thank you to everyone who stopped by 💙 Local AI is very real, very alive, and apparently willing to talk GPUs, open weights, inference engines, agents, and homelabs for hours We should do this again soon

English
1
0
2
188
Ahmad
Ahmad@TheAhmadOsman·
The first Local AI Get-Together was a massive success This pic is missing quite a few people who left before we hit the 4-hour mark, but thank you to everyone who stopped by 💙 Local AI is very real, very alive, and apparently willing to talk GPUs, open weights, inference engines, agents, and homelabs for hours We should do this again soon
Ahmad tweet media
swyx🛬 SFO@swyx

this is a big deal, on the order of Kelsey Hightower’s “Kubernetes The Hard Way” and probably all ai engineers should go thru this once mostly i advocate “just in time learning”, but this is one scenario you want “just in case”

English
40
18
328
55.8K
Justin Lin
Justin Lin@jtlin·
@malikwas1f Agree, dual 3090FE at 230W barely get over 70C even when stacked right on top of each other!
English
0
0
1
24
noname
noname@malikwas1f·
@jtlin By all means, the test was done on llama.cpp with a single model which is a different setup all together than dual config on vllm. So the results would surely vary but prefills seem to be bound to hardware. Personally I like silent gpus. 250W i reckon will give better TTFT.
English
1
0
1
54
noname
noname@malikwas1f·
1/⚡ RTX 3090 power-cap deep dive on Qwen3.6 (27B dense + 35B-A3B MoE) using llama.cpp 🎯 Dense 27B sweet spot: → 290W cap → 78% of stock 370W TDP → only -7% TPS → best efficiency: 0.111 TPS/W 📊 21-cap sweep @ 10W intervals. Turns out stock TDP is not the efficiency optimum.
noname tweet media
English
5
2
63
5K
Justin Lin
Justin Lin@jtlin·
@malikwas1f Thanks, going to try 250W for optimal prefill (agents need lots and lots of prefill!). Will let you know if I see any real world difference.
English
1
0
1
27
noname
noname@malikwas1f·
@jtlin Only if performance doesn’t feel like a bottleneck 230W. 230W is good when you are getting a decent TPS already.
English
1
0
1
69
Justin Lin
Justin Lin@jtlin·
Codex CLI can use any OpenAI-compatible endpoint and it feels like some local models (Qwen 3.6 27B & 35B, Gemma 4 26B & 31B) have gotten good enough that you could have Codex CLI /goal running 24/7 using as many tokens as your local inference stack can generate, perpetually. No more 5-hour / weekly limits. Going to try this soon.
English
0
0
0
193
andrew chen
andrew chen@andrewchen·
Trying /goal for the first time on Codex and it’s obv it’s going to 10000x token use. It’s amazing though - I’ve had it working on a low level eGPU+Mac device driver project overnight (that I have no business doing) for the past 14 hours and it’s still chipping away making progress with each iteration Naturally unattended 24/7 LLM use will be several magnitudes more than me prompting actively over a normal work day
English
21
8
286
23.9K
Ahmad
Ahmad@TheAhmadOsman·
re: Anthropic, Dario, OpenAI, etc Don’t let them control your Intelligence Utilization It is a MUST that you learn how to run your LLMs locally on your own hardware 2x RTX 3090s and Qwen 3.6 27B is all you need to get started
English
63
32
643
39.5K
Justin Lin
Justin Lin@jtlin·
Wow, OpenAI really has reclaimed the moral high ground over you know who. Bravo!
Tibo@thsottiaux

@b_nnett Not affiliated with Codex. But we do love OSS and congrats. Keep it up and let me know when you hit 1k users and will send you something special!

English
0
0
0
190
Justin Lin
Justin Lin@jtlin·
Thanks for this. One important note is the autoround quant can also work on Ampere (2x3090), unlike the AEON nvfp4 quant. With autoround, I am getting about 100 tok/sec with MTP speculative decoding on 2x3090. I'm actually really curious about this #club3090 setup from @malikwas1f that can use turboquant 3-bit kv cache through a custom vLLM patch. This would 2x the potentially kv cache for multiple parallel requests. Would also work on Blackwell. I haven't tried it yet though. github.com/noonghunna/clu…
English
0
1
1
98
Justin Lin
Justin Lin@jtlin·
Qwen 3.5 27B API prices are $0.325/M in, $3.25/M out. Same range as GLM-5, Kimi K2.5, Qwen 3.6 Plus. ➡️ Serving locally via 3090 / Mac is a no-brainer! 🧠 The math: 500M input tokens / mo 50M output tokens / mo = $3,900/year in API costs So you are easily paying back your hardware investment within one year. And of course your hardware will not go to zero value in a year (in fact it may be even worth more given the rate prices are rising). The above numbers are well within the potential local generation throughput: maybe 8 hours / day. Math looks even better if you are running tasks 24/7! There are electricity costs, but still a fraction of the token value. And either API providers are pricing based on model capability or it's an expensive model to serve (probably both).
English
0
0
4
1.2K
Justin Lin
Justin Lin@jtlin·
Wow @NVIDIAAI Nemotron 3 Nano Omni may be the perfect utility model for local agent stacks! One 30B model for text, images, video, speech / audio, files / PDFs. It can handle screenshots & videos for computer use, UX design, QA. It's A3B MoE so will run fast on Mac or DGX Spark. Multiple models, cloud APIs, and/or a maze of pre-processing pipelines for every format were required before. Now for 25gb of unified memory you can have this running 24/7, fully private, alongside your main agentic reasoning model.
NVIDIA AI@NVIDIAAI

Nemotron 3 Nano Omni was designed for powering subagents. Instead of stitching together separate models for language, vision, and speech, it ties them into a single architecture that more efficiently feeds context to orchestrators.

English
0
0
1
175