Andrew Zhu

2.9K posts

Andrew Zhu

@xhinker

on Personal AI | open source contributor | startup founder member | ex-Microsoft blog: https://t.co/QjnjnNVoCD

WA, US 参加日 Ağustos 2009

250 フォロー中265 フォロワー

固定されたツイート

Andrew Zhu@xhinker·3 Haz

Today, book "Using Stable Diffusion with Python" is published. Hope it is useful for SD developers. #stablediffusion #diffusion #ai amazon.com/Using-Stable-D…

English

2.3K

Andrew Zhu@xhinker·1h

@LottoLabs Today, 27B finished a super complex code change, involves 7 code files, hundreds lines of code, and Qwen3.6-27B GOT id DONE, and running well. Previously, CC and Codex may fail on similar task, super super amazing

English

Lotto@LottoLabs·3h

Qwen 27b will absolutely nail tasks like this Unless you’re doing mid-high complexity coding 27b will pretty much do all tool calls/skill following you ask of it

Ariel@redtachyon

OpenClaw (with codex) just refused to torrent some anime for me. So anyways I'm now completely radicalized for open models and will be seeking a locally-hosted solution. Can't have a clanker disobey me.

English

851

Andrew Zhu@xhinker·10h

llama.cpp cli: llama-server \ -m /path/to/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \ --mmproj /path/to/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-F32.gguf \ --alias unsloth/Qwen3.6-35B-A3B-MTP-GGUF \ --host 0.0.0.0 \ -ngl 99 \ --batch-size 4096 \ --ubatch-size 1024 \ --flash-attn on \ --jinja \ --mlock \ --metrics \ --temp 0.5 \ --ctx-size 262144 \ --parallel 1 \ --image-min-tokens 1024 \ --reasoning-format none \ --timeout 1200 \ --port 8082 \ --spec-type draft-mtp,ngram-mod \ --spec-draft-n-max 2 \ --spec-draft-n-min 0 \ --spec-ngram-mod-n-match 24 \ --spec-ngram-mod-n-min 48 \ --spec-ngram-mod-n-max 64

Indonesia

Andrew Zhu@xhinker·10h

The old M1Max Macbook Pro is still workable, got 60t/s from Qwen3.6-34B-A3B-MTP , impressive!

English

Andrew Zhu@xhinker·10h

@digitalnoah @LottoLabs I tried, yes, now speed from around 45 -> 60

English

Noah King@digitalnoah·10h

@xhinker @LottoLabs Why no MTP? So much faster

English

Lotto@LottoLabs·13h

Hey apple bros is M1 at all viable for local models or too slow?

English

10.4K

Andrew Zhu@xhinker·10h

@wesleimade @LottoLabs with MTP, I got around 58t/s in this M1Max machine

English

Andrew Zhu@xhinker·10h

@wesleimade llama-server \ -m /Users/andrewzhu/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \ --mmproj /Users/andrewzhu/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-F32.gguf \ --alias unsloth/Qwen3.6-35B-A3B-MTP-GGUF \ --host 0.0.0.0 \ -ngl 99 \ --batch-size 1024 \ --ubatch-size 512 \ --flash-attn on \ --jinja \ --mlock \ --metrics \ --ctx-size 262144 \ --parallel 2 \ --kv-unified \ --image-min-tokens 1024 \ --reasoning-format none \ --timeout 1200 \ --ctx-checkpoints 32 \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ --port 8082

Indonesia

Andrew Zhu@xhinker·11h

@wesleimade @LottoLabs Llama.cpp

Español

wesleimade@wesleimade·11h

@xhinker @LottoLabs How do you run it? Lammacpp?

English

Andrew Zhu@xhinker·14h

@AiXsatoshi Do you use NVLink or just PCIe?

English

AI✖️Satoshi⏩️@AiXsatoshi·1d

llama.cpp tensor parallelテスト Qwen3.6-27B-Q4_K_M 　1 x GPU → 65tok/s 　2 x GPU → 98tok/s

日本語

4.1K

Andrew Zhu@xhinker·15h

Self improve AI is nothing new, Anthrophic, again, stealing other people's idea, entitle for themself.

English

Andrew Zhu@xhinker·15h

@ZeroZ_JQ 亲测有效, 我的AI 小伙伴表示非常开心, 简直就是随心所欲, 发发命中

中文

141

Andrew Zhu@xhinker·15h

@ZeroZ_JQ 我给你一个第五选项, HTML+JS+CSS, 亲测非非常好, 这些框架以前都是要么补足人的不足, 或者适应团队开发, 现在还用这些干啥, 又慢, bug 有多, 是时候回归原本了

中文

关木@ZeroZ_JQ·1d

AI 给我提出了史诗级的难题，我犹豫了半小时了

中文

108

225

85.6K

Andrew Zhu@xhinker·15h

@0xSero make sense, thank you, looks like it is the only way out

English

313

0xSero@0xSero·15h

@xhinker You'd plug it in to the washing machine outlet, our spread it over a few circuits.

English

327

0xSero@0xSero·16h

I am one of the 10 most knowledgable people in self hosting in residential areas. I have pretty much done everything there is to do, I wonder how valuable my version of a train model obsession will be. - intel - AMD - Apple - spark / 3090s / 6000s Cooling, power management

English

377

23.1K

Andrew Zhu@xhinker·15h

@0xSero I know, I am doing so now, but every US wall socket have a Waltage limitation, otherwise the fuse will be tripped. how do you solve the single socket waltage limitation? use multple wall sockets?

English

324

0xSero@0xSero·15h

@xhinker Power capping the gpus

English

1.1K

Andrew Zhu@xhinker·16h

@Jackywine 我的AI 工具已经是在自己构建自己了

中文

427

Jackywine@Jackywine·1d

今天 Anthropic 这篇文章，被所有人转发了 anthropic.com/institute/recu… 但是只有真正去官网看的人才会体会到，这段动画的“恐怖感” 递归开始了

中文

612

174.6K

Andrew Zhu@xhinker·16h

@jianshuo 干这些小活, 完全不用cc , 本地 qwen3.6-27B 从未失手

中文

Jianshuo Wang@jianshuo·16h

Claude Code真是整理神器，用它整理文件夹，整理Photos里的照片，整理百度网盘，简直太爽了，陈年老文件得救了

中文

1.1K

Andrew Zhu@xhinker·16h

while NVFP4 is faster in terms of prefill, my practical experience shows, Int4, FP4 or Q4 will give stupid results when handling complex task, and some time even random XML tags that break everything when tool call reach to 50 times. Q6 don't have these problem, so, stick with Q6 for now

English

190

witcheer@witcheer·21h

everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K_M (same 4-bit budget) >Q6_K - same llama.cpp b9365 and same harness. ~~~ prefill (processing your prompt): NVFP4 wins big, and it's real. +32 to 42% over equal-bit Q4_K_M at every context from 512 to 16k, so that gain is pure FP4-tensor-core compute. vs Q6 it's +52 to 68%. concretely at pp512: 5415 tok/s vs 3826 (Q4) vs 3222 (Q6). ~~~ decode (generating tokens): here's the myth. vs an equal-size Q4 it moves only +9% (84 vs 77 tok/s). the headline "+36% vs Q6" decode number isn't the FP4 cores at all but it's just NVFP4 being smaller (14.6GB vs 21GB). decode is memory-bandwidth bound, so it tracks footprint, not how the weights are packed. prefill = compute, decode = size. ~~~ the 4-bit tax is almost nothing: 93.2 vs 94.0 q_avg across five tasks vs Q6. MMLU, ARC, HellaSwag, GSM8K all land within half a point; only code dips meaningfully (HumanEval 90.2 vs 92.7). net, vs the Q6 a lot of people serve: ~+60% prefill +36% decode -30% VRAM (17.3 vs 23.5GB) for -0.8 quality. for an always-on local agent that's an easy yes - faster replies, more context headroom, and 6GB of VRAM handed back.

English

111

16.4K

Andrew Zhu@xhinker·1d

@boson_ai Congrats!, the result is amazing

English

403

Andrew Zhu がリツイート

BosonAI@boson_ai·1d

Higgs Audio v3 TTS is here. Built for voice AI that speaks, not just reads: • 100 languages with single-digit WER/CER • inline control over emotion, style, prosody, and sound effects • API, Workspace, and open weights • Blog 👉 boson.ai/blog/higgs-aud… Watch the demo 👇

English

365

41.8K

ディスカバー

@LottoLabs @digitalnoah @wesleimade @AiXsatoshi @ZeroZ_JQ @0xSero @elonmusk @BarackObama