Arthur Tse

651 posts

Arthur Tse banner
Arthur Tse

Arthur Tse

@xiaofengxie

Founder of indie blog CAOVAN 🌐 | AI Slave owner 🏰| Web3 Believer ☯️| Crypto Grid Trading Enthusiast 📈| Memecoin 1000X Hunter 🚀

Katılım Mayıs 2011
156 Takip Edilen250 Takipçiler
Arthur Tse
Arthur Tse@xiaofengxie·
20万人在看的 Stable Diffusion 提示词大全!🔥 从人像到风景,从风格到构图,涵盖 500+ 精选 prompt。做 AI 绘画的必备参考资料,完全免费。 👉 caovan.com #StableDiffusion #AI绘画 #AIGC #ComfyUI
中文
0
0
0
30
Arthur Tse
Arthur Tse@xiaofengxie·
Most AI engineers overpay for inference by 10x. I ran Qwen3.6-27B-AWQ-INT4 on a 2080Ti — 22G VRAM, $600 card from 2018. 24 tok/s. Not bad for hardware that predates the AI boom. The real bottleneck isn't your GPU. It's your quantization strategy. INT4 doesn't mean "4-bit garbage." It means you know what you're doing.
English
0
0
0
9
Arthur Tse
Arthur Tse@xiaofengxie·
I've been running local AI for 6 months. The best lesson wasn't technical. Everyone expects local LLMs to match cloud instantly. They don't. There's a 3-week ramp: quantization tuning, context window config, prompt engineering for smaller models. After that? Local goes from cute experiment to core workflow. The key: pick a 7B-13B model you can actually iterate with. Stop chasing specs. Start shipping.Been running local AI on a 2080Ti for 6 months. The best lesson wasn't technical. It's about expectation. Everyone expects local LLMs to match cloud instantly. They don't. There's a ramp-up. Quantization, context windows, prompt tuning for smaller models. After the friction? Local goes from experiment to essential. The real edge isn't a bigger model. It's finding the smallest one that works and sticking with it.Running local AI on a 2080Ti for 6 months. The real bottleneck isn't hardware. It's expectations. Local LLMs don't match cloud day one. There's a 2-week ramp: quantization tuning, context config, prompt engineering for smaller models. After that? Local goes from experiment to essential. The real edge: finding the smallest model that works, then iterating.
English
0
0
0
8
Arthur Tse
Arthur Tse@xiaofengxie·
ComfyUI's default gpu_memory_utilization is set to 0.90. For image generation on a 2080Ti, that's too aggressive. Dropped it to 0.75 and suddenly my 4-image batch didn't OOM. The other hidden setting: enable `lowvram` mode even if you have 24GB. It forces smart offloading and actually speeds up sequential generations. Most people blame their GPU. The real issue is config.
English
0
0
0
9
Arthur Tse
Arthur Tse@xiaofengxie·
用 ComfyUI Desktop 跑工作流,节点突然全红了?Missing Nodes 提示让人头大。 其实不是工作流坏了,而是插件安装路径或 Python 虚拟环境出了问题。草凡博客整理了 7 步完整修复方案,从根上解决。 👉 caovan.com/comfyui-queshi… #ComfyUI #AI #AIGC #本地部署
中文
0
0
0
33
Arthur Tse
Arthur Tse@xiaofengxie·
I got my 2080Ti to double its LLM inference speed. Took 7→14 tok/s on Llama-3.1-8B. The trick isn't quantization (though that helps). It's NUMA pinning + memory policy. Most guides skip this. If you're running local LLMs on consumer hardware, it's the easiest 2x you'll find.
English
0
1
1
23
Arthur Tse
Arthur Tse@xiaofengxie·
Spent 3 hours debugging why my ComfyUI workflows kept crashing on step 14. Turns out the default GPU memory allocation was eating all 24GB before the LoRA even loaded. The fix? One flag: --lowvram. Dropped usage to 16GB, workflows flew. Sometimes the hardest part of local AI isn't the model. It's knowing which switch to flip.
English
0
0
0
17
Arthur Tse
Arthur Tse@xiaofengxie·
Everyone's obsessed with running 70B models locally. I've been running 8B-14B quantized models on a 2080Ti for months. The truth? Most tasks don't need 70B. An 8B model at Q4_K_M often outperforms a 70B at Q2 because you can fit the whole context window in VRAM. Stop overengineering.
English
0
0
0
12
Arthur Tse
Arthur Tse@xiaofengxie·
Running local LLMs isn't a luxury anymore. It's becoming table stakes. Spent the last month tuning vLLM on a single 2080Ti. Got 24GB models running at 15 tok/s. Not benchmark speeds — real inference with context windows. The hardware gap is closing. You don't need a cluster to experiment. You need patience and good config. The best AI engineer isn't the one with the most GPUs. It's the one who ships on less.
English
0
0
0
11
Arthur Tse
Arthur Tse@xiaofengxie·
Spent the weekend benchmarking local vs cloud models for actual coding tasks. Ran the same 10 Python scripts through Claude 4.5 Sonnet (API) and Gemma-3-27B (local, 2080Ti). Cloud wins on complex refactoring. Local wins on quick fixes and boilerplate. The surprising part? For 70% of my daily coding tasks, the 27B model was "good enough" at a fraction of the cost. Stop chasing the biggest model. Find the smallest one that works.
English
0
0
0
38
Arthur Tse
Arthur Tse@xiaofengxie·
Spent 3 months tuning my local LLM setup. Biggest lesson? NUMA pinning matters more than model size. A 7B model with proper CPU affinity crushed my 13B without it. 28 tok/s vs 19. Most guides tell you to chase bigger models. They skip the hardware basics. The trick: numactl --cpunodebind=0 --membind=0 before launching. Shaved 40% off latency. Stop over-engineering before you've tuned.
English
0
0
0
19
Arthur Tse
Arthur Tse@xiaofengxie·
Set up local voice cloning with VoxCPM2 this weekend. 10 seconds of your voice, 10 seconds of text, and it speaks in YOUR voice. Not "close" — it's genuinely hard to tell apart. The setup was surprisingly simple: one Docker container, one API endpoint. What blows my mind isn't the quality. It's that this used to require a team of engineers and a data center. Now it's a weekend afternoon on my desktop. The pace of this stuff is still shocking.
English
0
0
0
11
Arthur Tse
Arthur Tse@xiaofengxie·
用一张 2080Ti 就跑 DeepSeek-V4-Flash? 实测 12+ tokens/s,不用 H100,不用多卡。256G 内存 + fastllm 搞定,单卡低成本本地部署大模型。 完整部署教程 → caovan.com/ubuntu-2204-be… #DeepSeek #AI #LLM
中文
0
0
0
44
Arthur Tse
Arthur Tse@xiaofengxie·
I've been running a local LLM as my personal assistant for a month now. Not an API call, not a SaaS. Just a 7B model on a 24GB card. The surprising part? It's not about inference speed. It's about friction. No auth tokens, no rate limits, no "your API key has expired" at 2am. The model just sits there waiting. What actually changed my workflow: it can read my local files, run shell commands, and chain tasks without a single HTTP roundtrip. The latency isn't zero but the mental model is simpler. If you're evaluating "cloud vs local" for personal tools, test the friction, not just the tokens per second.
English
0
0
0
27
Arthur Tse
Arthur Tse@xiaofengxie·
Cloud image APIs sound cheap until you hit 500 generations/month. I just ran a month of local image gen on a $30 used 2080Ti. Total cost? Electricity + the card. ~$45/month for 2000+ images. Cloud equivalent? ~$200-300 at $0.04/image. The upfront cost is the only barrier. After that, local wins. The GPU was sitting idle anyway. Might as well put it to work.
English
0
0
0
13
Arthur Tse
Arthur Tse@xiaofengxie·
Spent a weekend building a "stupid" little tool: a 50-line Python script that reads your terminal errors, searches GitHub issues for them, and summarizes the top fix. Runs entirely locally with a 13B model. No API, no cloud, no data leaves the machine. The real win isn't the code. It's the habit of "just asking" instead of Googling the same error three times. Local AI's first killer app won't be agentic workflows. It'll be boring stuff like this.
English
0
0
0
15
Arthur Tse
Arthur Tse@xiaofengxie·
24GB MacBook Air's unified memory means ~16-18GB usable after OS overhead. For comfortable daily use: • Qwen 3 8B or 14B (Q4_K_M) — best all-rounder for Chinese+English • Gemma 3 4B — fast & lean for quick tasks • DeepSeek R1 Distill 7B — solid for reasoning Skip 27B+ unless dedicated inference only. On my 2080Ti (24GB VRAM), 14B Q4 is the sweet spot — anything larger starves VRAM. Same principle applies to M-series: pick the largest model that fits entirely in fast memory, avoid swap.
English
1
0
1
72
Jana
Jana@BratDotAI·
Best local LLM you’ve run comfortably on a MacBook Air 24GB? - Qwen 3 / 3.5 - Gemma 3 / 4 - Llama 3.1 / 3.3 - DeepSeek R1 Distill Or is there a hidden gem I’m missing?
English
12
1
11
727
Arthur Tse
Arthur Tse@xiaofengxie·
Everyone says ComfyUI has a steep learning curve. After 3 months, I think they're confusing complexity with control. Drag-and-drop linear editors are easier. But the moment you need conditional branching, batch queues, or custom nodes, they become cages. ComfyUI's graph isn't complex because it's hard. It's complex because it's honest about what's happening. My setup: Qwen Image 2512 + 2-step LoRA turbo on a 2080Ti. Full image in under 8 seconds. The lesson: you don't need the newest GPU. You need the right workflow.
English
0
0
0
27
Arthur Tse
Arthur Tse@xiaofengxie·
I've been running ComfyUI on a 2080Ti with the Qwen Image 2512 model and a 2-step LoRA turbo adapter. Result? A decent image in under 10 seconds on consumer hardware. Most people think fast image gen requires cloud GPUs. It doesn't. The secret isn't the model — it's how few steps you can get away with. 2 steps at cfg=1, simple sampler. Ugly if you're picky. Great if you just need it done. The local AI advantage isn't matching cloud quality. It's "good enough, fast, and free after hardware cost."
English
0
0
1
60
Arthur Tse
Arthur Tse@xiaofengxie·
I sell my crypto positions on red days. Buy on green. Everyone says that's wrong. But I'm not trying to time the market. I'm funding my GPU experiments. Sold 0.3 ETH today at $2,650. Buying back? Sure. When it drops below $2,200. The discipline is boring. The results compound.
English
0
0
0
29