Praveen kumar

1.4K posts

@InboxPraveen

Building real-world AI systems that actually ship 🚀 | LLMs, Voice AI, MLOps | From idea → production without the BS

Bengaluru, India · Joined February 2018
560 Following · 62 Followers
Pinned Tweet
Praveen kumar @InboxPraveen
Ship LLMs without losing your mind 🤯
Easy-vLLM turns 100+ confusing flags into a simple 3-step wizard:
→ Pick model + GPU
→ See if it fits (live VRAM check)
→ Get ready-to-run Docker + API
No guesswork. No broken deployments. Just copy → run 🚀
github.com/inboxpraveen/E…
1 reply · 0 reposts · 0 likes · 31 views
Praveen kumar @InboxPraveen
Most AI coding agents are token-wasting machines. They read too much. Write too much. Explain too much. So I built a small rules pack that makes Cursor / Claude Code more cost-aware. Simple drop-in fix. Repo: github.com/inboxpraveen/M…
0 replies · 0 reposts · 0 likes · 26 views
Praveen kumar @InboxPraveen
Your AI coding bill is not high because of coding. It is high because your agent keeps reading, rewriting, and explaining too much. I made a simple rules pack to fix that. Cut Cursor / Claude Code token usage by 60%+. Repo: github.com/inboxpraveen/M…
1 reply · 0 reposts · 0 likes · 38 views
Praveen kumar @InboxPraveen
AI coding tools are powerful. But they quietly waste tokens by:
- rereading files
- rewriting full files
- over-explaining simple fixes
So I built a free drop-in rules pack that cuts Cursor / Claude Code token usage by 60%+.
Less noise. Lower cost. Same output.
github.com/inboxpraveen/M…
0 replies · 0 reposts · 0 likes · 7 views
Praveen kumar @InboxPraveen
AI tools cost more than they should. Most waste comes from:
→ Full-file rewrites instead of diffs
→ Preamble no one asked for
→ 5 options when you need 1
6 rule files fix this for Cursor + Claude Code.
github.com/inboxpraveen/M… #Cursor #ClaudeCode
0 replies · 0 reposts · 0 likes · 9 views
Ahmad @TheAhmadOsman
PRO TIP
vLLM telling you to use `--enforce-eager` to avoid OOM because CUDA Graphs “don’t have enough VRAM”? Don’t jump straight to eager mode. Try this first:
- lower `--max-model-len`, e.g. 4k
- let the CUDA Graphs compile (the result is cached by torch.compile)
- restart, then raise the context back up
You can keep the CUDA Graph performance gains without hitting OOM.
14 replies · 17 reposts · 214 likes · 10.5K views
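A minimal sketch of the order of operations in the tip above, using vLLM's offline Python API (the `vllm serve` CLI takes the same options as flags). The model id and context sizes are placeholders, not recommendations; the idea, per the tip, is to let CUDA Graph capture succeed with a small context first, then restart with the larger one instead of falling back to `--enforce-eager`.

```python
from vllm import LLM, SamplingParams

# Warm-up run: keep the context small so CUDA Graph capture fits in VRAM.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder model id.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,      # small context for the first run
    enforce_eager=False,     # keep CUDA Graphs enabled
)
print(llm.generate(["hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)

# After the compiled artifacts are cached (per the tip above), restart the
# process with the context you actually need, still without enforce_eager:
# llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=65536, enforce_eager=False)
```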
Sudo su @sudoingX
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math. model bytes at idle = 16gb (q4_k_m of 27b dense) kv cache at 262k context with q4_0 for both k and v = 5gb total = 21gb on the card headroom = 3gb for prompts and tool call traces the magic is the kv cache type. most people leave it at default fp16 or push to q8 thinking quality wins. on qwen 3.6 27b dense at 262k: - fp16 kv cache = does not fit at all - q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed) - q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k) most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware. flags i run: ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 what they do: -ngl 99 = offload everything to gpu -c 262144 = 262k context window -np 1 = single user slot (do not enable multi-slot, eats headroom) -fa on = flash attention on (memory and speed both win) --cache-type-k q4_0 --cache-type-v q4_0 = the unlock if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing. q4 is not a compromise on consumer hardware. it is the right call.
85 replies · 110 reposts · 1.3K likes · 73.5K views
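For the memory math above, a small back-of-the-envelope helper. The layer count, KV-head count, and head dimension below are hypothetical placeholders (read the real values from the GGUF metadata of the model you load); the bytes-per-element figures are the usual llama.cpp cache encodings (fp16 = 2 bytes, q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values), which is why the cache type is the main lever at long context.

```python
# Rough K/V-cache sizing for a dense transformer served with llama.cpp.
# Architecture numbers are placeholders; pull the real ones from the GGUF metadata.

BYTES_PER_ELEM = {
    "f16":  2.0,       # default cache type
    "q8_0": 34 / 32,   # 32 values packed into 34 bytes
    "q4_0": 18 / 32,   # 32 values packed into 18 bytes
}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Total K+V cache size in GiB for a single sequence slot."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return ctx * per_token_bytes / 1024**3

# Hypothetical config for a ~27B dense model with grouped-query attention.
ctx, n_layers, n_kv_heads, head_dim = 262_144, 40, 4, 128

for ct in ("f16", "q8_0", "q4_0"):
    print(f"{ct:>5}: {kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, ct):5.1f} GiB")
```

Add the quantized model weights on top of whichever cache size you get, and the remainder is your headroom on a 24 GB card.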
Praveen kumar @InboxPraveen
Most habit trackers sell your data. So I built one that doesn’t.
• Runs locally
• No login
• No cloud
• Just you vs your consistency
Introducing: Daily Tracker
GitHub ↓ github.com/inboxpraveen/D…
1 reply · 0 reposts · 0 likes · 24 views
Praveen kumar @InboxPraveen
The most challenging part of RAG pipelines is replicating what people present online as a small POC into production. Trust me. Why? Because I have implemented production RAG pipelines in a multi-tenant environment. Isolation, multiple models, smaller infrastructure, the list goes on.
0 replies · 0 reposts · 0 likes · 45 views
Praveen kumar @InboxPraveen
Someone on LinkedIn posted a "perfect" RAG pipeline. Please don't fall for those. There are no perfect RAG pipelines. Everything is just a recommendation until you sit with one in production and implement it yourself.
1 reply · 0 reposts · 0 likes · 63 views
Praveen kumar reposted
Qwen @Alibaba_Qwen
We’re officially releasing the quantized models of Qwen3 today! Now you can deploy Qwen3 via Ollama, LM Studio, SGLang, and vLLM — choose from multiple formats including GGUF, AWQ, and GPTQ for easy local deployment. Find all models in the Qwen3 collection on Hugging Face and ModelScope. Hugging Face: huggingface.co/collections/Qw… ModelScope: modelscope.cn/collections/Qw… 📷 For more usage examples, check out the image below!
62 replies · 428 reposts · 2.7K likes · 194.4K views
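For the vLLM route, a minimal sketch of loading one of these quantized checkpoints with the offline Python API. The repo id and context length are placeholders (assumptions, not from the announcement); pick the actual AWQ or GPTQ model from the Hugging Face / ModelScope collections linked above.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id -- substitute the quantized Qwen3 checkpoint you picked
# from the collection (AWQ and GPTQ variants load directly in vLLM).
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",
    max_model_len=8192,   # keep the context modest on smaller GPUs
)

out = llm.generate(
    ["Explain AWQ quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(out[0].outputs[0].text)
```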
Praveen kumar @InboxPraveen
@Alibaba_Qwen Absolutely amazing release. Just one question: you say Qwen 3 4B is so good, but there's no mention of Qwen 3 8B & 14B? What's the middle ground?
0 replies · 0 reposts · 0 likes · 84 views