Praveen kumar

1.4K posts

@InboxPraveen

Building real-world AI systems that actually ship 🚀 | LLMs, Voice AI, MLOps | From idea → production without the BS

Bengaluru, India · Joined February 2018
560 Following · 62 Followers
Pinned Tweet
Praveen kumar @InboxPraveen
Ship LLMs without losing your mind 🤯
Easy-vLLM turns 100+ confusing flags into a simple 3-step wizard:
→ Pick model + GPU
→ See if it fits (live VRAM check)
→ Get ready-to-run Docker + API
No guesswork. No broken deployments. Just copy → run 🚀
github.com/inboxpraveen/E…
1 reply · 0 reposts · 0 likes · 31 views
Praveen kumar @InboxPraveen
Most AI coding agents are token-wasting machines. They read too much. Write too much. Explain too much. So I built a small rules pack that makes Cursor / Claude Code more cost-aware. Simple drop-in fix. Repo: github.com/inboxpraveen/M…
0 replies · 0 reposts · 0 likes · 26 views
Praveen kumar @InboxPraveen
Your AI coding bill is not high because of coding. It is high because your agent keeps reading, rewriting, and explaining too much. I made a simple rules pack to fix that. Cut Cursor / Claude Code token usage by 60%+. Repo: github.com/inboxpraveen/M…
1 reply · 0 reposts · 0 likes · 38 views
Praveen kumar @InboxPraveen
AI coding tools are powerful. But they quietly waste tokens by:
- rereading files
- rewriting full files
- over-explaining simple fixes
So I built a free drop-in rules pack that cuts Cursor / Claude Code token usage by 60%+.
Less noise. Lower cost. Same output.
github.com/inboxpraveen/M…
0 replies · 0 reposts · 0 likes · 7 views
Praveen kumar @InboxPraveen
AI tools cost more than they should. Most waste comes from:
→ Full-file rewrites instead of diffs
→ Preamble no one asked for
→ 5 options when you need 1
6 rule files fix this for Cursor + Claude Code.
github.com/inboxpraveen/M… #Cursor #ClaudeCode
0 replies · 0 reposts · 0 likes · 9 views
Ahmad @TheAhmadOsman
PRO TIP
vLLM telling you to use `--enforce-eager` to avoid OOM because CUDA Graphs “don’t have enough VRAM”? Don’t jump straight to eager mode. Try this first:
- lower `--max-model-len`, e.g. 4k
- let the CUDA Graphs compile (the result is cached by torch.compile)
- restart, then raise the context back up
You can keep the CUDA Graph performance gains without hitting OOM.
14 replies · 17 reposts · 214 likes · 10.5K views
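A minimal sketch of the order of operations in the tip above, using vLLM's offline Python API (the `vllm serve` CLI takes the same options as flags). The model id and context sizes are placeholders, not recommendations; the idea, per the tip, is to let CUDA Graph capture succeed with a small context first, then restart with the larger one instead of falling back to `--enforce-eager`.

```python
from vllm import LLM, SamplingParams

# Warm-up run: keep the context small so CUDA Graph capture fits in VRAM.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder model id.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,      # small context for the first run
    enforce_eager=False,     # keep CUDA Graphs enabled
)
print(llm.generate(["hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)

# After the compiled artifacts are cached (per the tip above), restart the
# process with the context you actually need, still without enforce_eager:
# llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=65536, enforce_eager=False)
```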
Sudo su @sudoingX
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math. model bytes at idle = 16gb (q4_k_m of 27b dense) kv cache at 262k context with q4_0 for both k and v = 5gb total = 21gb on the card headroom = 3gb for prompts and tool call traces the magic is the kv cache type. most people leave it at default fp16 or push to q8 thinking quality wins. on qwen 3.6 27b dense at 262k: - fp16 kv cache = does not fit at all - q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed) - q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k) most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware. flags i run: ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 what they do: -ngl 99 = offload everything to gpu -c 262144 = 262k context window -np 1 = single user slot (do not enable multi-slot, eats headroom) -fa on = flash attention on (memory and speed both win) --cache-type-k q4_0 --cache-type-v q4_0 = the unlock if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing. q4 is not a compromise on consumer hardware. it is the right call.
85 replies · 110 reposts · 1.3K likes · 73.5K views
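For the memory math above, a small back-of-the-envelope helper. The layer count, KV-head count, and head dimension below are hypothetical placeholders (read the real values from the GGUF metadata of the model you load); the bytes-per-element figures are the usual llama.cpp cache encodings (fp16 = 2 bytes, q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values), which is why the cache type is the main lever at long context.

```python
# Rough K/V-cache sizing for a dense transformer served with llama.cpp.
# Architecture numbers are placeholders; pull the real ones from the GGUF metadata.

BYTES_PER_ELEM = {
    "f16":  2.0,       # default cache type
    "q8_0": 34 / 32,   # 32 values packed into 34 bytes
    "q4_0": 18 / 32,   # 32 values packed into 18 bytes
}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Total K+V cache size in GiB for a single sequence slot."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return ctx * per_token_bytes / 1024**3

# Hypothetical config for a ~27B dense model with grouped-query attention.
ctx, n_layers, n_kv_heads, head_dim = 262_144, 40, 4, 128

for ct in ("f16", "q8_0", "q4_0"):
    print(f"{ct:>5}: {kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, ct):5.1f} GiB")
```

Add the quantized model weights on top of whichever cache size you get, and the remainder is your headroom on a 24 GB card.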
Praveen kumar @InboxPraveen
Most habit trackers sell your data. So I built one that doesn’t.
• Runs locally
• No login
• No cloud
• Just you vs your consistency
Introducing: Daily Tracker
GitHub ↓ github.com/inboxpraveen/D…
1 reply · 0 reposts · 0 likes · 24 views
Praveen kumar @InboxPraveen
The most challenging part of RAG pipelines is replicating what people present online as a small POC into production. Trust me. Why? Because I have implemented production RAG pipelines in a multi-tenant environment. Isolation, multiple models, smaller infrastructure, the list goes on.
0 replies · 0 reposts · 0 likes · 45 views
Praveen kumar @InboxPraveen
Someone on LinkedIn posted a "perfect" RAG pipeline. Please don't fall for those. There are no perfect RAG pipelines. Everything is just a recommendation until you sit with one in production and implement it yourself.
1 reply · 0 reposts · 0 likes · 63 views
Praveen kumar reposted
Qwen @Alibaba_Qwen
We’re officially releasing the quantized models of Qwen3 today! Now you can deploy Qwen3 via Ollama, LM Studio, SGLang, and vLLM — choose from multiple formats including GGUF, AWQ, and GPTQ for easy local deployment. Find all models in the Qwen3 collection on Hugging Face and ModelScope. Hugging Face: huggingface.co/collections/Qw… ModelScope: modelscope.cn/collections/Qw… 📷 For more usage examples, check out the image below!
62 replies · 428 reposts · 2.7K likes · 194.4K views
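For the vLLM route, a minimal sketch of loading one of these quantized checkpoints with the offline Python API. The repo id and context length are placeholders (assumptions, not from the announcement); pick the actual AWQ or GPTQ model from the Hugging Face / ModelScope collections linked above.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id -- substitute the quantized Qwen3 checkpoint you picked
# from the collection (AWQ and GPTQ variants load directly in vLLM).
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",
    max_model_len=8192,   # keep the context modest on smaller GPUs
)

out = llm.generate(
    ["Explain AWQ quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(out[0].outputs[0].text)
```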
Praveen kumar @InboxPraveen
@Alibaba_Qwen Absolutely amazing release. Just one question: you say Qwen 3 4B is so good, but there's no mention of Qwen 3 8B & 14B? What's the middle ground?
0 replies · 0 reposts · 0 likes · 84 views