Sidharth

202 posts

@sid250581

AI Engineer. Building to internalize. 21 · MedAI

India · Joined November 2023
523 Following · 55 Followers
Sidharth @sid250581
Gemma 4 31B works well, but it fails to produce valid JSON output over a long run.
0
0
0
24
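One common way to guard against malformed JSON in long runs is to validate every response and retry when parsing fails. The sketch below is only an illustration of that idea, not the setup used in the post: `generate` is a hypothetical callable standing in for whatever client sends the prompt to the model.

```python
import json
from typing import Callable, Optional


def generate_json(
    generate: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the generator until it yields a parseable JSON object, or give up."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        # Keep only the span between the first "{" and the last "}",
        # since models often wrap JSON in extra prose.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            continue
        try:
            parsed = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising lets a long batch job log the failure and move on to the next item.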
Sidharth @sid250581
I didn't know that splitting a 31B dense model across a 3090 + 3060 is slower than just using the 3090 alone. The memory bandwidth mismatch destroys any VRAM gain. The 3060 contributed extra VRAM on paper, but the communication overhead between GPUs was worse than offloading 2–3 layers to RAM and staying on one card.

Dual-GPU MTP even crashed mid-run: GGML_ASSERT failed → ggml_reshape_3d → llm_build_gemma4_mtp

The fix: SPLIT_MODE=none, MAIN_GPU=0, PARALLEL=1. Single 3090. That's it.
0
0
0
13
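The three settings in the fix look like variables from a launch script. As a rough sketch, assuming they map onto llama-server's `--split-mode`, `--main-gpu`, and `--parallel` options, a single-GPU command line could be assembled like this. The function name and the model path are hypothetical.

```python
def single_gpu_args(model_path: str, main_gpu: int = 0) -> list[str]:
    """Build a llama-server argument list that keeps the whole model on one GPU."""
    return [
        "llama-server",
        "-m", model_path,               # hypothetical path to the GGUF file
        "--split-mode", "none",         # SPLIT_MODE=none: do not split across GPUs
        "--main-gpu", str(main_gpu),    # MAIN_GPU=0: run everything on this device
        "--parallel", "1",              # PARALLEL=1: a single decoding slot
    ]


if __name__ == "__main__":
    print(" ".join(single_gpu_args("gemma-31b.gguf")))
```

The list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.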
Sidharth @sid250581
6.85x faster generation on Gemma 4 31B: same RTX 3090, same context window.

Atomic llama-server with MTP heads:
> 47.51 tok/s generated
> 327 tok/s prompt processing

Homebrew llama-server, no MTP:
> 6.94 tok/s
> partial CPU offload

Key differences that actually matter:
> MTP (Multi-Token Prediction) heads enabled vs disabled
> TurboQuant KV cache fits everything on-device at 131k context
> Atomic build: -ngl 999, no CPU spillover
> Homebrew: forced -ngl 45

VRAM profile at 131k ctx:
Model: ~17.5 GiB
Context: ~2.2 GiB
Compute: ~0.5 GiB
Free: ~3.1 GiB headroom → enough to push to 262k context

🔴 Tested 262144 context on the 3090. It holds at 40-45 tok/s with -b 1024 -ub 256. That's 262k tokens of active context on consumer hardware, with little degradation in generation speed.

If VRAM is tight at 262k, ladder down: reduce batch buffers first (-b 512 -ub 128), then try turbo2 KV cache, then pull a few layers to RAM. Avoid touching --parallel; keep it at 1.

The 3090 is genuinely underrated for 31B inference if you're running the right stack.

Anyone else benchmarking MTP vs non-MTP on llama.cpp builds? Curious if the gap holds on 4090s or if it closes.
0
1
2
58
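The numbers in the post can be checked with a little arithmetic. The sketch below recomputes the speedup and the VRAM total from the quoted figures, then estimates the extra KV cache at 262k. Two assumptions are not in the post: that "131k" means 131072 tokens, and that the KV cache grows linearly with context length.

```python
# Figures quoted in the post (GiB, approximate).
budget_131k = {"model": 17.5, "context": 2.2, "compute": 0.5}
free_131k = 3.1

used_131k = sum(budget_131k.values())
total_visible = used_131k + free_131k

# Generation speed with MTP vs without, in tok/s.
speedup = 47.51 / 6.94

# ASSUMPTION: 131k means 131072 tokens and KV cache scales linearly with context.
ctx_small, ctx_large = 131072, 262144
extra_context = budget_131k["context"] * (ctx_large / ctx_small - 1)
fits = extra_context <= free_131k

print(f"used {used_131k:.1f} GiB of {total_visible:.1f} GiB")
print(f"speedup {speedup:.2f}x")
print(f"extra KV cache at 262k: {extra_context:.1f} GiB, fits: {fits}")
```

Under the linear assumption, doubling the context needs about as much extra memory as the current context allocation, which stays inside the reported headroom.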
Sidharth @sid250581
Day 2 of the DevOps course. 20+ bash commands down today. Also touched Vim for the first time. Spent the first 10 minutes figuring out how to exit that thing; it needs muscle memory to do it fast. But how often do you all use Vim? I have seen @ThePrimeagen using it.
9
0
43
10.8K
Sidharth @sid250581
🚀 Just merged my PR into @NVIDIAHealth's NV-Generate-CTMR! Fixed an AttributeError raised when `cfg_guidance_scale` was missing from GPU inference configs (e.g. the 16G/24G presets). Paired CT inference now runs on limited VRAM without crashing: the value defaults safely to 0.0 while keeping full compatibility → github.com/NVIDIA-Medtech…
0
0
3
286
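The kind of fix described here, falling back to a default when a config attribute is absent, can be sketched as follows. This is not the code from the PR; the helper name and the `SimpleNamespace` stand-in for the loaded config are assumptions for illustration.

```python
from types import SimpleNamespace


def get_cfg_guidance_scale(args) -> float:
    """Read cfg_guidance_scale, defaulting to 0.0 when the preset omits it."""
    # getattr with a default avoids the AttributeError that a plain
    # `args.cfg_guidance_scale` raises on configs without the field.
    return float(getattr(args, "cfg_guidance_scale", 0.0))


# Hypothetical configs: one preset without the field, one with it.
low_vram_preset = SimpleNamespace(num_inference_steps=30)
full_preset = SimpleNamespace(num_inference_steps=1000, cfg_guidance_scale=1.5)

print(get_cfg_guidance_scale(low_vram_preset))
print(get_cfg_guidance_scale(full_preset))
```

Configs that already set the field keep their value, so existing presets behave as before.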
Sidharth @sid250581
@NVIDIAHealth's NV-Generate-CTMR just gave me my first synthetic chest CT, generated on my RTX 3090 in under 30 steps.

Not a real patient scan. Fully synthetic. 256³ volume, 1.5×1.5×2.0mm spacing, paired segmentation mask included.

Config: anatomy_list ["lung tumor"], so the pipeline pulled a real training mask with a tumor seed.

The cool part is that it uses:
- Autoencoder (VAE)
- Diffusion U-Net
- ControlNet
- Mask Generation Autoencoder
- Mask Generation Diffusion U-Net
but it peaks at only 15 GB of VRAM, which is very memory-efficient.

Quoted post, Sidharth @sid250581:
Running NV-Generate-CTMR on my RTX 3090 right now.
> Downloading the rflow-ct weights, targeting lung tumor generation.
> The 16g config drops inference from 1000 steps to 30. That's the difference.
Testing if this runs clean on 24GB VRAM today. Will post the actual output.
0
0
0
24
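To put the volume figures in physical terms, the short calculation below converts the 256³ grid and the 1.5×1.5×2.0 mm spacing into a field of view and a raw array size. It assumes the spacing is listed in the same axis order as the grid and that voxels are stored as 32-bit floats; neither detail is stated in the post.

```python
shape = (256, 256, 256)        # voxels per axis, from the post
spacing_mm = (1.5, 1.5, 2.0)   # voxel spacing in mm, assumed same axis order

# Physical extent covered along each axis.
extent_mm = tuple(n * s for n, s in zip(shape, spacing_mm))  # (384.0, 384.0, 512.0)

# Total voxel count and raw size, ASSUMING float32 storage (4 bytes per voxel).
voxels = shape[0] * shape[1] * shape[2]                      # 16777216
size_mib = voxels * 4 / 2**20                                # 64.0

print(f"extent: {extent_mm} mm")
print(f"voxels: {voxels}, raw size: {size_mib} MiB")
```

The output volume itself is small next to the 15 GB peak, so under these assumptions most of the memory goes to the networks and their activations rather than to the image.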
Sidharth @sid250581
Running NV-Generate-CTMR on my RTX 3090 right now.
> Downloading the rflow-ct weights, targeting lung tumor generation.
> The 16g config drops inference from 1000 steps to 30. That's the difference.
Testing if this runs clean on 24GB VRAM today. Will post the actual output.

Quoted post, NVIDIA Healthcare @NVIDIAHealth:
(1/2) 🚨 Data scarcity is the #1 blocker in medical imaging AI. We built the open-source fix. NV-Generate-CTMR synthesizes realistic 3D CT & MRI volumes at scale, with paired segmentation masks, so you can train more robust models without touching real patient data.
0
1
0
167
Sidharth @sid250581
Just enrolled in Harkirat's DevOps course. Going to document everything I learn here: every concept, every command, and the stuff that breaks. Follow along if you're on the same path.
0
1
8
712
Sidharth @sid250581
Thanks for asking. When you get to the root of it, medical notes are messy: long-range documents, nested entities, etc. Bidirectional attention over the input helps the model better understand the full context. It's also tiny enough to deploy on-device (1B). I was just curious to fine-tune it first, especially for the medical domain. That's it.
1
0
1
16
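The "bidirectional attention over the input" mentioned above is what a prefix-LM attention mask provides: every position can see the whole input prefix, while generated positions remain causal. A minimal sketch of such a mask in plain Python follows. The function name is made up for illustration, and this is the general pattern, not the model's actual implementation.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[i][j] = True when position i may attend to position j."""
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        # Any position sees the full prefix (bidirectional part);
        # beyond the prefix, a position sees only itself and earlier ones.
        [(j < prefix_len) or (j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in prefix_lm_mask(seq_len=5, prefix_len=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` this reduces to the standard causal mask, which shows that the only difference from a causal LM is how the input tokens attend to each other.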
Sidharth @sid250581
the only ubuntu user among mac users. this is how i got into mlx india chennai.

fine-tuned @Sapient_Int HRM-Text-1B on nvidia/Nemotron-PII for medical PII de-identification. as far as i can tell, nobody has fine-tuned this model before.

what made it non-trivial: HRM is a PrefixLM, not a standard causal LM. the checkpoint ships with legacy fused weights (gqkv_proj, gate_up_proj) that don't map to the current transformers implementation. had to write a custom conversion layer to get zero missing weights.

i will be publishing the full code and a step-by-step guide soon.

shoutout to @sabeshbharathi for organizing this amazing event
1
0
0
95
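A conversion layer for fused weights usually splits each fused matrix into the separate projections the target implementation expects, then checks that no expected key is left unfilled. The sketch below shows that pattern with plain nested lists. The post does not give the row layout of `gqkv_proj` or `gate_up_proj`, so the layout table, the target names (`gate_proj`, `up_proj`), and all helper names are assumptions, not the author's code.

```python
def split_rows(weight, sizes):
    """Split a row-major matrix (list of rows) into consecutive row blocks."""
    if sum(sizes) != len(weight):
        raise ValueError("split sizes do not cover the fused weight")
    blocks, start = [], 0
    for size in sizes:
        blocks.append(weight[start:start + size])
        start += size
    return blocks


def convert_fused(state, layout):
    """Replace fused entries with their parts; layout maps suffix -> [(name, rows)]."""
    converted = {}
    for key, weight in state.items():
        for fused, parts in layout.items():
            if key.endswith(fused):
                prefix = key[: -len(fused)]
                blocks = split_rows(weight, [rows for _, rows in parts])
                for (name, _), block in zip(parts, blocks):
                    converted[prefix + name] = block
                break
        else:
            converted[key] = weight  # not fused: copy through unchanged
    return converted


def missing_keys(converted, expected):
    """Keys the target model expects that the conversion did not produce."""
    return sorted(set(expected) - set(converted))


# ASSUMED layout: gate rows first, then up rows, two rows each in this toy example.
layout = {"gate_up_proj": [("gate_proj", 2), ("up_proj", 2)]}
toy_state = {"layers.0.mlp.gate_up_proj": [[1], [2], [3], [4]]}
toy_converted = convert_fused(toy_state, layout)
print(missing_keys(toy_converted, ["layers.0.mlp.gate_proj", "layers.0.mlp.up_proj"]))
```

An empty result from `missing_keys` corresponds to the "zero missing weights" goal; the real split sizes have to come from the model's config.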
Sidharth @sid250581
i am sidharth, from tamil nadu. over a year building AI for health: voice-based TB screening, on-device oral cancer detection, knowledge graphs for traditional spinal medicine.

Won BioxAI ($15K). Google's MedGemma: 2nd place ($5K).

> back to learning
> back to building
> same mission
0
0
1
55
Aritra 🤗 @ariG23498
I try to blog my way into a new project. This time around I am deep diving into a vast project. The output artifacts could be a video, a blog post, and many small posts. This would be about profiling, kernels, language model deployments, and much more. I hope to make the best use of my time and create some content that gets people going! Wish me luck. 🤗
7
0
46
1.3K