Sidharth

202 posts

@sid250581

AI Engineer. Building to internalize. 21 · MedAI

India · Joined November 2023

523 Following · 55 Followers

Pinned Tweet

Sidharth@sid250581·17 Apr

We won 2nd place in the Agentic Track of the @GoogleResearch MedGemma Hackathon. Yes, we (me, @c_sachiv, and ramaswamy) built a voice-based TB screening tool instead of a chat-style one. Shout-out to @kaggle and the whole @GoogleResearch team for organizing this. #MedgemmaImpactChallenge #Gemma #Google #kaggle

Sidharth@sid250581·6h

Gemma 4 31B works well, but its JSON output starts failing over long runs.
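One common guard against this kind of failure is to validate every response and retry on a parse error. A minimal sketch, assuming a hypothetical `generate` callable that returns the model's raw text; the function names, the retry count, and the "first parseable object" heuristic are illustrative and not taken from the post:

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object that parses inside text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None


def generate_json(generate: Callable[[str], str], prompt: str,
                  max_attempts: int = 3) -> dict:
    """Call the (hypothetical) generate function until it yields valid JSON."""
    for _ in range(max_attempts):
        parsed = extract_json(generate(prompt))
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Retrying hides the failure rather than fixing it, so it is worth logging how often the fallback triggers as the run gets longer.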

Sidharth@sid250581·6h

I didn't know that splitting a 31B dense model across a 3090 + 3060 is slower than just using the 3090 alone. The memory bandwidth mismatch destroys any VRAM gain. The 3060 contributed extra VRAM on paper, but the communication overhead between GPUs was worse than just offloading 2–3 layers to RAM and staying on one card.

Dual-GPU MTP even crashed mid-run: GGML_ASSERT failed → ggml_reshape_3d → llm_build_gemma4_mtp

The fix: SPLIT_MODE=none, MAIN_GPU=0, PARALLEL=1. Single 3090. That's it.
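A rough back-of-envelope for why the second card can hurt. This sketch assumes token generation is memory-bandwidth bound, that a layer split makes the GPUs work one after another, and that the cards reach their published peak bandwidths (about 936 GB/s for the RTX 3090 and 360 GB/s for the 12 GB RTX 3060). The 17.5 GiB weight size is borrowed from the VRAM profile in the benchmark post, and the 12 / 5.5 GiB split is hypothetical. Inter-GPU transfer is ignored, so real dual-GPU numbers would likely be worse.

```python
GIB = 1024 ** 3

# Published peak memory bandwidth in bytes per second (spec-sheet values).
BANDWIDTH = {"rtx3090": 936e9, "rtx3060": 360e9}


def seconds_per_token(placement):
    """placement maps a device name to the GiB of weights it reads per token.

    With a layer split the devices run sequentially, so the times add up.
    """
    return sum(gib * GIB / BANDWIDTH[dev] for dev, gib in placement.items())


single = seconds_per_token({"rtx3090": 17.5})
dual = seconds_per_token({"rtx3090": 12.0, "rtx3060": 5.5})  # hypothetical split

print(f"single 3090 : {1 / single:.1f} tok/s upper bound")
print(f"3090 + 3060 : {1 / dual:.1f} tok/s upper bound")
```

Under these assumptions every gibibyte moved to the slower card costs about 2.6x as long to read, which is the mismatch the post describes.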

Sidharth@sid250581·18h

6.85x faster generation on Gemma 4 31B: same RTX 3090, same context window.

Atomic llama-server with MTP heads:
> 47.51 tok/s generated
> 327 tok/s prompt processing

Homebrew llama-server, no MTP:
> 6.94 tok/s
> partial CPU offload

Key differences that actually matter:
> MTP (Multi-Token Prediction) heads enabled vs disabled
> TurboQuant KV cache fits everything on-device at 131k context
> Atomic build: -ngl 999, no CPU spillover
> Homebrew: forced -ngl 45

VRAM profile at 131k ctx:
Model: ~17.5 GiB
Context: ~2.2 GiB
Compute: ~0.5 GiB
Free: ~3.1 GiB headroom → enough to push to 262k context

🔴 Tested 262144 context on the 3090. It holds at 40-45 tok/s with -b 1024 -ub 256. That's 262k tokens of active context on consumer hardware, with little degradation in generation speed.

If VRAM is tight at 262k, ladder down: reduce batch buffers first (-b 512 -ub 128), then try turbo2 KV cache, then pull a few layers to RAM. Avoid touching --parallel; keep it at 1.

The 3090 is genuinely underrated for 31B inference if you're running the right stack.

Anyone else benchmarking MTP vs non-MTP on llama.cpp builds? Curious if the gap holds on 4090s or if it closes.
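The headline numbers can be checked with a few lines of arithmetic. This sketch only restates figures from the post; the variable names are illustrative.

```python
# Generation throughput reported in the post (tokens per second).
mtp_tok_s = 47.51       # Atomic llama-server, MTP heads enabled
baseline_tok_s = 6.94   # Homebrew llama-server, no MTP, partial CPU offload

speedup = mtp_tok_s / baseline_tok_s
print(f"speedup: {speedup:.2f}x")  # speedup: 6.85x

# VRAM profile at 131k context, in GiB, as listed in the post.
vram_gib = {"model": 17.5, "context": 2.2, "compute": 0.5, "free": 3.1}
in_use = vram_gib["model"] + vram_gib["context"] + vram_gib["compute"]
total = in_use + vram_gib["free"]
print(f"in use: {in_use:.1f} GiB of {total:.1f} GiB visible")  # in use: 20.2 GiB of 23.3 GiB visible
```

The listed items sum to about 23.3 GiB, a little under the card's nominal 24 GB, which is consistent with some memory being held outside the server process.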

Sidharth@sid250581·1d

@RafaelNegronX @ThePrimeagen @bootdotdev No, it's @kirat_tw bhai's course

Rafael Negron@RafaelNegronX·1d

@sid250581 @ThePrimeagen Are you taking the course from @bootdotdev ?

Sidharth@sid250581·1d

Day 2 of the DevOps course. 20+ bash commands down today. Also touched Vim for the first time. Spent the first 10 minutes figuring out how to exit that thing; it needs muscle memory to do it fast. But how often do you guys actually use Vim? I have seen @ThePrimeagen use it.

Sidharth@sid250581·1d

@gregmushen @ThePrimeagen Damn, 29 years?? Okay, you convinced me my struggles are nothing. Thanks for the motivation.

Greg Mushen@gregmushen·1d

@sid250581 @ThePrimeagen Vim is the best once you get used to using it. I’ve been using it for 29 years.

Sidharth@sid250581·1d

@ThePrimeagen That’s basically your entire adult life in Vim

ThePrimeagen@ThePrimeagen·1d

@sid250581 I have used vim probably 6000/6500 days

Sidharth@sid250581·1d

🚀 Just merged my PR into @NVIDIAHealth's NV-Generate-CTMR! Fixed an AttributeError raised when `cfg_guidance_scale` was missing from GPU inference configs (e.g. the 16G/24G presets). Paired CT inference now runs on limited VRAM without crashing: the value defaults safely to 0.0 while keeping full compatibility → github.com/NVIDIA-Medtech…
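The general pattern behind a fix like this is to read an optional config field with a safe default instead of accessing it directly. A minimal sketch, not the code from the PR: the helper name, the `num_inference_steps` field, and the 1.5 value are hypothetical, and only `cfg_guidance_scale` and the 0.0 default come from the post.

```python
from types import SimpleNamespace


def resolve_cfg_guidance_scale(args, default=0.0):
    """Return args.cfg_guidance_scale, or the default if the preset omits it."""
    return getattr(args, "cfg_guidance_scale", default)


# Hypothetical preset that omits the field, like the 16G/24G configs in the post.
preset_16g = SimpleNamespace(num_inference_steps=30)
# Hypothetical full config that sets it explicitly.
full_config = SimpleNamespace(num_inference_steps=1000, cfg_guidance_scale=1.5)

print(resolve_cfg_guidance_scale(preset_16g))   # 0.0
print(resolve_cfg_guidance_scale(full_config))  # 1.5
```

Configs that already set the field keep their value, which is what "keeping full compatibility" amounts to in this sketch.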

Sidharth@sid250581·2d

@NVIDIAHealth's NV-Generate-CTMR just gave me my first synthetic chest CT, generated on my RTX 3090 in under 30 steps.

Not a real patient scan. Fully synthetic. 256³ volume, 1.5×1.5×2.0 mm spacing, paired segmentation mask included.

Config: anatomy_list ["lung tumor"], so the pipeline pulled a real training mask with a tumor seed.

The cool part is that it uses:
- Autoencoder (VAE)
- Diffusion U-Net
- ControlNet
- Mask Generation Autoencoder
- Mask Generation Diffusion U-Net

yet it only consumes 15 GB of VRAM at peak, which is very memory-efficient.
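For a sense of scale, the volume geometry in the post works out as follows. The shape and spacing are from the post; storing the volume as float32 is an assumption made only to estimate its raw size.

```python
shape = (256, 256, 256)        # voxels per axis, from the post
spacing_mm = (1.5, 1.5, 2.0)   # voxel spacing in millimetres, from the post

# Physical field of view along each axis.
extent_mm = tuple(n * s for n, s in zip(shape, spacing_mm))
print(extent_mm)  # (384.0, 384.0, 512.0)

voxels = shape[0] * shape[1] * shape[2]
print(voxels)  # 16777216

# Assumption: 4 bytes per voxel (float32).
size_mib = voxels * 4 / 1024 ** 2
print(size_mib)  # 64.0
```

The output volume itself is small, so under this assumption nearly all of the 15 GB peak comes from model weights and intermediate activations.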

Sidharth@sid250581

Running NV-Generate-CTMR on my RTX 3090 right now.
> Downloading the rflow-ct weights, targeting lung tumor generation.
> The 16g config drops inference from 1000 steps to 30. That's the difference.
Testing if this runs clean on 24GB VRAM today. Will post the actual output.

Sidharth@sid250581·2d

NVIDIA Healthcare@NVIDIAHealth

(1/2) 🚨 Data scarcity is the #1 blocker in medical imaging AI. We built the open-source fix. NV-Generate-CTMR synthesizes realistic 3D CT & MRI volumes at scale - with paired segmentation masks - so you can train more robust models without touching real patient data.

Sidharth@sid250581·2d

Day 1 of the DevOps course. Learned what Kubernetes actually is.
> installed kind + kubectl on Ubuntu
> spun up a local cluster
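For anyone following along, spinning up a local cluster comes down to a handful of commands. A small sketch that prints them and, optionally, runs them; it assumes `kind` and `kubectl` are already installed and on PATH, and the cluster name `devops-day1` is a placeholder, not something from the post.

```python
import subprocess


def kind_cluster_commands(name="devops-day1"):
    """Commands to create a local kind cluster and confirm that it is up."""
    return [
        ["kind", "create", "cluster", "--name", name],
        # kind registers the kubeconfig context as "kind-<cluster name>".
        ["kubectl", "cluster-info", "--context", f"kind-{name}"],
        ["kubectl", "get", "nodes"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(kind_cluster_commands())  # dry run: prints the commands without executing them
```

Calling `run(kind_cluster_commands(), dry_run=False)` executes them for real and requires Docker to be running, since kind runs each node as a container.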

Sidharth@sid250581

Just enrolled in Harkirat's DevOps course. Going to document everything I learn here: every concept, every command, and the stuff that breaks. Follow along if you're on the same path.

Sidharth@sid250581·2d

Just enrolled in Harkirat's DevOps course. Going to document everything I learn here: every concept, every command, and the stuff that breaks. Follow along if you're on the same path.

Sidharth@sid250581·2d

Thanks for asking. When you get to the root of medical notes, they are messy: long-range documents, nested entities, etc. The bidirectional attention over the input helps the model better understand the full context. It's also tiny enough to deploy on-device (1B). Mostly it was curiosity to fine-tune it first, especially in the medical domain. That's it.
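The "bidirectional attention over the input" idea can be shown with a toy attention mask. A minimal sketch of a generic prefix-LM mask, not HRM's actual implementation: the function name and the example lengths are illustrative.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Build a prefix-LM attention mask.

    mask[i][j] is True when position i may attend to position j.
    Every position sees the whole prefix (the input); positions after
    the prefix additionally see earlier generated positions, causally.
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = prefix_lm_mask(seq_len=4, prefix_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a 2-token input and 2 generated tokens, the first two rows are identical: each input token sees the whole input, which a purely causal mask would not allow.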

Pranav@Pranav_ai·3d

@sid250581 @Sapient_Int Just a doubt, why the HRM for this usecase?

Sidharth@sid250581·3d

the only ubuntu user among mac users. this is how i got into mlx india chennai.

fine-tuned @Sapient_Int HRM-Text-1B on nvidia/Nemotron-PII for medical PII de-identification. as far as i can tell, nobody has fine-tuned this model before.

what made it non-trivial: HRM is a PrefixLM, not a standard causal LM. the checkpoint ships with legacy fused weights (gqkv_proj, gate_up_proj) that don't map to the current transformers implementation. had to write a custom conversion layer to get zero missing weights.

next, i will be publishing the full code and a step-by-step guide.

shoutout to @sabeshbharathi for organizing this amazing event
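Until the full code is published, here is the general shape of a fused-weight conversion, as a sketch under loud assumptions. The post does not describe the checkpoint layout, so the equal gate/up halves, the `.weight` suffix, and the target names `gate_proj` and `up_proj` are all hypothetical. `gqkv_proj` is left untouched because its layout is unknown here. Weights are plain lists of rows to keep the example dependency-free.

```python
def split_rows(fused, sizes):
    """Split a fused weight (a list of rows) into consecutive row blocks."""
    if sum(sizes) != len(fused):
        raise ValueError("sizes must add up to the number of fused rows")
    blocks, start = [], 0
    for size in sizes:
        blocks.append(fused[start:start + size])
        start += size
    return blocks


def convert_fused(state):
    """Rename and split fused MLP weights; pass everything else through."""
    suffix = "gate_up_proj.weight"
    out = {}
    for name, weight in state.items():
        if name.endswith(suffix):
            half = len(weight) // 2  # assumption: gate rows first, then up rows
            gate, up = split_rows(weight, [half, len(weight) - half])
            prefix = name[: -len(suffix)]
            out[prefix + "gate_proj.weight"] = gate
            out[prefix + "up_proj.weight"] = up
        else:
            out[name] = weight
    return out


def check_keys(converted, expected_names):
    """Return (missing, unexpected) parameter names after conversion."""
    got, want = set(converted), set(expected_names)
    return want - got, got - want
```

The `check_keys` step mirrors the "zero missing weights" goal: a conversion is only done when both returned sets are empty for the target model's parameter names.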

Sidharth@sid250581·6d

i am sidharth, from tamil nadu.

over a year building AI for health: voice-based TB screening, on-device oral cancer detection, knowledge graphs for traditional spinal medicine.

Won BioxAI ($15K). Google's MedGemma 2nd place ($5K).

> back to learning
> back to building
> same mission

Sidharth@sid250581·12 May

@ariG23498 You can do it. All the best!

Aritra 🤗@ariG23498·12 May

I try to blog my way into a new project. This time around I am deep diving into a vast project. The output artifacts could be a video, a blog post, and many small posts. This would be about profiling, kernels, language model deployments, and much more. I hope to make the best use of my time and create some content that gets people going! Wish me luck. 🤗
