Greptile
@greptile
AI agents that review and test PRs. Trusted by Nvidia, Coinbase, Scale, Brex, Substack, Whoop, and 9000+ others.

launching 5 things:
1. multi-repo context support
2. rebuilt web app for super large orgs
3. integrations with claude/codex/devin
4. .greptile/rules files
5. rebuilt learning so greptile maintains internal docs about your company

Running an LLM efficiently is basically a fight against wasted computation, and padding is one of the biggest offenders. GPUs love big batches, but there's a catch: every sequence in a batch has to be the same length. So if you naively throw, say, 64 requests together, and one of them is 2,000 tokens while the other 63 are only 100, the GPU pads all 63 short requests up to 2,000. Congratulations, you've just turned your expensive H100 into a very enthusiastic space heater. Most of that computation is pure waste; you're burning cycles on empty tokens.

The fix is length binning. Instead of batching randomly, you group requests of similar lengths together: 100-to-120-token requests in one bucket, 500 to 600 in another, 1,800 to 2,000 in a third. Now the padding overhead stays tiny, GPU utilization jumps, and throughput improves dramatically.

In production, you usually add a tiny buffer window, say 50ms, so the router has enough time to collect incoming requests and place them into the right bins before dispatching. It's a classic systems trick: a microscopic increase in queueing latency for a massive gain in overall efficiency. At scale, this can easily mean serving 2 to 5× more tokens per GPU. Turns out, a lot of "AI infrastructure" is really just being extremely good at organizing lines.
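
Here's a minimal sketch of what that binning router might look like. The bucket boundaries, the (request_id, num_tokens) queue format, and the 50ms window below are illustrative assumptions, not any particular serving framework's API:

```python
import queue
import time
from collections import defaultdict

# Illustrative bucket boundaries (padded length in tokens); real deployments
# tune these to their observed request-length distribution.
BUCKETS = [128, 256, 512, 1024, 2048]

def bucket_for(num_tokens: int) -> int:
    """Return the padded length for a request: the smallest bucket it fits in."""
    for b in BUCKETS:
        if num_tokens <= b:
            return b
    return BUCKETS[-1]  # oversized requests fall into the largest bucket

def collect_batches(request_queue, window_ms=50, max_batch=64):
    """Drain the queue for window_ms, grouping requests into length bins.

    Assumes request_queue holds (request_id, num_tokens) tuples. Yields
    (padded_length, batch) pairs; every sequence in a batch pads only to
    its bucket boundary instead of to the global max.
    """
    bins = defaultdict(list)
    deadline = time.monotonic() + window_ms / 1000.0
    while time.monotonic() < deadline:
        try:
            req = request_queue.get(timeout=0.005)
        except queue.Empty:
            continue
        b = bucket_for(req[1])
        bins[b].append(req)
        if len(bins[b]) >= max_batch:
            yield b, bins.pop(b)  # dispatch a full bin immediately
    for b, batch in bins.items():  # flush whatever is left when the window closes
        yield b, batch

# Usage: the 2,000-token outlier lands in its own bin, so the ~100-token
# requests pad to 128 instead of 2,048.
q = queue.Queue()
for req in [(0, 100), (1, 110), (2, 2000), (3, 95)]:
    q.put(req)
for padded_len, batch in collect_batches(q):
    print(f"pad to {padded_len}: {batch}")
```

In a real router you'd typically also cap per-bin wait time so a lone request in a sparse bucket doesn't sit around, but the core trick is just this grouping.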
