Festus

3.7K posts

@_enfinity

10X Builder | AI Performance Engineer | Co-founded @mybridgecard (YC S22)

Making GPUs Go Brrr · Joined July 2018
1.5K Following · 476 Followers
Pinned Tweet
Festus@_enfinity·
I recently got a PR merged into @nvidia's Dynamo (A Datacenter Scale Distributed Inference Serving Framework), and the process was a great A/B test of KV-cache data transfer over TCP vs. over RDMA. PR link: github.com/ai-dynamo/dyna… 🧵
Festus@_enfinity·
The goal for Capacitor is simple: find capacity, compare providers, track patterns, and eventually reserve GPUs and run workloads. If you use GPUs, I’d love feedback. What would make this genuinely useful in your workflow?
Festus@_enfinity·
It currently supports Vast.ai, Lambda Cloud, and Runpod, with cross-provider search from one terminal command:

cap watch --providers vast,lambda,runpod --gpu H100 --max-price 9 --once

I also created @gpucapacitor, which will post GPU availability signals, rare capacity sightings, price notes, and general GPU market observations.
Festus@_enfinity·
I work with GPUs from time to time, whether it’s benchmarking with @vllm_project, @sglang, or @nvidia Dynamo, or doing kernel optimization work for open-source models. Again and again, I find myself asking the same question as this tweet: x.com/snwy_me/status… So I built Capacitor: github.com/Ayobami-00/cap…, an open-source Rust CLI for watching scarce GPU capacity across cloud GPU providers.
snwy@snwy_me

where the fuck are all of the GPUs going?!? i need literally one 8xH100 node and i cannot for the life of me get one ANYWHERE

Festus@_enfinity·
Whaoooo! Love it!
Dwarkesh Patel@dwarkesh_sp

Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard.

0:00:00 – How batch size affects token cost and speed
0:31:59 – How MoE models are laid out across GPU racks
0:47:02 – How pipeline parallelism spreads model layers across racks
1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.”
1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
1:32:52 – Deducing long context memory costs from API pricing
2:03:52 – Convergent evolution between neural nets and cryptography

Festus@_enfinity·
While we used TinyLlama as the example here, this "package" structure is the de facto standard for almost every model on the Hugging Face Hub, from 1B parameters to 70B+. Next time you see those "boring" JSONs, remember they’re the glue holding the AI together.
Festus@_enfinity·
End to end, this is how inference engines use these files. When you run the from_pretrained snippet, the engine reads config.json to build the model's physical skeleton before pinning the safetensors weights into your VRAM. Simultaneously, it loads the tokenizer to translate your words into math and follows the generation_config to decide exactly how to behave and when to stop talking.
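A minimal sketch of that load path with the transformers API (the repo name comes from this thread; the fp16 dtype and "cuda" placement are illustrative assumptions, not requirements):

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# config.json -> the model's skeleton: layer count, hidden size, attention heads
config = AutoConfig.from_pretrained(repo_id)

# *.safetensors -> the actual weights, materialized into VRAM
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda")

# tokenizer files -> your text in, token IDs out
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# generation_config.json -> default decoding behavior: sampling settings, EOS/stop tokens
gen_config = GenerationConfig.from_pretrained(repo_id)

inputs = tokenizer("What do these repo files actually do?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))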
Festus@_enfinity·
You’re on @huggingface, you find an open-source model, and you hit the "Files and versions" tab. Instead of one clean app file, you see a list of JSONs and something called safetensors. Most people just copy the AutoModel.from_pretrained snippet and move on, but what are these files actually doing? 🧵 We're going to use the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model as a motivating example. You can find the link to the Hugging Face repo below: huggingface.co/TinyLlama/Tiny…
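A quick way to see that "package" for yourself is to list the repo contents without downloading anything. A minimal sketch using the huggingface_hub client (the exact file list varies from model to model):

from huggingface_hub import list_repo_files

# Enumerate every file in the TinyLlama repo without pulling the weights
repo_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
for name in sorted(list_repo_files(repo_id)):
    print(name)

# Typical roles of the files you will see:
#   config.json             - architecture: layer count, hidden size, attention heads
#   generation_config.json  - default decoding settings (EOS token, sampling params)
#   model.safetensors       - the actual weights
#   tokenizer.json, tokenizer_config.json, special_tokens_map.json - the tokenizer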
Festus@_enfinity·
I hit a subtle RDMA bug while working on a disaggregated @vllm_project recipe for @nvidia Dynamo on bare-metal Kubernetes. The PR behind this is now merged into NVIDIA Dynamo as #7915 (github.com/ai-dynamo/dyna…).

Here was the bug: the pod could see /dev/infiniband, but RDMA still failed during UCX bootstrap. The root error was ibv_create_ah failing with "No such device". The issue was not UCX tuning; it was Kubernetes network-namespace reachability. The RDMA device was exposed, but the RoCE net_device needed for GID resolution was not visible inside the pod. The fix was attaching a secondary interface with Multus + macvlan.

The results from the same AIPerf workload and deployment topology, with only the KV-cache transport path changed between the TCP and RDMA runs, are shown below:
TTFT: 6784.95 ms → 795.60 ms
Throughput: 307.31 tok/s → 567.79 tok/s

I wrote up the full debugging story, root cause, fix, and production trade-offs in this article: medium.com/@owumifestus/d…
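For context on the root cause, here is a rough diagnostic sketch (not from the PR) that checks, from inside a pod, whether each RoCE GID's net_device is actually visible in the pod's network namespace. The sysfs paths are the standard kernel RDMA locations; device and interface names will differ per cluster:

import os

IB_ROOT = "/sys/class/infiniband"   # RDMA devices exposed to this pod
NET_ROOT = "/sys/class/net"         # net_devices visible in this network namespace

visible_netdevs = set(os.listdir(NET_ROOT))

for dev in sorted(os.listdir(IB_ROOT)):
    ports_dir = os.path.join(IB_ROOT, dev, "ports")
    for port in sorted(os.listdir(ports_dir)):
        ndevs_dir = os.path.join(ports_dir, port, "gid_attrs", "ndevs")
        for gid_index in sorted(os.listdir(ndevs_dir), key=int):
            try:
                ndev = open(os.path.join(ndevs_dir, gid_index)).read().strip()
            except OSError:
                continue  # GID entry not populated
            status = "OK" if ndev in visible_netdevs else "MISSING in pod netns"
            print(f"{dev} port {port} gid {gid_index}: {ndev} -> {status}")

If the net_device backing a GID is missing from the pod's namespace, GID resolution (and therefore ibv_create_ah) fails even though the RDMA device node itself is present, which is exactly the failure mode described above.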
Festus@_enfinity·
This is soooooooooo cool!
Aksel@akseljoonas

Introducing ml-intern, the agent that just automated the post-training team @huggingface

It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things: We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.

In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.

For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.

How it works? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains

ml-intern deeply embodies how researchers work and think. It knows how data should look like and what good models feel like.

Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…

And the best part? We also provisioned 1k$ GPU resources and Anthropic credits for the quickest among you to use.

Festus@_enfinity·
They created a solution called Unweight that reconstructs the weights directly in on-chip memory and feeds them straight to the tensor cores. It reduces HBM traffic by storing the weights compressed in HBM and reconstructing them in on-chip memory right before the matmul. Such brilliant engineering!
Festus@_enfinity·
That's not all. Most LLM workloads are memory-bound, not compute-bound, and specifically on the H100, which they mention, the bandwidth from HBM to on-chip memory cannot keep the raw compute saturated. Now, if you compress these weights and store them in HBM, they still have to be decompressed somewhere, and doing that before the transfer to on-chip shared memory would just eat back into your decode time.
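Rough back-of-the-envelope numbers for that memory-bound claim, using approximate H100 SXM specs (treat the figures as ballpark, not datasheet-exact):

# Approximate H100 SXM figures (rounded)
hbm_bandwidth_bytes_s = 3.35e12   # ~3.35 TB/s HBM3 bandwidth
bf16_flops_s          = 9.9e14    # ~990 TFLOP/s dense BF16

# FLOPs the tensor cores could retire per byte fetched from HBM
machine_balance = bf16_flops_s / hbm_bandwidth_bytes_s
print(f"machine balance ~ {machine_balance:.0f} FLOPs per byte")   # ~296

# Batch-1 decode is essentially GEMV: each 2-byte (fp16) weight is read once
# and used for one multiply-add (~2 FLOPs), i.e. ~1 FLOP per byte.
decode_intensity = 2 / 2
print(f"decode arithmetic intensity ~ {decode_intensity:.0f} FLOP per byte")

# Decode sits far below the machine balance, so HBM bandwidth sets the pace.
# Moving 15-22% fewer weight bytes translates almost directly into faster
# decode, provided the decompression happens on-chip rather than back in HBM.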
Festus@_enfinity·
This blog post by Cloudflare is one of the most interesting things I have read today: blog.cloudflare.com/unweight-tenso… You know how, when you need an LLM to fit in memory-constrained scenarios, one of the first things you reach for is quantization? It reduces the precision of the model weights but can significantly affect the model's output accuracy. They found a clever trick that let them cut LLM weights by 15-22% without affecting accuracy.