@qnguyen3 @arcee_ai That’s awesome - I’ve been trying to get Brian and Jacob together to talk @tensorwave and see if we can collaborate. Maybe we can meet once you get there.
Llama 3.1 405B could be the catalyst for much greater #AMD adoption for AI inference 📈
@AMD's MI300X may be uniquely suited to cost-effective Llama 3.1 405B inference. With 192GB of memory per GPU, a single 8xMI300X node (1,536GB total) can serve Llama 3.1 405B in its native FP16 precision - whereas two 8xH100 nodes are required on @nvidia.
As we have previously covered, a single NVIDIA 8xH100 node has only 640GB of memory - not enough to hold Llama 3.1 405B’s full 810GB of FP16 weights at once. Providers are therefore forced to deploy two interconnected 8xH100 nodes to serve 405B in FP16, accepting a significant cost and complexity penalty.
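The arithmetic is simple enough to check in a few lines (a sketch; the parameter count and per-GPU memory figures are as stated above, and KV cache and activation overhead are ignored, which only makes the single-node H100 picture worse):

```python
# Back-of-the-envelope memory math for serving Llama 3.1 405B in FP16.
# Ignores KV cache, activations, and framework overhead.

PARAMS = 405e9            # Llama 3.1 405B parameter count
BYTES_FP16 = 2            # FP16 stores one parameter in 2 bytes

weights_gb = PARAMS * BYTES_FP16 / 1e9   # ~810 GB of weights

mi300x_node_gb = 8 * 192  # 8x MI300X -> 1536 GB of HBM
h100_node_gb = 8 * 80     # 8x H100 (80GB SXM) -> 640 GB of HBM

print(f"FP16 weights: {weights_gb:.0f} GB")
print(f"Fits in one 8xMI300X node ({mi300x_node_gb} GB)? {weights_gb <= mi300x_node_gb}")
print(f"Fits in one 8xH100 node ({h100_node_gb} GB)?   {weights_gb <= h100_node_gb}")
```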
Nvidia’s future H200 and B100 come with 141GB and 192GB of high-bandwidth memory respectively - but unlike those, the AMD MI300X is available now. @LisaSu noted on AMD’s Q2 earnings call that AMD was demand-limited on MI300X for the remainder of 2024. Will Llama 3.1 405B alone flip that narrative?
We are starting to see adoption and support increase. Both @FireworksAI_HQ and @LeptonAI are hosting Llama 3.1 405B on AMD MI300X chips, and they stand out as the lowest-cost providers of Llama 3.1 405B. However, it is important to note that they are serving the model at FP8 and INT8 precision, respectively.
Furthermore, projects like GPU.cpp from @answerdotai (@jeremyphoward, @austinvhuang) are making it easier than ever to write and run portable code across different chip (hardware & software) architectures - reducing CUDA lock-in.
What is your view? Long #AMD?
Researchers have introduced a new method to speed up long-context inference in LLMs. Their adaptive structured sparse attention mechanism reduces Time-to-First-Token latency without affecting accuracy or needing extra training. Learn more: buff.ly/4bR7NYI #LLMs #research
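For readers new to the idea, here is a minimal, generic sketch of structured (block-)sparse attention in Python - purely illustrative, not the paper’s actual adaptive mechanism; the block size, the mean-pooled block scoring, and the keep_ratio heuristic are all assumptions of mine:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """Generic block-sparse attention sketch: cheaply score each block of
    the attention matrix, keep only the highest-scoring key blocks per
    query block, and compute full attention inside kept blocks only."""
    n, d = q.shape
    assert n % block == 0, "sketch assumes sequence length divisible by block"
    nb = n // block
    # Cheap proxy score per (query-block, key-block) pair via mean pooling
    qb = q.reshape(nb, block, d).mean(axis=1)          # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)          # (nb, d)
    block_scores = qb @ kb.T                           # (nb, nb)
    # Keep the top key blocks per query block (the structured sparsity pattern)
    k_keep = max(1, int(keep_ratio * nb))
    keep = np.argsort(block_scores, axis=1)[:, -k_keep:]
    out = np.zeros_like(q)
    for i in range(nb):
        qs = q[i*block:(i+1)*block]
        ks = np.concatenate([k[j*block:(j+1)*block] for j in keep[i]])
        vs = np.concatenate([v[j*block:(j+1)*block] for j in keep[i]])
        attn = qs @ ks.T / np.sqrt(d)
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        out[i*block:(i+1)*block] = attn @ vs
    return out
```

The point is that only a fraction of the n×n score matrix is ever computed during prefill - which is where Time-to-First-Token is won or lost at long context.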
Milestone Unlocked 🚀 We have achieved FP8 on @AMD's MI300X.
Discover the implications for AI workloads in our latest blog post.
Click to learn more! 👉 buff.ly/3zNX7gp
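For context on what the FP8 milestone buys, here’s a minimal PyTorch sketch of the storage side of FP8 quantization - an illustration only, not TensorWave’s MI300X kernel work; it assumes a PyTorch build (≥2.1) that exposes the float8_e4m3fn dtype:

```python
import torch

# FP8 (e4m3) stores a value in 1 byte vs 2 for FP16: half the memory
# and bandwidth per weight, at the cost of reduced precision/range.
w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)

# Scale into FP8's representable range before casting (e4m3 max ~448)
scale = w_fp16.abs().max().float() / 448.0
w_fp8 = (w_fp16.float() / scale).to(torch.float8_e4m3fn)

print(w_fp16.element_size())   # 2 bytes per weight
print(w_fp8.element_size())    # 1 byte per weight

# Dequantize to inspect error (real FP8 inference uses scaled FP8 matmuls)
w_back = w_fp8.float() * scale
print((w_back - w_fp16.float()).abs().max())  # quantization error
```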
Running out of #compute credits? If you are a startup, let’s talk. We are building something special and welcome your feedback! @tensorwave #compute #ai #GPU
New achievement unlocked! 🔐
Thanks to our friends over at @mkoneai & @Gradient_AI_, we’ve cracked the code on real-time chat with a 1M-token context window using Llama 70B!
Which is cool by itself - but it’s our MI300Xs, with their massive 192GB of memory per card, that make it possible. These accelerators are pivotal for running long-context models efficiently, allowing model parameters and large context caches to be stored on fewer cards. 😎
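Rough KV-cache math shows why per-card capacity matters so much at 1M tokens (a sketch using commonly cited Llama 70B architecture figures - 80 layers, 8 grouped-query KV heads, head dim 128 - real serving adds overhead on top):

```python
# Rough KV-cache size for Llama 70B at a 1M-token context, FP16 cache.
layers = 80        # transformer layers in Llama 70B
kv_heads = 8       # grouped-query attention: 8 KV heads
head_dim = 128     # dimension per head
bytes_fp16 = 2
ctx = 1_000_000    # 1M-token context window

# 2x for the K and V tensors
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # bytes per token
cache_gb = per_token * ctx / 1e9

print(f"KV cache per token: {per_token/1024:.0f} KiB")      # ~320 KiB
print(f"KV cache at 1M tokens: {cache_gb:.0f} GB")          # ~328 GB
print(f"192GB cards needed just for the cache: {cache_gb/192:.1f}")
```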
Also, I’m pretty sure the only other player tackling this is Google with Gemini 1.5 Pro.
However, Google's impressive models come with significant limitations:
❌ Feature Gaps: No real-time chat with cached context
❌ Limited Customization: Minimal fine-tuning capabilities
❌ Scalability Constraints: API rate limits restrict large-scale deployments
❌ Cost Inefficiency: High expenses for long context token usage
🔗 Read the full report here: lnkd.in/gZSiyiB9
Just dropped my first blog post (out of three) on getting started with AMD ROCm for AI! 🚀
Thanks to @tensorwave and @cognitivecompai for hooking me up with some sweet MI300X GPUs. If you're curious about AMD's answer to CUDA, check it out.
open.substack.com/pub/qnguyen3/p…
Alex Rodriguez asked a question. Reggie Jackson answered it.
(Shouts to the producer and rest of the desk for staying out of Reggie’s way and just letting him talk. I doubt they expected this answer. But it’s a great few minutes of television.)
@Yuhu_ai_ @elonmusk Before you go build on H100s, let’s talk about MI300X and our capacity. Better performance and better cost. Let’s grab coffee in SF next week - we are hosting an event on the future of compute. Worth a quick chat? Hope so.
A TensorWave Report: AMD’s MI300X Outperforms NVIDIA’s H100 for LLM Inference
There has been much anticipation around AMD’s flagship MI300X accelerator. Its raw specs are unmatched, but the pressing question remains: can it outperform NVIDIA’s Hopper architecture in real-world AI workloads? We have some exciting early results to share.
Read the full article here: blog.tensorwave.com/amds-mi300x-ou…
Can someone explain what @realGeorgeHotz did here?
MLPerf is a benchmark suite used to evaluate the training and inference performance of on-premises and cloud platforms - usually run on Nvidia GPUs, right?
So he got an AMD GPU running a benchmark that has been Nvidia territory, meaning he ported the Nvidia-oriented code to an AMD GPU - and if he keeps going, that’s Nvidia’s dominance over, right?