Phil Howes
@saltyph

28 posts

building https://t.co/aUjKNzIyMT

oakland, arrakis · Joined May 2013
438 Following · 167 Followers
Phil Howes retweeted
Baseten @baseten
We've launched the fastest GLM 5 API available at 190 TPS and 0.79 sec TTFT with the Baseten Inference Stack. Ready for your coding and agentic workflows. baseten.co/blog/how-we-bu…
16 replies · 8 reposts · 104 likes · 19K views
Phil Howes @saltyph
so much potential in this model and @aqaderb coming out of the gates just ripping the landscape on perf
Baseten @baseten

It’s Monday, and we could all use a little help thinking. Thankfully we have the new Kimi K2 Thinking to do it for us. Kimi K2 Thinking is now live in our Model APIs with the most performant TTFT (0.3 sec) and TPS (140) on @openrouter & @ArtificialAnlys. If you’re looking for an alternative to GPT-5, do a lot of coding, or are building agentic AI, you *need* to give this model a try. Congrats @Kimi_Moonshot, you all are astounding. Get access in the comments ➡️

0 replies · 1 repost · 3 likes · 206 views
Phil Howes @saltyph
speculation, in this case eagle-3, remains one of the biggest levers to go from good to great. amazing job leapfrogging the market and getting the most out of our GPUs
Baseten @baseten

This week, Baseten's model performance team unlocked the fastest TPS and TTFT for gpt-oss 120b on @nvidia hardware. When gpt-oss launched we sprinted to offer it at 450 TPS... now we've exceeded 650 TPS and 0.11 sec TTFT, and we'll keep raising the bar. We are proud to offer the best E2E latency available, with near-limitless scale, incredible performance, and the highest uptime (99.99%).

0 replies · 0 reposts · 1 like · 113 views
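"Speculation" here is speculative decoding: a cheap draft model proposes a few tokens and the large target model verifies them all in a single forward pass, so a good speculator turns several expensive decode steps into one. Below is a minimal greedy sketch of that core loop, assuming `draft` and `target` are Hugging Face-style causal LMs; this is the generic technique, not EAGLE-3 itself and not Baseten's implementation, and all names are illustrative:

```python
import torch

@torch.no_grad()
def greedy_speculative_step(draft, target, tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Propose k tokens with the small draft model, verify with the big target model.

    Greedy variant: a drafted token is accepted iff it matches the target's argmax.
    `tokens` is a 1-D LongTensor of input ids; returns the extended sequence.
    """
    # 1) draft k tokens autoregressively (cheap: small model, k forward passes)
    seq = tokens
    proposed = []
    for _ in range(k):
        next_tok = draft(seq.unsqueeze(0)).logits[0, -1].argmax()
        proposed.append(next_tok)
        seq = torch.cat([seq, next_tok.view(1)])

    # 2) verify all k proposals with ONE target forward pass (the big model runs once)
    t_logits = target(seq.unsqueeze(0)).logits[0]

    # 3) accept the longest prefix where the target agrees with the draft;
    #    on the first disagreement, take the target's token instead and stop
    out = tokens
    for i, tok in enumerate(proposed):
        expected = t_logits[len(tokens) + i - 1].argmax()
        if tok != expected:
            out = torch.cat([out, expected.view(1)])
            break
        out = torch.cat([out, tok.view(1)])
    return out
```

In the worst case one target pass still yields one new token (the plain autoregressive rate); in the best case it yields k, which is where the TPS wins come from.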
Phil Howes @saltyph
@jxmnop if you read this and still want to learn cuda anyway, we’re hiring for this at @baseten to get more brrrr/dollar. dms open
0 replies · 0 reposts · 7 likes · 405 views
dr. jack morris @jxmnop
in 2025, if you want to become a successful AI engineer or researcher, you should NOT learn CUDA

furthermore – i'd guess that 80% of successful ML researchers have never written a CUDA kernel

practical ML is about training models and using them to make predictions. this has nothing to do with CUDA

CUDA is necessary in two cases:
(a) you are developing a radically new model that isn't easily expressible in PyTorch or Jax (i.e. Mamba)
(b) you are running into performance bottlenecks from current CUDA code and need to make it faster

i doubt that either case applies to you. chances are you aren't building the next Mamba, and the bottlenecks you'll run into in practice are different

you should work on finding the right data, or hardware, or setting things up properly, or distributing efficiently across hardware, or researching new efficient ways to run models that other people are working on (like vLLM and SGLang)

or better than that, work on your eval pipeline. find ways to measure your model's performance that are more realistic, comprehensive, efficient, fair, etc.

TLDR: want to learn? spend your time tinkering with models in PyTorch and Jax, not writing matrix multiplications
63 replies · 82 reposts · 1.5K likes · 322.9K views
Phil Howes @saltyph
hit new peak demand today, 3 million RPS. thanks for stress testing our infra, anon internet friend
0 replies · 0 reposts · 2 likes · 91 views
Phil Howes retweeted
abu @aqaderb
2 things.
1. i have loved working on this team. model performance is so much fun and so rewarding.
2. persistence is key. we started working on model performance end of 2023 and watching us slowly become better and better has been an incredible experience.
Baseten @baseten

fast!

1 reply · 3 reposts · 20 likes · 1.9K views
sarah guo @saranormous
My 6yo daughter is really into archaeology so I’ve been learning — I get more excited about ancient civilizations than about dinosaurs, and archaeology x tech is a cool intersection. A couple sites I’ve been scoping for an expedition:
20 replies · 3 reposts · 117 likes · 28.6K views
Phil Howes @saltyph
when i tell people working in infra is like being a plumber, people assume it’s because of all the pipe connecting, when in fact it’s because i spend most of my day digging through shit
0 replies · 0 reposts · 8 likes · 143 views
abu @aqaderb
enduring businesses are 10x better and cheaper than incumbents. it's hard to believe that there isn't a world where AI powers 10x better products. but it's unclear if those products are cheaper. Baseten has helped, and will continue to help, builders and enterprises build those enduring businesses. we will make it cheap to run these models, fast to make your experiences magical, and reliable so you can focus on building.
Baseten @baseten

We're excited to announce that we've raised a $40M Series B to help power the next generation of AI-native products with performant, reliable and scalable inference infrastructure. baseten.co/blog/announcin…

2 replies · 1 repost · 28 likes · 3K views
Phil Howes retweeted
Baseten @baseten
Ready to try open source LLMs? Switch from GPT to Mistral 7B in the smallest refactor you'll ever ship: just 3 tiny code changes. If you're making the jump, DM us for $1,000 in free credits. baseten.co/blog/gpt-vs-mi…
0 replies · 7 reposts · 15 likes · 1.7K views
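The pitch in the linked post is that the OpenAI client itself is the integration surface, so "3 tiny code changes" plausibly means the API key, the base URL, and the model name. A hedged sketch of that shape using the `openai` Python client (the base URL and model id below are illustrative, not necessarily the post's exact values):

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",              # change 1: swap the API key
    base_url="https://inference.baseten.co/v1",  # change 2: point the client away from OpenAI (illustrative URL)
)

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # change 3: model name (was e.g. "gpt-3.5-turbo")
    messages=[{"role": "user", "content": "Explain TTFT in one sentence."}],
)
print(resp.choices[0].message.content)
```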
Phil Howes @saltyph
Repurposing @tuhinone's Llama v2 truss, got FreeWilly 2 up in under a minute. `:s/meta-llama\/Llama-2-70b-chat-hf/stabilityai\/FreeWilly2`. 275GB of weights later we're running at 23 tok/s out of the box.
1 reply · 11 reposts · 47 likes · 15.6K views
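For context, a Truss packages a model as a small Python class plus config, which is why swapping checkpoints really can be the one substitution in the tweet. A minimal sketch of what such a `model/model.py` might look like, assuming the truss loads weights with Hugging Face `transformers` (illustrative, not @tuhinone's actual truss):

```python
# model/model.py: the checkpoint id is the only line that changes between Llama 2 and FreeWilly 2
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "stabilityai/FreeWilly2"  # was: "meta-llama/Llama-2-70b-chat-hf"

class Model:
    def __init__(self, **kwargs):
        self._tokenizer = None
        self._model = None

    def load(self):
        # first deploy pulls the full weights (275GB per the tweet)
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT, device_map="auto", torch_dtype="auto"
        )

    def predict(self, model_input: dict) -> dict:
        inputs = self._tokenizer(model_input["prompt"], return_tensors="pt").to(self._model.device)
        output_ids = self._model.generate(**inputs, max_new_tokens=model_input.get("max_new_tokens", 256))
        return {"completion": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```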
Phil Howes retweeted
Tuhin Srivastava @tuhinone
We keep getting asked by users if they can use the 70B parameter model in production. We're serving the chat variant of Llama-2 70B on 2xA100 and getting pretty great throughput — it's cooking!
4 replies · 14 reposts · 89 likes · 20K views