Phil Howes

28 posts

Phil Howes

@saltyph

building https://t.co/aUjKNzIyMT

oakland, arrakis Katılım Mayıs 2013

438 Takip Edilen167 Takipçiler

Phil Howes retweetledi

Baseten@baseten·6 Mar

We've launched the fastest GLM 5 API available at 190 TPS and 0.79 sec TTFT with the Baseten Inference Stack. Ready for your coding and agentic workflows. baseten.co/blog/how-we-bu…

English

104

19K

Phil Howes@saltyph·11 Kas

so much potential in this model and @aqaderb coming out of the gates just ripping the landscape on perf

Baseten@baseten

It’s Monday, and we could all use a little help thinking. Thankfully we have the new Kimi K2 Thinking to do it for us. Kimi K2 Thinking is now live in our Model APIs with the most performant TTFT (0.3 sec) and TPS (140) on @openrouter & @ArtificialAnlys . If you’re looking for an alternative to GPT-5, utilize coding or are building agentic AI, you *need* to give this model a try. Congrats @Kimi_Moonshot , you all are astounding. Get access in the comments ➡️

English

206

Phil Howes@saltyph·24 Eki

speculation, in this case a eagle-3, remains one of the biggest levers to go from good to great. amazing job to leapfrog the market and get the most out of our GPUs

Baseten@baseten

This week, Baseten's model performance team unlocked the fastest TPS and TTFT for gpt-oss 120b on @nvidia hardware. When gpt-oss launched we sprinted to offer it at 450 TPS... now we've exceeded 650 TPS and 0.11 sec TTFT... and we'll keep working to keep raising the bar. We are proud to offer the best E2E latency available with near-limitless scale, incredible performance, and the highest uptime 99.99%.

English

113

Phil Howes@saltyph·6 Ağu

💪🫡 still plenty of juice to squeeze out of this one

Amir Haghighat@amiruci

It's important to support newly released open-weight models on day 1. But it's not noteworthy. What's noteworthy is to have the inference optimization muscle to immediately blow the competition out of water on latency and throughput. As measured by OpenRouter:

English

630

Phil Howes@saltyph·13 Haz

you can just do things faster

Baseten@baseten

We're excited to introduce the Baseten Performance Client, a new open-source Python library for up to 12x higher throughput for high-volume embedding tasks! Stand up a new vector database, preprocess text, and run massive workloads in <2 minutes (vs. 15+ with AsyncOpenAI).

English

235

Phil Howes@saltyph·10 Mar

@jxmnop if you read this and still want to learn cuda anyway, we’re hiring for this at @baseten to get more brrrr/dollar. dms open

English

405

dr. jack morris@jxmnop·10 Mar

in 2025, if you want to become a successful AI engineer or researcher, you should NOT learn CUDA furthermore – i'd guess that 80% of successful ML researchers have never written a CUDA kernel practical ML is about training models and using them to make predictions. this has nothing to do with CUDA CUDA is necessary in two cases: (a) you are developing a radically new model that isn't easily expressible in PyTorch or Jax (i.e. Mamba) (b) you are running into performance bottlenecks from current CUDA code and need to make it faster i doubt that either case applies to you chances are you aren't building the next Mamba, and the bottlenecks you'll run into in practice are different you should work on finding the right data or hardware or setting things up properly or distributing efficiently across hardware or researching new efficient ways to run models that other people are working on (like vLLM and SGLang) or better than that, work on your eval pipeline. find ways to measure your model's performance that are more realistic, comprehensive, efficient, fair, etc. TLDR: want to learn? spend your time tinkering with models in PyTorch and Jax. not writing matrix multiplications

English

1.5K

322.9K

Phil Howes retweetledi

Michael Feil@feilsystem·6 Mar

New Qwen-QWQ running at 90tokens/s generation speed on a single H100 @baseten using a new spec-dec stack. Around 2x more than the rest of the leaderboard (artificialanalysis.ai/leaderboards/p…).

English

1.7K

Phil Howes@saltyph·26 Şub

hit new peak demand today, 3 million RPS. thanks for stress testing our infra anon internet friend

English

Phil Howes@saltyph·29 Oca

@tuhinone what were your <think> tokens?

English

Phil Howes retweetledi

Baseten@baseten·19 Nis

The models are available at the following links: Llama 3 8B Instruct: baseten.co/library/llama-… Llama 3 70B Instruct: baseten.co/library/llama-…

English

1.6K

Phil Howes retweetledi

Conviction@conviction·12 Nis

Congrats to Conviction and Embed companies @baseten @Figure_robot @harvey__ai @langchain @MistralAI @sierraplatform @pika_labs (and our many pioneering friends) for making the #ForbesAI50 list! Ground floor of the revolution that will lead to many massive companies.

English

21.1K

Phil Howes retweetledi

abu@aqaderb·14 Mar

2 things. 1. i have loved working on this team. model performance is so much fun and so rewarding. 2. persistence is key. we started working on model performance end of 2023 and watching us slowly become better and better has been an incredible experience.

Baseten@baseten

fast!

English

1.9K

Phil Howes@saltyph·8 Mar

@saranormous i strongly recommend @Fall_of_Civ_Pod for you both, great production quality long form history

English

162

sarah guo@saranormous·8 Mar

My 6yo daughter is really into archaeology so I’ve been learning — I get more excited about ancient civilizations than about dinosaurs, and archaeology x tech is a cool intersection. A couple sites I’ve been scoping for an expedition:

English

117

28.6K

Phil Howes@saltyph·8 Mar

when i tell people working in infra is like being a plumber people assume it’s because of lots of pipe connecting, when in fact it’s because i spend most of my day digging through shit

English

143

Phil Howes@saltyph·7 Mar

@thomasschiavone @saranormous @awscloud @baseten gotta get gpus somewhere. welcome aboard, happy to bear the brunt of the pain

GIF

English

323

Phil Howes@saltyph·4 Mar

@aqaderb couldn't do it without you friend

English

abu@aqaderb·4 Mar

enduring businesses are 10x better and cheaper than incumbents. it's hard to believe that there isn't a world where AI powers 10x better products. but it's unclear if those products are cheaper. Baseten has helped, and will continue to help, builders and enterprises build those enduring businesses. we will make it cheap to run these models, fast to make your experiences magical and reliable so you can focus on building.

Baseten@baseten

We're excited to announce that we've raised a $40M Series B to help power the next generation of AI-native products with performant, reliable and scalable inference infrastructure. baseten.co/blog/announcin…

English

Phil Howes@saltyph·4 Mar

every day i get to work with a world class team supporting customers with world class products. today we get to dream a little bigger

Baseten@baseten

English

269

Phil Howes retweetledi

Baseten@baseten·22 Kas

Ready to try open source LLMs? Switch from GPT to Mistral 7B in the smallest refactor you'll ever ship: just 3 tiny code changes. If you're making the jump, DM us for $1,000 in free credits. baseten.co/blog/gpt-vs-mi…

English

1.7K

Phil Howes@saltyph·22 Tem

Repurposing @tuhinone's Llama v2 truss, got FreeWilly 2 up in under a minute. `:s/meta-llama\/Llama-2-70b-chat-hf/stabilityai\/FreeWilly2`. 275GB of weights later we're running at 23 tok/s out of the box.

English

15.6K

Phil Howes retweetledi

Tuhin Srivastava@tuhinone·21 Tem

We keep getting asked by users if they can use the 70B parameter model in production. We're serving the chat variant of Llama-2 70B on 2xA100 and getting pretty great throughput — it's cooking!

English

20K

Keşfet

@aqaderb @jxmnop @baseten @tuhinone @Figure_robot @harvey__ai @langchain @MistralAI