Michael Feil
@feilsystem
111 posts

Accelerating LLMs @Basetenco - long-context and embedding inference (https://t.co/IdBf5U7mS3) - opinions are my own.

San Francisco, CA · Joined February 2024
133 Following · 189 Followers
Michael Feil@feilsystem·
@art_zucker Making the token+position a u64 is a good idea for lookups; I've done this a couple of times myself, e.g. github.com/michaelfeil/ra… I cross-compiled the package from the blog post, so `pip install fastokens-b10` is a thing.
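A minimal sketch of the packing trick, assuming token ids and positions each fit in 32 bits; the helper names are mine for illustration, not from the linked repo or fastokens-b10:

```python
def pack_key(token_id: int, position: int) -> int:
    """Pack (token_id, position) into one u64-style key so a lookup is a single hash."""
    # Python ints stand in for u64 here; the assert enforces the 32-bit halves.
    assert 0 <= token_id < 2**32 and 0 <= position < 2**32
    return (position << 32) | token_id

def unpack_key(key: int) -> tuple[int, int]:
    """Recover (token_id, position) from the packed key."""
    return key & 0xFFFFFFFF, key >> 32

# One set-membership test per (token, position) pair instead of hashing a tuple.
seen = {pack_key(t, p) for p, t in enumerate([17, 42, 42, 17])}
assert unpack_key(pack_key(42, 1)) == (42, 1)
```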
Michael Feil retweeted
Amir Haghighat@amiruci·
You’ve used language models, image models, video models, and voice models. Now it’s time for world models, thanks to World Labs.
Michael Feil@feilsystem·
Turns out that all engines just prefill multiple requests at the same time, even when prefixes are shared. KV-style caching for training systems is possible; it just needs to look different from a vLLM-style paged KV cache. [2/x]
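As a toy illustration of the point (my sketch, not code from the thread): inside a single packed batch, the reusable part is the prefix every sequence shares, which only needs one prefill pass:

```python
def shared_prefix_len(batch: list[list[int]]) -> int:
    """Length of the token prefix common to every sequence in the batch."""
    n = 0
    for column in zip(*batch):     # walk the batch position by position
        if len(set(column)) != 1:  # sequences diverge at this position
            break
        n += 1
    return n

# Three requests sharing a system prompt: positions 0-2 need one prefill, not three.
batch = [[5, 5, 9, 1], [5, 5, 9, 2], [5, 5, 9, 3]]
assert shared_prefix_len(batch) == 3
```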
Michael Feil@feilsystem·
tl;dr: We open-sourced an inference engine that deduplicates prefill tokens and wrote a paper (@juliuslipp). RadixMLP was a missed chance for the community that developed varlen (THD-packed) inference, and it was overlooked by people working on training and inference engines. [1/x]
Baseten@baseten

Introducing RadixMLP: intra-batch prefix deduplication for 1.4–5x faster prefill. Tokens with identical prefixes (like system prompts or shared queries) produce identical activations. @feilsystem developed RadixMLP to eliminate this redundancy, then open-sourced it and added it to TEI and BEI. baseten.co/resources/rese…
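A minimal sketch of the idea behind RadixMLP (not the actual implementation, which would use a radix trie over the packed batch rather than hashing whole prefixes): an activation depends only on the token's full prefix, so any position whose prefix already occurred in the batch can reuse the first occurrence's compute slot:

```python
def dedup_prefill(batch: list[list[int]]) -> tuple[list[int], list[list[int]]]:
    """Assign each position a compute slot, reusing slots for repeated prefixes."""
    first_seen: dict[tuple[int, ...], int] = {}  # prefix -> compute slot
    unique_tokens: list[int] = []                # tokens that actually run
    gather: list[list[int]] = []                 # per-sequence output slots
    for seq in batch:
        slots = []
        for i in range(len(seq)):
            prefix = tuple(seq[: i + 1])         # the token plus everything before it
            if prefix not in first_seen:
                first_seen[prefix] = len(unique_tokens)
                unique_tokens.append(seq[i])
            slots.append(first_seen[prefix])
        gather.append(slots)
    return unique_tokens, gather

# Two requests with the shared prefix [1, 2, 3]: 6 unique positions instead of 9.
tokens, gather = dedup_prefill([[1, 2, 3, 7], [1, 2, 3, 8, 9]])
assert len(tokens) == 6 and gather[0][:3] == gather[1][:3]
```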

Michael Feil retweeted
Baseten@baseten·
Introducing Kimi K2.5 on Baseten’s Model APIs with the most performant TTFT (0.26 sec) and TPS (340) on Artificial Analysis. Even among a landscape of incredible open source models, Kimi K2.5 stands out with its multi-modal capabilities and its ability to accommodate an alarmingly large number of tool calls. Get the good stuff here: baseten.co/library/kimi-k…
Michael Feil retweeted
Cursor@cursor_ai·
Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.
Michael Feil retweeted
Baseten@baseten·
If you need an adrenaline rush to wake up from your post-Thanksgiving stupor… we got you. @deepseek_ai V3.2 dropped this week and is now available on Baseten. It’s so smart your mother will ask why you can't be more like DeepSeek. V3.2 is currently on par with GPT-5 all whilst being multiples cheaper. V3.2 is now live on our Model APIs and on @openrouter and @ArtificialAnlys. Baseten is the fastest provider with 0.22 s TTFT and 191 TPS (that’s 1.5x faster than the next guy). For a model this size, it’s screaming. Get the brains, without trading off performance.
Michael Feil retweeted
Baseten@baseten·
Powering inference for the fastest growing AI companies like OpenEvidence, Writer, and Clay means being the first to use bleeding-edge model performance tooling in production. That's why we were early adopters of NVIDIA Dynamo, giving us 50% lower latency and 60%+ higher throughput with KV cache-aware routing. These results are the tip of the iceberg — especially for our customers running large models with large context windows under heavy load.
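A toy sketch of what KV cache-aware routing means (my illustration, not Dynamo's implementation): prefer the replica whose cached prefixes overlap most with the incoming prompt, so prefill can skip the cached part:

```python
def route(prompt: list[int], replica_caches: dict[str, set[tuple[int, ...]]]) -> str:
    """Pick the replica holding the longest cached prefix of the prompt."""
    def best_overlap(cached: set[tuple[int, ...]]) -> int:
        return max(
            (len(p) for p in cached if tuple(prompt[: len(p)]) == p),
            default=0,  # cold cache: no overlap
        )
    return max(replica_caches, key=lambda r: best_overlap(replica_caches[r]))

caches = {
    "replica-a": {(1, 2, 3)},  # has the shared system prompt cached
    "replica-b": set(),        # cold cache
}
assert route([1, 2, 3, 9, 9], caches) == "replica-a"
```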
Michael Feil retweeted
Arthur Zucker@art_zucker·
One huge request shouldn’t tank everyone else’s latency. With async tokenization, small requests don't queue behind the huge ones. ~2× lower P50 and a big P90 drop, same throughput, happier users. Code: `await tokenizer.async_encode(text)` (tokenizers==0.22) 🐋➡️🐟⚡️
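A minimal sketch of what this buys you, assuming async_encode takes a single string and returns an Encoding as in the snippet above (tokenizers>=0.22):

```python
import asyncio
from tokenizers import Tokenizer

async def main() -> None:
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    huge = "lorem ipsum " * 100_000  # one very large request
    small = ["hello world"] * 8      # many small requests
    # Start the huge encode, then await the small ones concurrently;
    # they no longer queue behind it, which is where the P50/P90 win comes from.
    huge_task = asyncio.create_task(tokenizer.async_encode(huge))
    small_encodings = await asyncio.gather(
        *(tokenizer.async_encode(s) for s in small)
    )
    huge_encoding = await huge_task
    print(len(huge_encoding.ids), [len(e.ids) for e in small_encodings])

asyncio.run(main())
```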
Michael Feil retweeted
NVIDIA AI@NVIDIAAI·
📈 @baseten users are scaling smarter with us: ✅ 5× throughput on high-traffic endpoints ✅ 50% lower cost per token ✅ Up to 38% lower latency on the largest LLMs Built on NVIDIA Blackwell + TensorRT-LLM + Dynamo on @googlecloud—driving efficiency, speed & adoption at scale. Learn More: nvda.ws/4lUKT89
Michael Feil retweeted
Luc Georges@LucSGeorges·
An async API landed in tokenizers==0.22.0, pyo3-async-runtimes is 🔥 Thanks @feilsystem who contributed a neat PR to tokenizers to add the asynchronous bindings! This should make things faster in specific contexts, looking at you inference 👀
Michael Feil retweeted
Tuhin Srivastava@tuhinone·
We're very excited to be an @OpenAI launch partner for GPT OSS. Today's a big day for open models, and we have day 0 support for GPT OSS 120b via our Model APIs: baseten.co/library/gpt-os… We'll be rolling out more performance optimizations and benchmarks over the coming hours and days, so stay tuned -- and congrats to the OpenAI team on the launch!
Michael Feil retweeted
Baseten@baseten·
TEI doesn't run on B200s — but BEI does. BEI achieves 3.6x higher embeddings throughput than TEI and 3.3x that of vLLM on a high-throughput (500 tokens/request) test. In other words, we’re excited to announce Baseten Embeddings Inference (BEI) for Blackwell GPUs!
Michael Feil retweeted
Philip Kiely@philipkiely·
If you've been wondering where America's answer to DeepSeek is, check out @DeepCogito Today's open-source Cogito v2 release shows a promising research direction for faster, cheaper, smarter agents -- built right here in SF for <$3.5M.