Michael Feil
@feilsystem
111 posts

Accelerating LLMs @Basetenco - long-context and embedding inference (https://t.co/IdBf5U7mS3) - opinions are my own.

San Francisco, CA · Joined February 2024
133 Following · 189 Followers
Michael Feil@feilsystem·
@art_zucker Making the token+position a u64 is a good idea for lookups; I've done this a couple of times myself, e.g. github.com/michaelfeil/ra… I cross-compiled the package from the blog post, so `pip install fastokens-b10` is a thing.
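A minimal sketch of the packing trick, assuming token ids and positions each fit in 32 bits; the helper names are mine for illustration, not from the linked repo or fastokens-b10:

```python
def pack_key(token_id: int, position: int) -> int:
    """Pack (token_id, position) into one u64-style key so a lookup is a single hash."""
    # Python ints stand in for u64 here; the assert enforces the 32-bit halves.
    assert 0 <= token_id < 2**32 and 0 <= position < 2**32
    return (position << 32) | token_id

def unpack_key(key: int) -> tuple[int, int]:
    """Recover (token_id, position) from the packed key."""
    return key & 0xFFFFFFFF, key >> 32

# One set-membership test per (token, position) pair instead of hashing a tuple.
seen = {pack_key(t, p) for p, t in enumerate([17, 42, 42, 17])}
assert unpack_key(pack_key(42, 1)) == (42, 1)
```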
Michael Feil retweeted
Amir Haghighat@amiruci·
You’ve used language models, image models, video models, and voice models. Now it’s time for world models, thanks to World Labs.
Michael Feil@feilsystem·
Turns out that all engines just prefill multiple requests at the same time, even when prefixes are shared. KV-style caching for training systems is possible; it just needs to look different from a vLLM-style paged KV cache. [2/x]
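As a toy illustration of the point (my sketch, not code from the thread): inside a single packed batch, the reusable part is the prefix every sequence shares, which only needs one prefill pass:

```python
def shared_prefix_len(batch: list[list[int]]) -> int:
    """Length of the token prefix common to every sequence in the batch."""
    n = 0
    for column in zip(*batch):     # walk the batch position by position
        if len(set(column)) != 1:  # sequences diverge at this position
            break
        n += 1
    return n

# Three requests sharing a system prompt: positions 0-2 need one prefill, not three.
batch = [[5, 5, 9, 1], [5, 5, 9, 2], [5, 5, 9, 3]]
assert shared_prefix_len(batch) == 3
```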
Michael Feil@feilsystem·
tl;dr: We open-sourced an inference engine that deduplicates prefill tokens and wrote a paper (@juliuslipp). RadixMLP was a missed chance for the community that developed varlen (THD-packed) inference, and it was overlooked by people working on training and inference engines. [1/x]
Baseten@baseten

Introducing RadixMLP: intra-batch prefix deduplication for 1.4–5x faster prefill. Tokens with identical prefixes (like system prompts or shared queries) produce identical activations. @feilsystem developed RadixMLP to eliminate this redundancy, then open-sourced it and added it to TEI and BEI. baseten.co/resources/rese…
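A minimal sketch of the idea behind RadixMLP (not the actual implementation, which would use a radix trie over the packed batch rather than hashing whole prefixes): an activation depends only on the token's full prefix, so any position whose prefix already occurred in the batch can reuse the first occurrence's compute slot:

```python
def dedup_prefill(batch: list[list[int]]) -> tuple[list[int], list[list[int]]]:
    """Assign each position a compute slot, reusing slots for repeated prefixes."""
    first_seen: dict[tuple[int, ...], int] = {}  # prefix -> compute slot
    unique_tokens: list[int] = []                # tokens that actually run
    gather: list[list[int]] = []                 # per-sequence output slots
    for seq in batch:
        slots = []
        for i in range(len(seq)):
            prefix = tuple(seq[: i + 1])         # the token plus everything before it
            if prefix not in first_seen:
                first_seen[prefix] = len(unique_tokens)
                unique_tokens.append(seq[i])
            slots.append(first_seen[prefix])
        gather.append(slots)
    return unique_tokens, gather

# Two requests with the shared prefix [1, 2, 3]: 6 unique positions instead of 9.
tokens, gather = dedup_prefill([[1, 2, 3, 7], [1, 2, 3, 8, 9]])
assert len(tokens) == 6 and gather[0][:3] == gather[1][:3]
```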

Michael Feil retweeted
Baseten@baseten·
Introducing Kimi K2.5 on Baseten’s Model APIs with the most performant TTFT (0.26 sec) and TPS (340) on Artificial Analysis. Even among a landscape of incredible open source models, Kimi K2.5 stands out with its multi-modal capabilities and its ability to accommodate an alarmingly large number of tool calls. Get the good stuff here: baseten.co/library/kimi-k…
Michael Feil retweeted
Cursor@cursor_ai·
Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.
Michael Feil retweeted
Baseten@baseten·
If you need an adrenaline rush to wake up from your post-Thanksgiving stupor… we got you. @deepseek_ai V3.2 dropped this week and is now available on Baseten. It’s so smart your mother will ask why you can't be more like DeepSeek. V3.2 is currently on par with GPT-5 all whilst being multiples cheaper. V3.2 is now live on our Model APIs and on @openrouter and @ArtificialAnlys. Baseten is the fastest provider with 0.22 s TTFT and 191 TPS (that’s 1.5x faster than the next guy). For a model this size, it’s screaming. Get the brains, without trading off performance.
Michael Feil retweeted
Baseten@baseten·
Powering inference for the fastest growing AI companies like OpenEvidence, Writer, and Clay means being the first to use bleeding-edge model performance tooling in production. That's why we were early adopters of NVIDIA Dynamo, giving us 50% lower latency and 60%+ higher throughput with KV cache-aware routing. These results are the tip of the iceberg — especially for our customers running large models with large context windows under heavy load.
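A toy sketch of what KV cache-aware routing means (my illustration, not Dynamo's implementation): prefer the replica whose cached prefixes overlap most with the incoming prompt, so prefill can skip the cached part:

```python
def route(prompt: list[int], replica_caches: dict[str, set[tuple[int, ...]]]) -> str:
    """Pick the replica holding the longest cached prefix of the prompt."""
    def best_overlap(cached: set[tuple[int, ...]]) -> int:
        return max(
            (len(p) for p in cached if tuple(prompt[: len(p)]) == p),
            default=0,  # cold cache: no overlap
        )
    return max(replica_caches, key=lambda r: best_overlap(replica_caches[r]))

caches = {
    "replica-a": {(1, 2, 3)},  # has the shared system prompt cached
    "replica-b": set(),        # cold cache
}
assert route([1, 2, 3, 9, 9], caches) == "replica-a"
```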
Michael Feil retweeted
Arthur Zucker@art_zucker·
One huge request shouldn’t tank everyone else’s latency. With async tokenization, small requests don't queue behind the huge ones. ~2× lower P50 and a big P90 drop, same throughput, happier users. Code: `await tokenizer.async_encode(text)` (tokenizers==0.22) 🐋➡️🐟⚡️
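A minimal sketch of what this buys you, assuming async_encode takes a single string and returns an Encoding as in the snippet above (tokenizers>=0.22):

```python
import asyncio
from tokenizers import Tokenizer

async def main() -> None:
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    huge = "lorem ipsum " * 100_000  # one very large request
    small = ["hello world"] * 8      # many small requests
    # Start the huge encode, then await the small ones concurrently;
    # they no longer queue behind it, which is where the P50/P90 win comes from.
    huge_task = asyncio.create_task(tokenizer.async_encode(huge))
    small_encodings = await asyncio.gather(
        *(tokenizer.async_encode(s) for s in small)
    )
    huge_encoding = await huge_task
    print(len(huge_encoding.ids), [len(e.ids) for e in small_encodings])

asyncio.run(main())
```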
Michael Feil retweeted
NVIDIA AI@NVIDIAAI·
📈 @baseten users are scaling smarter with us: ✅ 5× throughput on high-traffic endpoints ✅ 50% lower cost per token ✅ Up to 38% lower latency on the largest LLMs Built on NVIDIA Blackwell + TensorRT-LLM + Dynamo on @googlecloud—driving efficiency, speed & adoption at scale. Learn More: nvda.ws/4lUKT89
Michael Feil retweeted
Luc Georges@LucSGeorges·
An async API landed in tokenizers==0.22.0, pyo3-async-runtimes is 🔥 Thanks @feilsystem who contributed a neat PR to tokenizers to add the asynchronous bindings! This should make things faster in specific contexts, looking at you inference 👀
Michael Feil retweeted
Tuhin Srivastava@tuhinone·
We're very excited to be an @OpenAI launch partner for GPT OSS. Today's a big day for open models, and we have day 0 support for GPT OSS 120b via our Model APIs: baseten.co/library/gpt-os… We'll be rolling out more performance optimizations and benchmarks over the coming hours and days, so stay tuned -- and congrats to the OpenAI team on the launch!
Michael Feil retweeted
Baseten@baseten·
TEI doesn't run on B200s — but BEI does. BEI achieves 3.6x higher embeddings throughput than TEI and 3.3x that of vLLM on a high-throughput (500 tokens/request) test. In other words, we’re excited to announce Baseten Embeddings Inference (BEI) for Blackwell GPUs!
Michael Feil retweeted
Philip Kiely@philipkiely·
If you've been wondering where America's answer to DeepSeek is, check out @DeepCogito Today's open-source Cogito v2 release shows a promising research direction for faster, cheaper, smarter agents -- built right here in SF for <$3.5M.