Inferact

69 posts

@inferact

Silicon · Joined December 2025
3 Following · 4K Followers
Inferact retweeted
DigitalOcean @digitalocean
Among the fastest DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B inference on the market, per Artificial Analysis benchmarks (April 2026). ⚡️🤖 Sub-1-second TTFT. 230 tokens per second. Co-designed every layer of the stack with @Inferact, performance-optimized @vllm_project, all on @NVIDIA HGX B300. Live on DigitalOcean Serverless Inference now. Full breakdown in the comments. ⬇️
1 reply · 8 retweets · 31 likes · 32.1K views
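A rough way to read those headline numbers as end-to-end latency (a back-of-the-envelope sketch, not from the post; the 0.9 s TTFT below is an assumed stand-in for "sub-1-second", and decode speed is treated as a constant):

```python
# Rough end-to-end latency model for the numbers in the post above.
# ASSUMPTIONS (not from the post): TTFT pinned at 0.9 s to stand in for
# "sub-1-second"; decode treated as a constant 230 tokens/second.

TTFT_S = 0.9          # assumed time-to-first-token, seconds
DECODE_TOK_S = 230.0  # steady-state decode throughput, tokens/second

def end_to_end_latency_s(output_tokens: int) -> float:
    """Seconds until the last of `output_tokens` output tokens arrives."""
    return TTFT_S + output_tokens / DECODE_TOK_S

for n in (128, 1024, 4096):
    print(f"{n:5d} tokens -> ~{end_to_end_latency_s(n):4.1f} s")
```

Under these assumptions a 1,024-token completion lands in about five and a half seconds, and a 4,096-token completion in under twenty.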
Inferact retweeted
Roger Wang @rogerw0108
🚀🚀🚀 github.com/vllm-project/v… just got merged to main! Huge shoutout to the entire team at @inferact that worked on day-0 support for DeepSeek V4, and to our partner @NVIDIAAI for the collaboration on day-0 large-scale serving enablement! More optimizations coming soon - stay tuned!
0 replies · 7 retweets · 61 likes · 3.7K views
Inferact retweeted
Roger Wang @rogerw0108
Big thanks to @NVIDIAAI and @SemiAnalysis_ for shipping this together with @inferact @vllm_project! Day-0 large-scale serving for a new architecture is a huge milestone. DeepSeekV4 is a beast, but more optimizations are coming to @vllm_project to make it faster and cheaper!
SemiAnalysis @SemiAnalysis_

DAVIS, APRIL 25, 2026 — InferenceX has added DeepSeekv4 using @vllm_project's day-0 support for GB200 disagg! Great work to @flowpow123 @rogerw0108 @NVIDIAAIDev @inferact for the fast support and engineering!

3 replies · 5 retweets · 33 likes · 3.5K views
Inferact retweeted
SemiAnalysis @SemiAnalysis_
DeepSeekv4 Pro 1.6T is supported on InferenceX on day 0! We already have vLLM working on H200, and we're working on day-0 DeepSeekv4 performance benchmarking for @vllm_project & @sgl_project on MI355, B200, B300, and GB200/300 disaggregated setups to track the progress of improvement. Thank you to @NVIDIAAIDev & @simon_mo_ & @rogerw0108 from @inferact for helping pull an all-nighter to debug dsv4 on day 0.
[image]
5 replies · 10 retweets · 167 likes · 18.3K views
Inferact retweeted
vLLM @vllm_project
🎉 Day-0 support for @deepseek_ai V4 Pro and Flash on vLLM — a new generation of DeepSeek models, purpose-built for tasks up to 1M tokens. Alongside the release, we're publishing a first-principles walkthrough of the new long-context attention and how we implemented it in vLLM.

The new attention mechanism, in four moves:
• Shared K/V + inverse RoPE → 2× memory savings
• c4a / c128a KV compression → 4×–128× savings
• DeepSeek Sparse Attention over compressed tokens
• Short sliding window for locality across compression boundaries

At 1M context, per-layer KV state is ~8.7× smaller than a DeepSeek V3.2-style 61-layer stack (9.62 GiB vs 83.9 GiB, bf16). fp8 attention cache + fp4 indexer cache shrink it further.

vLLM side:
• Unified hybrid KV cache — single logical block size (256 native positions) across all compression rates; compressor state folded into the SWA KV cache spec so prefix caching, disagg prefill, CUDA graphs and MTP reuse the same abstraction
• Three page-size buckets for the full 5-way cache stack → no cross-kind fragmentation
• Fused kernels: compressor + RMSNorm + RoPE + cache insert (1.4–3×), inverse RoPE + fp8 quant (2–3×), Q-norm + KV RoPE + K insert (10–20×)
• Multi-stream overlap of indexer vs main-KV compression vs SWA insertion

Disaggregated serving is supported out of the box and strongly recommended for best performance. Follow our recipes site for verified commands for @nvidia Blackwell (B200, B300, GB200, GB300) and Hopper (H100/H200/H20) systems.

Thanks to the @deepseek_ai team for open-sourcing DeepSeek V4, and to @inferact for landing day-0 support 🤝

📝 Blog: vllm.ai/blog/deepseek-…
📖 Recipes: recipes.vllm.ai/deepseek-ai/De…
🤗 huggingface.co/deepseek-ai/De…
[image]
DeepSeek @deepseek_ai

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n

17 replies · 90 retweets · 567 likes · 120.8K views
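The savings ratio in that post checks out arithmetically, and the "single logical block size" idea is easy to picture: every cache kind spans the same 256 native positions per logical block, so a compression rate of c just means 256/c physical entries per block. A minimal sketch of that bookkeeping (the per-entry byte counts here are hypothetical placeholders, not vLLM's actual values):

```python
# Sanity-checking the KV-cache numbers in the vLLM post above, then a toy
# view of the "single logical block size" design. The per-entry byte counts
# below are HYPOTHETICAL placeholders, not vLLM's real layout.

V32_STACK_GIB = 83.9  # stated: DeepSeek V3.2-style 61-layer stack, bf16, 1M ctx
V4_STACK_GIB = 9.62   # stated: new KV state at the same settings
print(f"stated savings: {V32_STACK_GIB / V4_STACK_GIB:.1f}x")  # -> 8.7x

NATIVE_POSITIONS_PER_BLOCK = 256  # stated: one logical block size everywhere

# (cache kind, compression rate, hypothetical bytes per physical entry)
CACHE_KINDS = [
    ("shared-KV (uncompressed)", 1, 1152),
    ("c4a compressed KV",        4, 1152),
    ("c128a compressed KV",    128, 1152),
    ("sliding-window KV",        1, 1152),
    ("indexer cache",            1,  128),
]

for name, rate, entry_bytes in CACHE_KINDS:
    # Every kind covers the same 256 native positions per logical block;
    # compression only changes how many physical entries that takes.
    entries = NATIVE_POSITIONS_PER_BLOCK // rate
    print(f"{name:26s} {entries:3d} entries/block  "
          f"{entries * entry_bytes:6d} B/block")
```

The point of the shared logical block is visible in the loop: all five cache kinds agree on what a block covers, so machinery like prefix caching and disaggregated prefill can reason about blocks without caring about each kind's physical layout.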
Inferact retweeted
Woosuk Kwon @woosuk_k
Going from Ampere to Hopper and now to Blackwell, we always find new ways to leverage architectural innovations to accelerate inference performance. Excited to collaborate with @nvidia to advance @inferact's mission to grow vLLM!
Inferact @inferact

[Quoted tweet: Inferact's announcement that @nvidia is its latest investor; full text below.]

1 reply · 3 retweets · 94 likes · 12.8K views
Inferact retweeted
Simon Mo @simon_mo_
@vllm_project has always been about the partnership and ecosystem that support open source inference. I’m excited to continue our collaboration with @nvidia and welcome them as @inferact’s latest investor.
Inferact @inferact

[Quoted tweet: Inferact's announcement that @nvidia is its latest investor; full text below.]

6 replies · 2 retweets · 49 likes · 9.3K views
Inferact @inferact
We are thrilled to announce that @nvidia is the latest investor in @inferact. We look forward to continuing the momentum driven by our deep collaboration:
(1) Engineering velocity: a significant uptick in @nvidia pull requests to the @vllm_project repo.
(2) Product synergy: close integration with NVIDIA Dynamo, ModelOpt, Nemotron, and more products!
It's an exciting time for the growth and development of vLLM, the world's AI inference engine!
8 replies · 7 retweets · 81 likes · 29.4K views
Inferact retweeted
Roger Wang @rogerw0108
"Math is hard - I find myself struggle with math very often." - guy with IMO & IOI gold who just joined us.
1 reply · 2 retweets · 26 likes · 3.1K views
Inferact retweeted
Bogomil Balkansky @BogieBalkansky
It's wonderful to see the creators of @vllm_project start a company, @inferact. vLLM has been capturing the hearts and minds of the technical community for years, and a company based on it means more innovation from the brilliant minds behind it: @simon_mo_, @woosuk_k, and the whole team.
Woosuk Kwon @woosuk_k

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge
Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us
vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source
vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us
Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp, who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

10 replies · 6 retweets · 75 likes · 14.1K views
Inferact retweeted
The House Fund @thehousefund
We backed @Inferact at inception, based on the Berkeley research project vLLM. Today, they announced a $150M seed led by @a16z and @LightspeedVP, with @Sequoia and The House Fund — one of the largest seed rounds ever. What started in a lab is now the open-source inference standard, powering AI at Meta, Google, and Character.AI, with 2,000+ contributors worldwide. Huge congrats to @simon_mo_, @woosuk_k, @KaichaoYou, @rogerw0108, @istoica05, @profjoeyg & team! The best of Berkeley AI + infrastructure. 🐻 Go Bears!
6 replies · 6 retweets · 58 likes · 9K views
Inferact retweeted
Hao Zhang @haozhangml
Big congrats on @inferact! Since we initiated vLLM’s earliest research push back in 2023, it has been incredible to watch @vllm_project become the OSS inference engine for so many teams. Building a project like this takes persistence across everything: research breakthroughs, ruthless engineering, performance + stability work, ecosystem integration, and the unglamorous grind of docs/CI/issues/releases. Huge gratitude to the maintainers & contributors—can’t wait to keep upstreaming new inference ideas in 2026 with the greater community and @inferact 🚀
Woosuk Kwon @woosuk_k

[Quoted tweet: Woosuk Kwon's Inferact announcement, reproduced in full above.]

4 replies · 6 retweets · 78 likes · 18K views
Inferact retweeted
Woosuk Kwon @woosuk_k
Thank you so much! I still remember the day @haozhangml suggested working on LLM inference back in 2022. vLLM truly wouldn’t exist without you.
Hao Zhang @haozhangml

[Quoted tweet: Hao Zhang's congratulations to @inferact, reproduced in full above.]

1 reply · 5 retweets · 59 likes · 8.2K views
Inferact retweeted
Lightspeed @lightspeedvp
Inferact Co-Founder Simon Mo on AI economics: "You build the data centers, the training cluster, fund the training run, produce a model… but at that point, there is no value created." "Only delivering inference is the point where you can actually capitalize on this intelligence." Inference, not training, is increasingly where AI resources are being concentrated. @simon_mo_ @inferact
Lightspeed @lightspeedvp

We co-led Inferact's $150M seed round to support them in their mission to build the inference engine for all current and future AI. In this episode of The Investment Memo, Lightspeed's Bucky Moore and James Alcorn sit down with Simon Mo (Co-Founder & CEO @inferact) to cover:
- How vLLM grew to 60K+ GitHub stars
- Why inference is shifting to the majority of compute
- How vLLM evolved from a research project into the industry standard
- Why building a company was the next step to push open-source inference forward

00:00 Introduction
02:03 The investment memo
04:47 Latency vs throughput vs cost
06:19 Paged attention explained
08:04 The evolution of attention
09:42 Growing the vLLM open source community
11:41 Working with hardware vendors
14:45 Deploying vLLM at large scale
16:03 Inferact's culture of openness
18:45 Building an open ecosystem and horizontal stack
19:45 Inferact's approach to fundraising
22:14 What is the future of inference?

@simon_mo_ @buckymoore @JamesAlcorn94

3 replies · 3 retweets · 24 likes · 6.2K views
Inferact retweeted
Lightspeed @lightspeedvp
Inferact CEO @simon_mo_ says the AI infrastructure buildout is misunderstood: "The clusters being built for training—six months later, they'll be used entirely for inference." "Inference will start to eat up that capacity, and consume all the newly provisioned energy." @inferact
Lightspeed @lightspeedvp

[Quoted tweet: Lightspeed's Investment Memo episode announcement, reproduced in full above.]

2 replies · 1 retweet · 25 likes · 4.5K views
Inferact retweeted
Yusen DAI | 戴雨森 @yusen
Very excited to partner with @inferact in support of their mission to build the inference engine for AI. ZhenFund is proud to have been an early supporter of @vllm_project. Huge congrats to @simon_mo_, @woosuk_k, @KaichaoYou, @rogerw0108, @istoica05, and the rest of the founding team.
Simon Mo @simon_mo_

vLLM has grown to 2,000+ contributors, with a diverse community of models, hardware, and applications. I see @vllm_project on the path to becoming the world's inference engine, and @inferact accelerating AI progress. We could not be more excited about the road ahead.

7 replies · 2 retweets · 28 likes · 12.4K views
Inferact retweeted
Lily Liu @eqhylxx
vLLM was where I first got deep into MLsys—so excited to see the company finally here. Huge congrats and best wishes to @woosuk_k and @inferact!
Woosuk Kwon @woosuk_k

[Quoted tweet: Woosuk Kwon's Inferact announcement, reproduced in full above.]

2 replies · 2 retweets · 57 likes · 6.3K views