Inferact

100 posts

Inferact

@inferact

Building the future of inference

San Francisco, California · Joined December 2025

5 Following · 4.6K Followers

Pinned Tweet

Inferact@inferact·5d

Proud to host the first vLLM Conference at Ray Summit, Aug 24–26 in SF 🚀🎉 We're bringing the community together to talk about what's next: 🏗️ Building and scaling AI in production 🌐 Deploying vLLM to run inference across your own cloud and hardware 🤖 What open source means for the future of AI 🌉 Come learn where the future of inference, open source, and AI is heading — and meet the leading builders driving it 👇 vllm.ai/events/vllm-co…

vLLM@vllm_project

Announcing the first-ever vLLM Conference — hosted by @inferact at Ray Summit, Aug 24–26 in San Francisco 🎉🌉 This is where we'll get into the work pushing open, high-performance inference forward, such as: 🗺️ Where the vLLM roadmap is headed ⚡ Getting the most out of accelerators including NVIDIA, AMD, TPU 🔗 Wiring vLLM into training and serving pipelines 🚀 Running inference on production scale The summit features speakers from Inferact, NVIDIA, AMD, Google TPU, Anyscale, PyTorch, Meta, Red Hat, and more 🎤 Come learn where the future of inference, open source, and AI is heading — and meet the leading builders driving it 👇 vllm.ai/events/vllm-co…

2.3K

Inferact retweeted

Simon Mo@simon_mo_·5d

🌉 August 24-26, SF. This will be the biggest vLLM Meetup ever! Join us to cover all things vLLM and meet the amazing community.

vLLM@vllm_project

Inferact@inferact·6 Jul

Inferact cohosted the AI Engineer World's Fair closing happy hour with @vllm_project & @novita_labs last week 🍵 Matcha, egg tarts, and the Portugal v. Croatia match ⚽ in the background — plus conversations spanning vLLM, the latest frontier models, and where AI is headed. 140+ of you showed up; new connections made, ideas swapped 🙏 thank you all!

1.5K

Inferact@inferact·30 Jun

🍻 Happy hour at the Inferact office this Thursday, July 2! We're co-hosting with @vllm_project and @novita_labs during World's Fair for a casual evening on leading open models, inference, infrastructure, and what's new in the space. RSVP here: luma.com/t42gxwnp

Novita AI@novita_labs

During @aiDotEngineer World’s Fair, we’re hosting a happy hour on Thursday, July 2 with special guests from @vllm_project, @inferact, and @buildwithRemy. Join AI builders, founders, and researchers for casual discussion on leading open weight models, inference, infrastructure, and recent developments. Brief intros early on, then drinks and conversation. No conference ticket required, spots are limited!

Inferact@inferact·29 Jun

Catch our founder & CEO @simon_mo_ on the debut of Altimeter @jaminball's First Pass, breaking down GLM5.2 and the rise of frontier open source models. 🎙️ The takeaway for enterprises: open source is becoming faster, cheaper, more reliable. Watch the full video 👇

Jamin Ball@jaminball

Introducing: First Pass! Altimeter's new video series breaking down the latest trends in AI, hosted by myself and @palak_go AI evolves so quickly. We wanted to create a series of short (~10 minute) videos taking a First Pass on the latest trends First up: @simon_mo_ on GLM5.2

2.5K

Inferact@inferact·24 Jun

Excited to be the first supporter of the LightSeek team on GitHub! Open source is the way, and we share their conviction that innovation should serve as a public good. Building inference in the open is how we accelerate AI progress for everyone.

LightSeek Foundation@lightseekorg

Thank you @Inferact for sponsoring us on @github. You can join them at our sponsors profile: github.com/sponsors/light…

2.8K

Inferact@inferact·19 Jun

World Cup day at the Inferact office! ⚽️🎊🏆 Featuring:
- chef @rogerw0108
- an extremely stressed @woosuk_k

2.2K

Inferact@inferact·18 Jun

Proud of the team on this one. Smooth Day 0 @MiniMax_AI M3 support in @vllm_project, along with our open source EAGLE3 spec decode model. Shoutout to @rogerw0108 for the ongoing push, reviews, and effort! More to come soon 👀 weights: huggingface.co/Inferact/MiniM… blog: vllm.ai/blog/2026-06-1…

SemiAnalysis@SemiAnalysis_

Great work to @vllm_project team and @NVIDIA on smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of ongoing M3 workstream: NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inferencing (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!

2.6K

Inferact@inferact·17 Jun

Ao Shen and @KaichaoYou cooking as usual 🧑‍🍳 A great read from SemiAnalysis on RL; check it out!

vLLM@vllm_project

A great deep dive from @SemiAnalysis_ on RL training systems and how much RL efficiency comes down to matching trainer and generator throughput! Shoutout to @KaichaoYou and Ao Shen from @inferact for the sandbox scaling experiments with vLLM + verl, building on @KaichaoYou's early RL integration work across OpenRLHF, verl, and slime🫡

2.4K

Inferact@inferact·16 Jun

Congrats to the team on day-0 support for GLM-5.2 in vLLM! Amazing effort from @Zai_org and the @vllm_project community

vLLM@vllm_project

🎉 Day-0 support for GLM-5.2 in vLLM, available today in v0.23.0! Congrats to @Zai_org on GLM-5.2, a flagship model built for long-horizon coding agents. ✨ 1M-token context, built to hold project-scale engineering work in a single run ✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization ✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms ✨ Client-side and mobile engineering, including an on-device debugging loop Try running it on vLLM today: 🔗 recipes.vllm.ai/zai-org/GLM-5.2

1.2K

Inferact@inferact·14 Jun

10x 🫡🚀 Eagle3 draft model here: huggingface.co/Inferact/MiniM…

SemiAnalysis@SemiAnalysis_

DAY 0 ALERT: @MiniMax_AI M3 is now available on HuggingFace & has been added to InferenceX. The M3 architecture has ~428B parameters and ~23B activated parameters. Due to the 10x engineers from @inferact, M3 is already delivering pretty well-optimized performance on @NVIDIAAI B300 Blackwell Ultra on Day 0 @vllm_project! Furthermore, Inferact released their EAGLE3 heads, which enable even greater performance. Looking forward to Day 1, 2, and 3 performance & the team is grinding on benchmarking Day 0 MI355X performance on InferenceX too.

4.2K

Inferact@inferact·13 Jun

🎉 Proud of the team's work to land day-0 MiniMax M3 support in vLLM! Day-0 M3 in vLLM: 1M context, MSA sparse attention, native multimodal, and tool calling for agentic workloads. Huge effort and partnership across @MiniMax_AI , @NVIDIAAI , @AIatAMD. This is what open-source inference at the frontier looks like. 🚀

vLLM@vllm_project

🎉 Congrats to @MiniMax_AI on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at @MiniMax_AI, @NVIDIAAI, @AIatAMD, and @inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗 vllm.ai/blog/2026-06-1…
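
The block-selection idea described above (score fixed-size KV blocks per query, then attend only over the top-scoring blocks) can be illustrated with a small pure-Python sketch. This is not MiniMax's MSA implementation or a vLLM kernel: the function name, the mean-key block scoring rule, and the `top_k` parameter are assumptions made for illustration only.

```python
import math

def block_sparse_attention(q, keys, values, block_size=128, top_k=2):
    """Attend one query over only the top_k highest-scoring KV blocks.

    Hypothetical sketch: a block is scored by the dot product of the query
    with the block's mean key (an assumption, not the published MSA rule).
    """
    n = len(keys)
    # Split token positions into contiguous blocks of block_size tokens.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    # Rank blocks by score and keep the best top_k, in positional order.
    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Ordinary scaled dot-product softmax attention, restricted to the
    # tokens inside the selected blocks.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, selected
```

With `top_k` equal to the number of blocks this reduces to dense attention; with a small `top_k` the per-query cost scales with `top_k * block_size` rather than with the full context length, which is the property the tweet credits for making 1M-token serving practical.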

1.5K

Inferact@inferact·5 Jun

Proud of the vLLM team for shipping day-0 support on Nemotron 3 Ultra! 550B / 55B active, hybrid Mamba-Transformer, 1M context — servable today.

vLLM@vllm_project

🚀 Day-0 support for NVIDIA Nemotron 3 Ultra on vLLM! Ready to be served with the latest vLLM stable release, the new open frontier reasoning model is built for long-running autonomous agents: 🧠 550B total / 55B active — Hybrid Transformer-Mamba MoE 📚 Up to 1M token context ⚡ NVFP4 + BF16 🛠️ Tool calling, coding, deep research, orchestration Read our detailed model launch blog and recipes! recipes.vllm.ai/nvidia/NVIDIA-…
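
As a quick worked example of what "550B total / 55B active" means for a mixture-of-experts model, the snippet below computes the fraction of parameters used per token. The Nemotron 3 Ultra figures come from the post above; the MiniMax M3 figures (~428B / ~23B) are the approximate numbers quoted in the SemiAnalysis post earlier in this feed. The helper name is hypothetical.

```python
def active_fraction(total_b, active_b):
    """Share of parameters activated per token (both arguments in billions)."""
    return active_b / total_b

# (total, active) parameter counts in billions, as quoted in the posts.
models = {
    "Nemotron 3 Ultra": (550, 55),
    "MiniMax M3 (approx.)": (428, 23),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token")
```

This says nothing about memory: all expert weights still need to be resident to serve the model, so the active fraction mainly indicates per-token compute.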

2.9K

Inferact@inferact·1 Jun

🚀 Excited to collab with @NVIDIARTXSpark pushing local AI agents forward across RTX + DGX Spark! Sharing our hands-on #vLLM + #DGXSpark blog with the @vllm_project community. We showed it off with a live 20 Questions game—first at our office warming, then at #MLSys2026, where curious attendees took turns stumping the model. Why vLLM + DGX Spark? You get a familiar serving workflow on local hardware: streaming responses, memory-efficient KV-cache management, runtime controls for unified memory, and the metrics to deploy on real workloads. ⚙️📊 Read the full blog and try it on your Spark 👇 vllm.ai/blog/2026-06-0…

NVIDIA RTX Spark@NVIDIARTXSpark

Local AI Agents are leveling up across DGX Spark & RTX PCs. NVIDIA OpenShell is coming to Windows alongside new agentic AI optimizations and creator app updates — including NVIDIA Broadcast 2.2, plus upcoming RTX acceleration for Adobe apps and Blender. More 👇

2.2K

Inferact@inferact·31 May

Honored to be on this list! 🎉 Cheers to AI infrastructure, @vllm_project, and building the future of inference 🚀

Redpoint@Redpoint

The Redpoint InfraRed 100 is now live. These are the companies building the infrastructure that powers everything happening in AI right now, from world models and agent runtimes to the sandboxes, databases, and security tools agents depend on. Congratulations to this year's honorees! Read the full 2026 InfraRed Report: our state of the union on AI and cloud infrastructure 👉 redpoint.com/reports/the-in…

2.6K

Inferact@inferact·26 May

Congrats @modal! 🚀 The shift toward teams owning their models is real, and the open inference layer is a big part of why. Excited to keep building alongside you.

Modal@modal

x.com/i/article/2057…

Inferact@inferact·26 May

🚀 Proud to see the Rust frontend land upstream in @vllm_project! Huge congrats to @BugenZhao for driving this work and introducing it at @PyTorch Meetup Singapore last week. A great milestone for the team and the vLLM community. 🦀 PR: github.com/vllm-project/v…

vLLM@vllm_project

🦀 The Rust frontend is officially merged into vLLM! As GPUs get faster, the frontend has become a real share of CPU time. The new Rust frontend is a drop-in alternative to the Python API server — same engine, same ZMQ boundary. Opt in with VLLM_USE_RUST_FRONTEND=1. Early numbers: on a preprocess-heavy workload, ~837 req/s vs ~162 req/s for default Python — ~5x in a single process. A few design choices we're excited about: • Layered crates with clear boundaries • Stream-native pipeline — non-streaming for free • Builds on stable Rust Huge thanks to @BugenZhao from @inferact for introducing the work at @PyTorch Meetup Singapore. github.com/vllm-project/v…
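
A minimal sketch of the opt-in described above, assuming the flag is read from the environment of the server process started with the `vllm serve` command. The environment variable name comes from the announcement; the helper names and the placeholder model ID are hypothetical, and which vLLM versions honor the flag is not stated in the post. The second helper just reproduces the quoted speedup arithmetic.

```python
import os

def rust_frontend_launch(model, base_env=None):
    """Build (argv, env) for a vLLM server with the Rust frontend opted in.

    Nothing is executed here; pass the result to subprocess.run(argv, env=env).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_USE_RUST_FRONTEND"] = "1"  # opt-in flag named in the announcement
    argv = ["vllm", "serve", model]      # model ID below is a placeholder
    return argv, env

def speedup(new_rps, old_rps):
    """Throughput ratio between two request rates."""
    return new_rps / old_rps

argv, env = rust_frontend_launch("your-org/your-model", base_env={})
print(argv, env["VLLM_USE_RUST_FRONTEND"])
print(round(speedup(837, 162), 2))  # quoted figures: ~837 vs ~162 req/s
```

The ratio works out to roughly 5.2, consistent with the "~5x" in the post; both request rates are the approximate early numbers quoted for one preprocess-heavy workload.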

Inferact@inferact·23 May

That's a wrap on #MLSys2026 in Bellevue! 🚢 It was great meeting so many of you this past week — researchers, contributors, and friends of @vllm_project. The energy around inference systems right now is something else, and the conversations reminded us why this community matters. A few highlights from our team: 🎤 @rogerw0108 (co-founder, vLLM core maintainer) gave an invited talk, "Rethinking Open Source Contribution in the Age of AI Agents" — a maintainer's-eye view of how AI-generated PRs are reshaping the economics of open source, with concrete examples from vLLM. 🎤 @yifandotqiao gave a Lightning Talk, "Rethink LLM Inference Abstractions: New Trends and Challenges in LLM Serving" — on the combinatorial explosion across models, hardware, and workloads, and why serving at scale is increasingly a distributed systems problem. And of course — congrats to everyone who played 20 Questions with vLLM at our booth 🎯 Thanks to the MLSys organizers for putting on such a great week. If we missed you in Bellevue, our DMs are open — always happy to talk inference, vLLM, and what we're building. On to the next one. 🛠️
