
Red Hat AI

@RedHat_AI
Accelerating AI innovation with open platforms and community. The future of AI is open.

vLLM meetup is coming to Boston on March 31! Workshop + evening sessions covering:
- @vllm_project update
- Model compression and speculative decoding
- Agentic AI with vLLM
- Distributed inference at scale with @_llm_d_ and Kubernetes

Pre-event workshop at 3:30 PM: Deploy Llama 3.1 8B and benchmark llm-d's cache-aware routing live.

Shoutout to our sponsors: @RedHat, @IBM, @NVIDIAAI, The Open Accelerator, and @MITIBMLab!

Register here 👇 luma.com/4rmkrrb7

KServe v0.17 is live! 🚀

We are thrilled to announce KServe's most significant update yet. We've overhauled the architecture to move beyond traditional model serving: LLMInferenceService is now fully production-ready and built on the high-performance @_llm_d_ framework.

What's included?
- KV-cache-aware routing and disaggregated prefill-decode to maximize throughput.
- Cost-aware autoscaling designed for LLM inference workloads.
- A comprehensive parallelism specification for distributed inference.
- Envoy AI Gateway integration for sophisticated token-based rate limiting.
- A completely restructured modular Helm chart architecture.

Community Power 🤝
This version was made possible by 38 contributors, including 21 new contributors. Thank you for your hard work!

Full release notes: github.com/kserve/kserve/…
Release blog: kserve.github.io/website/blog/k…
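The KV-cache-aware routing mentioned above steers requests that share a prompt prefix to the same replica, so cached prefill work can be reused instead of recomputed. A toy sketch of the idea (hypothetical names; not llm-d's or KServe's actual routing logic, which also tracks real cache state):

```python
import hashlib

def route_by_prefix(prompt: str, replicas: list[str], prefix_len: int = 64) -> str:
    """Toy cache-aware router (hypothetical sketch): hash the prompt's
    leading characters so requests that share a prefix land on the same
    replica, letting that replica's warm KV cache be reused."""
    prefix = prompt[:prefix_len]
    digest = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    return replicas[digest % len(replicas)]

# Two requests sharing a long system prompt route to the same replica:
replicas = ["pod-a", "pod-b", "pod-c"]
system = "You are a helpful assistant. " * 4  # shared 116-char prefix
r1 = route_by_prefix(system + "Question one?", replicas)
r2 = route_by_prefix(system + "Question two?", replicas)
# r1 == r2, so the second request hits a replica with a warm prefix cache
```

Hashing only the prefix (rather than the full prompt) is the point: it trades perfect load balance for cache locality on the expensive prefill stage.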

🔥 Meet Mistral Small 4: One model to do it all.
⚡ 128 experts, 119B total parameters, 256k context window
⚡ Configurable Reasoning
⚡ Apache 2.0
⚡ 40% faster, 3x more throughput

Our first model to unify the capabilities of our flagship models into a single, versatile model.
