Devoxx: Large Scale Distributed LLM Inference with LLM-D and Kubernetes by Abdel Sghiouar

Large-scale LLM inference on Kubernetes just got a glow-up with LLM-D, a cloud-native framework that helps you juggle performance, availability, scalability and cost-efficiency across scarce GPU/TPU resources. Instead of wrestling with custom setups, you get a plug-and-play path to serve AI models faster and cheaper.
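To make that concrete, here's a minimal sketch of the kind of vLLM serving Deployment a framework like this schedules onto GPU nodes. The names, model, and replica count are illustrative assumptions, not LLM-D's actual manifests:

```yaml
# Illustrative only: a plain vLLM Deployment of the sort LLM-D manages for you.
# Names, model, and replica count are assumptions for the sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama                       # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # official vLLM serving image
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]  # example model
        ports:
        - containerPort: 8000            # vLLM's OpenAI-compatible API port
        resources:
          limits:
            nvidia.com/gpu: "1"          # one GPU per replica via the NVIDIA device plugin
```

The point of LLM-D is that you don't hand-roll and babysit manifests like this one: the framework handles placement, routing and scaling across your fleet of replicas.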

Under the hood, LLM-D stitches together vLLM, Prometheus and the Kubernetes Gateway API, plus smart KV-cache routing and disaggregated serving to squeeze every bit of GenAI performance per dollar. Crafted by the vLLM crew (Red Hat, Google, ByteDance) and Apache 2.0-licensed, it’s built for teams ready to level up their LLM game.
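As a rough sketch of the Gateway API side, a route like the one below could steer completion traffic to a pool of vLLM replicas. The gateway and backend names are assumptions; the idea behind KV-cache-aware routing is that requests sharing a prompt prefix land on a replica that already holds that prefix's KV cache, so the prefill work isn't redone:

```yaml
# Illustrative only: a standard Gateway API HTTPRoute; names are hypothetical.
# LLM-D's actual endpoint picking sits behind the gateway and is KV-cache-aware.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                  # hypothetical name
spec:
  parentRefs:
  - name: inference-gateway        # hypothetical Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/completions     # OpenAI-compatible completion endpoint
    backendRefs:
    - name: vllm-pool              # hypothetical Service in front of the vLLM replicas
      port: 8000
```

Disaggregated serving takes the same idea further by splitting the compute-heavy prefill phase and the memory-bound decode phase onto separate replica pools, so each can be sized and scaled independently.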

Watch on YouTube
