Your LLM projects might hum along in testing but totally tank in production because, surprise, LLMs are not your average stateless REST APIs! They're hungry, stateful beasts that gulp down GPU memory and cause mayhem with context handling, request batching, and caching.
But fear not! This talk offers a lifeline: LLM-D's open-source sharding combined with a clever mix of NVIDIA and AMD node pools. Get ready for live demos, handy YAML, and even a secret sauce to keep your token costs from going supernova.
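To make the node-pool idea a bit more concrete, here is a minimal, hypothetical sketch of what the Kubernetes side can look like: a deployment pinned to an NVIDIA pool via a made-up `gpu-vendor` node label, requesting GPUs through the standard `nvidia.com/gpu` device-plugin resource (AMD nodes expose `amd.com/gpu` through AMD's plugin). The names, labels, and image are placeholders, not llm-d's actual manifests from the talk.

```yaml
# Hypothetical sketch only: placeholder names, labels, and image,
# not llm-d's real manifests. An AMD variant would swap the node
# selector and request amd.com/gpu instead of nvidia.com/gpu.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-nvidia        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
      gpu-vendor: nvidia
  template:
    metadata:
      labels:
        app: llm-inference
        gpu-vendor: nvidia
    spec:
      nodeSelector:
        gpu-vendor: nvidia          # made-up node-pool label; label your GPU nodes to match
      containers:
        - name: inference-server
          image: example.com/llm-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1     # standard NVIDIA device-plugin resource
```

Splitting NVIDIA and AMD capacity into separately labeled pools like this is one common way to route model shards to the right hardware without the scheduler mixing them up.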
Watch on YouTube