
Scale YouTube


CNCF [Cloud Native Computing Foundation]: Fail Open, Fail Fast: Improving Envoy Resilience in Latency-Critical Systems

In this talk, the team tackles Envoy’s resilience challenges in latency-sensitive, large-scale storage systems: debugging intermittent “no such bucket” errors caused by rate-limiting timeouts, tuning fail-open versus fail-closed behavior, and co-locating critical services to cut cross-load-balancer (cross-LB) latency. They also replaced K6 with Envoy’s Nighthawk to generate realistic high-traffic loads and expose hidden reliability bottlenecks.
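To make the fail-open versus fail-closed trade-off concrete, here is a minimal Python sketch of a proxy deciding what to do when a rate-limit check does not answer in time. It is not Envoy's implementation or the speakers' code. `decide`, `rate_limit_check`, and the timeout value are hypothetical names chosen for illustration.

```python
import concurrent.futures


def decide(rate_limit_check, timeout_s, fail_open):
    """Return True to admit the request, False to reject it.

    rate_limit_check: hypothetical callable that returns True (allow)
                      or False (deny) after consulting a rate-limit service.
    timeout_s:        how long to wait for an answer, in seconds.
    fail_open:        policy applied when no answer arrives in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rate_limit_check)
    try:
        # Normal path: the service answered, so its verdict is used as-is.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No answer in time: fail open admits the request,
        # fail closed rejects it.
        return fail_open
    finally:
        # Do not block the caller on a check that is still running.
        pool.shutdown(wait=False)
```

With a slow check and a short timeout, `decide(check, 0.05, fail_open=True)` admits the request while `fail_open=False` rejects it. In a latency-critical path the choice is between briefly skipping rate limiting and surfacing an error to the client, which is why the timeout and the failure policy have to be tuned together.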

On the observability side, they added gRPC metadata tags for tracing and rolled out the changes without any production hiccups. Catch these hands-on insights and more at KubeCon + CloudNativeCon North America in Atlanta (Nov 10–13).

