Fail Open, Fail Fast: Improving Envoy Resilience in Latency-Critical Systems
In our large distributed storage setup, small network delays between Envoy and upstream services were triggering “no such bucket” errors and dropped requests at peak load. This talk walks through how we tuned Envoy’s fail-open vs. fail-closed behavior, co-located critical services to cut cross-load-balancer latency, and weighed the architectural trade-offs involved in keeping the system stable. We also swapped out k6 for Nighthawk, Envoy’s load-testing tool, to simulate real-world traffic and uncover hidden bottlenecks.
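To make the fail-open vs. fail-closed choice concrete, here is a minimal sketch in Python. It is not Envoy code and not the configuration from the talk: `allow_request`, `slow_check`, and the timeout values are hypothetical names and numbers chosen for illustration. The idea is that when a dependency check times out or errors, a fail-open policy lets the request through, while a fail-closed policy rejects it. Envoy's ext_authz HTTP filter exposes a `failure_mode_allow` flag for this kind of choice; the abstract does not say which filter or setting was actually tuned.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def allow_request(check, timeout_s, fail_open):
    """Run a dependency check with a deadline and apply a failure policy.

    check: zero-argument callable returning True (allow) or False (deny).
    timeout_s: how long to wait for the check before giving up.
    fail_open: policy used when the check times out or raises.
               True  -> allow the request anyway (fail open)
               False -> reject the request (fail closed)
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        # A check that completes in time decides the outcome itself.
        return bool(future.result(timeout=timeout_s))
    except Exception:
        # Timeout or an error inside the check: fall back to the policy.
        return fail_open
    finally:
        # Do not block the caller on a check that is still running.
        pool.shutdown(wait=False)


def slow_check():
    """Hypothetical dependency that answers later than the deadline."""
    time.sleep(0.2)
    return True


if __name__ == "__main__":
    print("fail open:  ", allow_request(slow_check, 0.05, fail_open=True))
    print("fail closed:", allow_request(slow_check, 0.05, fail_open=False))
```

With a short deadline and a slow dependency, the fail-open call lets the request proceed and the fail-closed call rejects it. Which policy is right depends on whether a wrongly allowed request or a wrongly dropped one costs more, and that is the kind of trade-off the talk describes.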
On the observability front, we’ll share how tagging requests with gRPC metadata gave us much clearer insight, and how we rolled out all these changes without a single production outage. Come snag more practical tips and war stories at KubeCon + CloudNativeCon North America in Atlanta (Nov 10-13)!