Replicated service mesh: hardening systems against failure modes in load balancing, distributed state, lifecycle management, configuration and release pushes
Architectures, scalability
Systems Engineer, Site Reliability Engineer.
Oleg's team is responsible for the reliability, availability and fault tolerance of Google's core infrastructure, which handles all authentication and most authorization flows for all kinds of Google accounts and resources.
Previously Oleg worked as a Technical Solutions Engineer at Google for G Suite products, serving as the global support Tech Lead for Gmail.
His experience prior to Google includes research work at the German Aerospace Agency, BMW and the Technical University of Munich.
This talk covers failure modes observed in practical distributed applications at Google over the years, and best practices to prevent or handle them. We first consider a typical containerized service mesh (a replicated vertical stack of application services), then delve into its potential failure modes, how to handle them, and best practices for:
* L7 load balancing that accounts for failures and equalizes utilization;
* overload handling, circuit breaking and throttling;
* config, experiment and binary pushes;
* critical data state management & consistency;
* work conservation across the stack;
* replicated application state & cacheability;
* cross-stack request routing;
* dependency degradation or unavailability.
Throughout the talk I'll point to existing open source tooling and frameworks that incorporate these best practices or give you the knobs to get close.
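To make one item from the list above concrete, here is a minimal sketch of circuit breaking in Python. It is not taken from the talk or from any particular framework: the `CircuitBreaker` class, its parameters (`max_failures`, `reset_after`) and its method names are hypothetical and chosen only for illustration. The idea is that after several consecutive failures a client stops calling a struggling dependency for a cool-down period, rejecting requests locally instead of adding load.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative, not a library API)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def allow(self):
        """Return True if a request may be sent to the dependency."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let requests through again as probes.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()

    def call(self, fn, *args, **kwargs):
        """Run fn through the breaker, failing fast while the circuit is open."""
        if not self.allow():
            raise RuntimeError("circuit open: request rejected locally")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

A caller would wrap each outbound request in `breaker.call(...)` and treat the `RuntimeError` as a signal to serve a degraded response or a cached value. Production implementations usually add per-backend state, limits on concurrent probes and error-rate windows, all of which this sketch omits.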