Alerting with Time Series
In a Cloud Native infrastructure, failure is normal and expected. The systems running a datacenter gracefully handle the loss of a single node or a dozen hard drives, so there is no reason to page someone at 4 a.m. This calls for an alerting system that understands service availability at a global scope, yet can still give detailed reports if and when a service-impacting incident occurs. This talk explores how time-series-based alerting solves this problem, the Prometheus architecture behind it, and how to implement practical anomaly detection.