Our journey to truly understand our customer experience began with a hard look at our internal availability numbers at the start of 2025. We saw something uncomfortable: the numbers didn’t match our customers’ reality. Our monthly availability oscillated between 99.5% and 99.9%. Those peaks and valleys depended more on whether we declared a high-severity incident that month than on how the platfor...
This case study reveals a broader pattern in tech: the tension between operational convenience and customer-centric metrics. DigitalOcean’s shift from incident-based to SLI-based measurement mirrors Google’s SRE practices, suggesting an industry-wide maturation in reliability engineering. The Control Plane/Data Plane split isn’t just technical—it reflects a philosophical acknowledgment that not all failures are equal. A slow API response is an inconvenience; a Droplet outage is existential for a...
.png)