DigitalOcean Redefines Availability Metrics: From Incident Counting to SLIs


DigitalOcean has announced a significant update to its platform availability metrics, aiming to provide a more accurate representation of customer experiences. This overhaul, which began in early 2025, addresses discrepancies between reported availability and actual user experiences, moving away from an incident-based measurement system that failed to capture the nuances of service performance. The new framework introduces Service Level Indicators (SLIs) that promise to deliver a clearer picture of reliability across its services.

Identifying the Problem with Traditional Metrics

At the start of 2025, DigitalOcean's internal analysis revealed a troubling gap between reported availability figures and customer experiences. Monthly availability fluctuated between 99.5% and 99.9%, largely influenced by whether high-severity incidents were declared. This incident-based approach treated any declared incident as a complete outage while ignoring lower-severity issues that still affected users. As DigitalOcean expanded its offerings, this method became increasingly inadequate and created a structural trap: counting lower-severity incidents would have artificially dragged down the headline availability number, which discouraged declaring them at all.

The previous formula for calculating availability was straightforward but flawed. If no incidents were declared in a week, the platform was considered 100% available. However, if an incident lasted two hours, those minutes were deducted from the total available time. This simplistic approach led to three main issues:

  • The assumption that all customers experienced total downtime during an incident.
  • Lack of volume weighting across different products, treating low-traffic services the same as high-traffic ones.
  • A one-size-fits-all formula that did not accurately reflect individual product performance.
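The old formula can be sketched as a simple uptime subtraction, which makes its flaws easy to see. This is an illustrative reconstruction with hypothetical numbers, not DigitalOcean's actual code:

```python
# Sketch of the old incident-based availability formula (hypothetical).
# Any declared incident is treated as a total outage for its full
# duration, regardless of how many customers were actually affected.

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

def incident_based_availability(incident_minutes: list[int]) -> float:
    """Old model: subtract every declared incident's duration from total time."""
    downtime = sum(incident_minutes)
    return (MINUTES_PER_WEEK - downtime) / MINUTES_PER_WEEK

# A single two-hour incident drags the whole week down,
# even if only a small fraction of customers noticed it.
print(round(incident_based_availability([120]), 4))  # 0.9881

# Declaring zero incidents reports a perfect week,
# even if many customers saw degraded performance.
print(incident_based_availability([]))  # 1.0
```

The binary nature of the formula is the problem: a week is either docked for a whole incident or not docked at all, with nothing in between.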

Introducing Control Plane and Data Plane Measurements

To rectify these shortcomings, DigitalOcean decided to separate its measurements into two distinct planes: Control Plane and Data Plane. Each plane utilizes its own SLI methodology tailored to specific operational contexts.

Control Plane Metrics

The Control Plane encompasses orchestration tasks such as API calls and Cloud Control Panel operations. Here, the SLI is based on the success rate of valid requests, counting only server errors (5xx) as failures. Client errors (4xx) remain in the denominator but do not count against reliability metrics, since they stem from user mistakes rather than platform issues.

Separating the planes also reflects an operational reality: even when users encounter errors in API or Cloud Control Panel operations due to Control Plane degradation, their ongoing workloads remain unaffected. The focus is now on measuring what truly impacts customer experience rather than simply counting incidents.
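A Control Plane SLI of this shape can be sketched as a success ratio over HTTP status codes, where only 5xx responses fail. The helper below is a hypothetical illustration, not DigitalOcean's implementation:

```python
# Sketch of the Control Plane SLI as described: only server errors (5xx)
# count as failures; client errors (4xx) stay in the denominator but are
# not held against the platform. (Hypothetical helper.)

def control_plane_sli(status_codes: list[int]) -> float:
    """Success ratio of valid requests, where only 5xx responses fail."""
    total = len(status_codes)
    if total == 0:
        return 1.0  # no traffic means nothing to hold against the platform
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return (total - server_errors) / total

# 10 requests: 7 OK, 2 client errors (user mistakes), 1 server error.
codes = [200] * 7 + [404, 400] + [503]
print(control_plane_sli(codes))  # 0.9 -- the 4xx responses are not failures
```

Keeping 4xx in the denominator matters: dropping those requests entirely would shrink the sample and make the ratio noisier without changing what the platform is accountable for.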

Data Plane Metrics

The Data Plane measures live product instances such as CPU and GPU Droplets, DOKS Clusters, and Serverless Inference endpoints. Unlike the Control Plane, where failures are more uniform, different products within the Data Plane fail in unique ways, necessitating tailored measurement approaches.

For instance:

  • Droplet Networking: A Droplet is considered available when all connectivity probes pass simultaneously; any single failure counts against its availability.
  • DOKS Clusters: Availability is contingent upon multiple conditions being met simultaneously; failure in any component results in downtime for that minute.
  • Spaces Storage: Availability is measured by the ratio of successful responses (non-5xx) at the storage load balancer level.
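The Droplet networking rule above can be sketched as a per-minute check: a Droplet counts as available for a minute only when every connectivity probe in that minute passed. The data model here is hypothetical:

```python
# Sketch of the per-minute Droplet networking availability rule: all
# probes in a minute must pass, or the whole minute counts as downtime.
# (Hypothetical data model, not DigitalOcean's implementation.)

def droplet_minute_available(probe_results: list[bool]) -> bool:
    """All probes must pass simultaneously; any single failure is downtime."""
    return all(probe_results)

def droplet_availability(minutes: list[list[bool]]) -> float:
    """Fraction of minutes in which the Droplet was fully reachable."""
    up = sum(1 for probes in minutes if droplet_minute_available(probes))
    return up / len(minutes)

# Four minutes of probe history; one minute has a single failing probe.
history = [
    [True, True, True],
    [True, False, True],  # one failed probe -> whole minute counts as down
    [True, True, True],
    [True, True, True],
]
print(droplet_availability(history))  # 0.75
```

The same all-conditions-must-hold pattern applies to the DOKS rule described above, just with cluster health checks in place of connectivity probes.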

This nuanced approach allows DigitalOcean to define failure states based on actual customer experiences rather than arbitrary thresholds.

Aggregation and Alerting Improvements

The next challenge was aggregating data across multiple regions with varying traffic volumes. DigitalOcean employed a weighted request average for Control Plane metrics so that smaller data centers contribute less to overall availability calculations compared to larger ones handling more traffic.

This method ensures that statistics accurately reflect service performance where it matters most—where customers are actively using resources. For Data Plane metrics, a similar magnitude-weighted approach was adopted to account for resource health across different products effectively.
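Request-weighted aggregation can be sketched as a weighted mean over regions, so that each region's SLI contributes in proportion to the traffic it served. The region figures below are hypothetical:

```python
# Sketch of request-weighted aggregation across regions: each region's
# SLI is weighted by the traffic it served, so small data centers can't
# skew the global number. (Hypothetical region figures.)

def weighted_availability(regions: list[tuple[int, float]]) -> float:
    """regions: (request_count, regional_sli) pairs."""
    total_requests = sum(count for count, _ in regions)
    return sum(count * sli for count, sli in regions) / total_requests

regions = [
    (9_000_000, 0.9995),  # large region, healthy
    (1_000_000, 0.9500),  # small region, degraded
]
# A naive unweighted mean would report 0.97475; weighting by traffic
# reflects what most customers actually experienced.
print(round(weighted_availability(regions), 5))  # 0.99455
```

For Data Plane products the weight would be a measure of resource magnitude (e.g. instance count) rather than request volume, but the arithmetic is the same.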

To enhance alerting mechanisms, DigitalOcean transitioned from single burn rate alerts to multi-window alerts that combine short-term and long-term data analysis. This system allows engineers to distinguish between temporary spikes in error rates and ongoing issues requiring immediate attention.
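A multi-window burn-rate alert of the kind described can be sketched as follows: page only when both a long window (the problem is sustained) and a short window (it is still happening now) burn the error budget faster than a threshold. The SLO target and threshold below are illustrative values borrowed from common SRE practice, not DigitalOcean's actual configuration:

```python
# Sketch of a multi-window burn-rate alert: fire only when BOTH the long
# and short windows exceed the burn-rate threshold. Values here are
# illustrative, drawn from common SRE practice.

SLO = 0.999  # 99.9% target, so the error budget is 0.1% of requests

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_ratio / (1 - SLO)

def should_page(long_window_errors: float, short_window_errors: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold."""
    return (burn_rate(long_window_errors) >= threshold and
            burn_rate(short_window_errors) >= threshold)

# A brief spike: the short window is hot, but the long window has recovered.
print(should_page(long_window_errors=0.002, short_window_errors=0.03))  # False

# A sustained problem: both windows exceed the threshold.
print(should_page(long_window_errors=0.02, short_window_errors=0.03))   # True
```

Requiring both windows to agree is what suppresses pages for transient spikes while still catching slow, sustained burns.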

Error Budgets as Decision-Making Tools

The introduction of error budgets, essentially an allowance for acceptable failures, has transformed how teams prioritize work at DigitalOcean. By tracking error-budget consumption against SLI-based thresholds, teams can decide whether to ship new features or focus on stability, depending on which zone they currently occupy, from green (healthy) to red (critical):

  • Green (0-60%): Normal operations allowed.
  • Yellow (61-80%): Caution advised; verify no impact on dependencies.
  • Orange (81-100%): Increased risk; large rollouts paused for reliability work.
  • Red (>100%): All rollouts halted except for critical fixes.
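The zone policy above maps directly onto a threshold lookup over budget consumption. The helper below is a hypothetical illustration of that mapping:

```python
# Sketch of mapping error-budget consumption to the zones described
# above. Consumption above 100% means the budget is exhausted.
# (Hypothetical helper, not DigitalOcean's tooling.)

def budget_zone(consumed_pct: float) -> str:
    """Map percentage of error budget consumed to a rollout-policy zone."""
    if consumed_pct > 100:
        return "red"      # all rollouts halted except critical fixes
    if consumed_pct > 80:
        return "orange"   # large rollouts paused for reliability work
    if consumed_pct > 60:
        return "yellow"   # caution; verify no impact on dependencies
    return "green"        # normal operations allowed

for pct in (25, 70, 95, 130):
    print(pct, budget_zone(pct))
# 25 green / 70 yellow / 95 orange / 130 red
```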

What This Means for Customers

This comprehensive overhaul of DigitalOcean's availability metrics signifies a commitment to transparency and improved customer experience. By implementing SLIs tailored to specific operational contexts, and by pairing sophisticated alerting mechanisms with error budgets as decision-making tools, DigitalOcean aims not only to improve internal performance but also to strengthen trust among its users. As newer products like GPU Droplets and AI agents roll out under this framework, customers can expect more reliable service backed by metrics that genuinely reflect their experiences rather than theoretical calculations of uptime.

For more information, read the original report here.

Neil S
Neil is a highly qualified Technical Writer with an M.Sc(IT) degree and an impressive range of IT and Support certifications including MCSE, CCNA, ACA(Adobe Certified Associates), and PG Dip (IT). With over 10 years of hands-on experience as an IT support engineer across Windows, Mac, iOS, and Linux Server platforms, Neil possesses the expertise to create comprehensive and user-friendly documentation that simplifies complex technical concepts for a wide audience.