SSL Service: Unable to create service instances
Incident Report for Pivotal Web Services
Postmortem

Summary

From approximately 2018/05/04 01:30 to 15:45 (about 14 hours), the SSL Service broker failed during service instance creation. Existing service instances continued to route traffic during the outage, but users may have been unable to update certificates or create new service instances.

Root Cause

Each SSL Service instance requires a certificate, and our IaaS quota limits the account to 500 certificates. As we added service instances, and as our CI occasionally orphaned certificates, we exhausted that quota.
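For illustration only, the quota arithmetic can be sketched as a small headroom check. This is a minimal sketch, not the tooling used on the platform: the `QuotaStatus` class, the `should_alert` function, and the 80% alert threshold are all hypothetical. The post-cleanup figure assumes that exactly 500 certificates existed at the time of the outage and that only the 319 orphans reported under Resolution were removed.

```python
from dataclasses import dataclass

# Limit quoted in this report (ServerCertificatesPerAccount).
CERT_QUOTA = 500


@dataclass
class QuotaStatus:
    """Snapshot of certificate usage against the account quota."""
    used: int
    limit: int

    @property
    def remaining(self) -> int:
        # Never report negative headroom.
        return max(self.limit - self.used, 0)

    @property
    def utilization(self) -> float:
        return self.used / self.limit


def should_alert(status: QuotaStatus, threshold: float = 0.8) -> bool:
    """Return True once usage reaches the (hypothetical) alert threshold."""
    return status.utilization >= threshold


# Assumption: the quota was completely full when provisioning failed.
at_outage = QuotaStatus(used=500, limit=CERT_QUOTA)
# Assumption: only the 319 orphaned certificates were deleted afterwards.
after_cleanup = QuotaStatus(used=500 - 319, limit=CERT_QUOTA)

print(at_outage.remaining, should_alert(at_outage))
print(after_cleanup.remaining, should_alert(after_cleanup))
```

A check of this kind, run on a schedule, would flag rising usage before the quota is exhausted rather than after provisioning starts to fail.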

Although this outage is similar to the earlier one this week, the root cause was different: this one was caused by a certificate quota, and the earlier one was caused by IP address space exhaustion.

Impact

Users were unable to create new SSL Service instances or update certificates on existing service instances. Requests to create new instances were failing with the following error:

There was a problem provisioning resources. Please try submitting again. If the problem persists, please file a support ticket. (AWS error 409: Cannot exceed quota for ServerCertificatesPerAccount: 500)

The impact timeline can be seen in the pws.ssl_sli.status metric for 2018/05/03 - 04. Times are in PDT. "1" is good, "0" is bad:

[Image: ssl_slis]

The impact on our Service Level Indicator (SLI), as captured approximately 2 hours after resolution:

[Image: ssl_slis_2]

Resolution

Short Term

We deleted 319 orphaned certificates.
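Identifying orphans amounts to comparing the certificates that exist against the certificates that live service instances still reference. The following is a minimal sketch of that comparison, not the procedure actually followed. The `find_orphaned` function and all certificate names are made up for illustration, and how the two inventories are gathered (for example from the IaaS API and the service broker's records) is left as an assumption.

```python
from typing import Iterable, List, Set


def find_orphaned(all_cert_names: Iterable[str], in_use: Set[str]) -> List[str]:
    """Return certificate names that no live service instance references."""
    return sorted(name for name in all_cert_names if name not in in_use)


# Hypothetical inventory: these names are invented for the example.
all_certs = ["ssl-prod-a", "ssl-prod-b", "ci-test-001", "ci-test-002"]
# Hypothetical set of certificates still attached to service instances.
referenced = {"ssl-prod-a", "ssl-prod-b"}

orphans = find_orphaned(all_certs, referenced)
print(orphans)
```

Reviewing the resulting list before deleting anything guards against removing a certificate that is in use but missing from the reference set.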

Long Term

  • Ensure that apps and service instances created by the SSL Service are deleted
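One common way to guarantee cleanup in a test suite is to tie deletion to resource creation, so that a failing test cannot leave anything behind. The sketch below shows that pattern with a context manager. It is an assumption about how such cleanup could be structured, not a description of the actual CI. `temporary_instance`, `fake_create`, and `fake_delete` are hypothetical, and the in-memory list stands in for real create and delete calls.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def temporary_instance(create: Callable[[], str],
                       delete: Callable[[str], None]) -> Iterator[str]:
    """Create a test resource and always delete it, even if the test fails."""
    instance_id = create()
    try:
        yield instance_id
    finally:
        # Runs on success and on failure, so nothing is orphaned.
        delete(instance_id)


# In-memory stand-ins for the real create/delete calls (hypothetical).
live: List[str] = []


def fake_create() -> str:
    live.append("ci-instance-1")
    return "ci-instance-1"


def fake_delete(instance_id: str) -> None:
    live.remove(instance_id)


try:
    with temporary_instance(fake_create, fake_delete) as instance_id:
        assert instance_id in live
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass

print(live)  # the instance was removed despite the failure
```

This covers failures inside the test itself. A run that is killed outright would still need a periodic sweep for leftovers.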
Posted May 04, 2018 - 12:27 PDT

Resolved
This incident has been resolved.
Posted May 04, 2018 - 12:25 PDT
Monitoring
We found a number of certificates left over from our continuous testing of the SSL Service. We've deleted the certificates and have observed the successful creation of an SSL Service instance. We'll continue to monitor.
Posted May 04, 2018 - 08:52 PDT
Investigating
We are investigating an issue affecting the SSL Service that is preventing new service instances from being created. Existing service instances will continue to route traffic; however, users may experience errors when updating certificates.
Posted May 04, 2018 - 07:15 PDT