From approximately 2018/05/14 01:30 to 15:45 (14 hrs), the SSL Service broker failed during service instance creation. Existing service instances continued to route traffic during this outage but users may have been unable to update certificates or create new service instances.
Each SSL service requires a certificate, and our IaaS quota restricts the number of certificates to 500. As we have added service instances, and as our CI occasionally orphans certificates, we have exhausted our quota.
Although this outage is similar to the earlier one this week, the root cause was different: this one was caused by a certificate quota, and the earlier one was caused by IP address space exhaustion.
Users were unable to create new SSL Service instances or update certificates on existing service instances. Requests to create new instances were failing with the following error:
There was a problem provisioning resources. Please try submitting again. If the problem persists, please file a support ticket. (AWS error 409: Cannot exceed quota for ServerCertificatesPerAccount: 500)
The Impact timeline can be seen from the
pws.ssl_sli.status metric, viewed from 2018/05/03 - 04. Times are in PDT. "1" is good, "0" is bad:
The impact on our Service Level Indicator (SLI) captured approximately 2 hours after resolution:
We deleted 319 orphaned certificates.