SSL Service: Unable to create service instances
Incident Report for Pivotal Web Services
Postmortem

Summary

From approximately 2018/04/28 19:00 UTC until 2018/05/01 17:00 UTC, the SSL Service broker was failing during service instance creation. Existing service instances continued to function during this outage.

Root Cause

The SSL Service creates a new load balancer for each service instance. These load balancers are all assigned to the same two subnets (one in each Availability Zone). As more service instances were added, the pool of available IP addresses in these subnets diminished. A load balancer requires at least 8 available IP addresses in each subnet. The two subnets our load balancers were using had been reduced to 6 and 10 available IPs, so one subnet had fallen below the minimum and new load balancers could not be created.
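The capacity condition described above can be illustrated with a minimal sketch. It assumes the free IP count per subnet has already been collected; the function name and the "subnet-other" label are hypothetical, and pairing the count of 6 with subnet-50efb93c is an inference from the error message quoted in the Impact section, not something this report states directly.

```python
# Minimum free IPs per subnet, as stated in the ELB error message in this report.
MIN_FREE_IPS = 8


def subnets_blocking_creation(free_ips_by_subnet, minimum=MIN_FREE_IPS):
    """Return the subnets whose free IP count is below the minimum, sorted by name."""
    return sorted(
        subnet for subnet, free in free_ips_by_subnet.items() if free < minimum
    )


# Counts from this incident. Assigning 6 to subnet-50efb93c is an assumption;
# "subnet-other" is a placeholder for the second subnet, whose ID is not given.
free_ips = {"subnet-50efb93c": 6, "subnet-other": 10}
print(subnets_blocking_creation(free_ips))  # prints ['subnet-50efb93c']
```

A single subnet below the minimum is enough to block creation, because the requirement applies to each subnet the load balancer is assigned to.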

Impact

Users were unable to create new SSL Service instances. Requests to create new instances were failing with the following error:

[2018-05-01 16:35:12.68 (UTC)]> cf create-service ssl free 87eb557a-25e6-4b8a-4cf1-6f6de317b18b 

Creating service instance 87eb557a-25e6-4b8a-4cf1-6f6de317b18b in org ssl-automated-test-org / space ssl-automated-test-space as ssl-sa-1...

FAILED

Server error, status code: 502, error code: 10001, message: Service broker error: CreateLoadBalancer: InvalidSubnet: Not enough IP space available in subnet-50efb93c. ELB requires at least 8 free IP addresses in each subnet.

    status code: 400, request id: 9f16d192-4d5d-11e8-add8-47d81bebda7e

The Impact timeline can be seen from the pws.ssl_sli.status metric:

[Screenshot taken 2018-05-01 at 2:34:14 PM]

The impact on our Service Level Indicator (SLI), as captured approximately 2 hours after resolution:

[Screenshot taken 2018-05-01 at 3:15:29 PM]

Resolution

Short Term

We deleted 73 service instances that had been left behind by the prod-ssl-sli pipeline job. We saw an immediate increase in available IPs in both subnets and our SLIs started passing.

Long Term

  • Increase the number of IPs available to SSL Service load balancers
  • Ensure that apps and service instances created by the SSL Service are deleted
  • Add alerts for SSL Service
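
The cleanup item in the list above could take roughly the following shape. This is only a sketch under stated assumptions, not the team's actual tooling: the instance names, the one-hour age limit, and both helper functions are hypothetical, and the list of instances with creation times is assumed to come from elsewhere. The only real command used is `cf delete-service NAME -f`, where `-f` skips the confirmation prompt.

```python
from datetime import datetime, timedelta, timezone


def stale_instances(instances, now, max_age=timedelta(hours=1)):
    """Return the names of instances created more than max_age before now."""
    return [name for name, created_at in instances if now - created_at > max_age]


def delete_commands(names):
    """Build one `cf delete-service` command per instance name."""
    return [["cf", "delete-service", name, "-f"] for name in names]


# Illustrative data only; these names and timestamps are not from the incident.
now = datetime(2018, 5, 1, 17, 0, tzinfo=timezone.utc)
instances = [
    ("sli-test-old", datetime(2018, 5, 1, 12, 0, tzinfo=timezone.utc)),
    ("sli-test-new", datetime(2018, 5, 1, 16, 45, tzinfo=timezone.utc)),
]
for command in delete_commands(stale_instances(instances, now)):
    print(" ".join(command))  # prints: cf delete-service sli-test-old -f
```

Building the commands as argument lists keeps the selection logic testable without a Cloud Foundry login; a scheduled job could pass each list to a process runner once the selection has been reviewed.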
Posted 5 months ago. May 01, 2018 - 13:31 PDT

Resolved
The SSL Service has been restored and service instance creation is functioning. We are satisfied that this incident is resolved.
Posted 5 months ago. May 01, 2018 - 13:06 PDT
Monitoring
The SSL Service has been restored and service instance creation is functioning.
Posted 5 months ago. May 01, 2018 - 10:03 PDT
Identified
We have identified an issue affecting the SSL Service which is preventing new service instances from being created. Existing service instances are not affected.
Posted 5 months ago. May 01, 2018 - 09:36 PDT