From approximately 2018/04/28 19:00 UTC until 2018/05/01 17:00 UTC, the SSL Service broker was failing during service instance creation. Existing service instances continued to function during this outage.
The SSL service creates a new load balancer for each service instance. These load balancers are all assigned to the same two subnets (one in each Availability Zone). As we added more service instances, the pool of available IP addresses in these subnets diminished. Load balancers require a minimum of 8 available IP addresses in each subnet, and the two subnets our load balancers were using were down to 6 and 10 available IPs respectively.
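A load balancer launch fails when any of its subnets has fewer than 8 free IP addresses. A minimal sketch of that check, with illustrative data (the second subnet ID is hypothetical; real counts would come from the `AvailableIpAddressCount` field of the EC2 `DescribeSubnets` API):

```python
# Minimum free IPs AWS requires in each subnet to create an ELB.
ELB_MIN_FREE_IPS = 8

def subnets_blocking_elb(subnets):
    """Return the IDs of subnets too full for load balancer creation."""
    return [s["id"] for s in subnets if s["free_ips"] < ELB_MIN_FREE_IPS]

# Roughly the state we were in: one subnet at 6 free IPs, one at 10.
# (Counts are illustrative; in practice they come from ec2:DescribeSubnets.)
subnets = [
    {"id": "subnet-50efb93c", "free_ips": 6},
    {"id": "subnet-aaaaaaaa", "free_ips": 10},
]

print(subnets_blocking_elb(subnets))  # → ['subnet-50efb93c']
```

Only one subnet was actually below the threshold, but ELB creation requires headroom in every subnet, so the whole create-service call failed.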
Users were unable to create new SSL Service instances. Requests to create new instances were failing with the following error:
```
[2018-05-01 16:35:12.68 (UTC)]> cf create-service ssl free 87eb557a-25e6-4b8a-4cf1-6f6de317b18b
Creating service instance 87eb557a-25e6-4b8a-4cf1-6f6de317b18b in org ssl-automated-test-org / space ssl-automated-test-space as ssl-sa-1...
FAILED
Server error, status code: 502, error code: 10001, message: Service broker error: CreateLoadBalancer: InvalidSubnet: Not enough IP space available in subnet-50efb93c. ELB requires at least 8 free IP addresses in each subnet.
        status code: 400, request id: 9f16d192-4d5d-11e8-add8-47d81bebda7e
```
The impact timeline can be seen in our monitoring graphs (image not reproduced here).
The impact on our Service Level Indicator (SLI) was captured approximately 2 hours after resolution (graph not reproduced here).
We deleted 73 service instances that had been left behind by the prod-ssl-sli pipeline job. We saw an immediate increase in available IPs in both subnets, and our SLIs started passing again.
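One way the leftover SLI test instances could be distinguished from real user instances is by name: the pipeline creates instances with UUID names (as in the error log above). A hypothetical sketch of that filter; the naming convention is an assumption, and a real cleanup would list instances via `cf services` and delete matches with `cf delete-service`:

```python
import re

# SLI test instances are assumed to use bare UUID names, e.g.
# 87eb557a-25e6-4b8a-4cf1-6f6de317b18b from the error log above.
UUID_NAME = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def leftover_instances(service_instance_names):
    """Return the instance names that match the SLI test naming pattern."""
    return [name for name in service_instance_names if UUID_NAME.match(name)]

names = ["87eb557a-25e6-4b8a-4cf1-6f6de317b18b", "my-prod-cert"]
print(leftover_instances(names))  # only the UUID-named test instance
```

Each deleted instance releases its load balancer's IP allocations, which is why the available-IP counts in both subnets recovered immediately.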