From approximately 2018/05/12 02:15 to 05:46 UTC (3h 31m), users were unable to push new apps to PWS. There was an initial outage, a brief recovery, and a second (expected) outage while the fix was put in place. The y-axis of the chart below is a measure of success; "1" means pushes are succeeding, and "0" means pushes are failing.
CloudOps, which maintains PWS, mistakenly deployed a known bad version of CAPI (the Cloud Controller API).
Version 1.56.0 of CAPI introduced a bug where repeated requests to v3 endpoints result in too many open
statsd connections, which in turn exhausted the number of available open file descriptors (1024). This meant new sockets could not be established, which resulted failed requests and/or crashed CCs.
The broken CAPI release was part of a larger
cf-deployment release (1.32.0), which contained a UAA security fix. CloudOps intended to add an operations manifest file to exclude the CAPI update, but didn't.
Users were unable to push apps to PWS during the outage or otherwise interact with the Cloud Controller. Existing apps were unaffected.
As the file descriptor exhaustion caused Cloud Controller API to become a "bad actor", the error messages were legion:
Failed to perform blobstore operation after three retries.
Stats server temporarily unavailable.
The UAA service is currently unavailable
An unknown error occurred.
CloudOps deployed the old, good version (1.55.0) of CAPI. This required a manual database rollback and down migrations aren't supported.