PWS Experiencing API Issues
Incident Report for Pivotal Web Services

Summary

From approximately 2018/05/12 02:15 to 05:46 UTC (3h 31m), users were unable to push new apps to PWS. There was an initial outage, a brief recovery, and a second (expected) outage while the fix was put in place. The y-axis of the chart below is a measure of success; "1" means pushes are succeeding, and "0" means pushes are failing.

[Chart: PWS push success over time; 1 = pushes succeeding, 0 = pushes failing]

Root Cause

CloudOps, which maintains PWS, mistakenly deployed a known-bad version of CAPI (the Cloud Controller API).

Version 1.56.0 of CAPI introduced a bug where repeated requests to v3 endpoints resulted in too many open statsd connections, which in turn exhausted the available open file descriptors (limit: 1024). Once the limit was reached, new sockets could not be established, which resulted in failed requests and/or crashed Cloud Controllers (CCs).
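The failure mode described above can be illustrated with a minimal sketch. This is not CAPI's code (CAPI is written in Ruby), and the exact leak mechanism is not described in this report; the sketch simply assumes a client that opens a new socket per request and never closes it. The class names, the metric name, and the request count are all hypothetical.

```python
import resource
import socket


def fd_soft_limit():
    """Return this process's soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


class LeakyMetricsClient:
    """Hypothetical client: opens a new UDP socket per call and keeps it open."""

    def __init__(self):
        self.sockets = []

    def increment(self, metric):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sockets.append(sock)  # never closed: one descriptor leaks per call

    def open_descriptors(self):
        return len(self.sockets)

    def close(self):
        for sock in self.sockets:
            sock.close()
        self.sockets.clear()


class SharedMetricsClient:
    """Hypothetical client: reuses a single socket for every call."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def increment(self, metric):
        pass  # a real client would send the metric on self.sock here

    def open_descriptors(self):
        return 1

    def close(self):
        self.sock.close()


def simulate(client, requests):
    """Record one metric per simulated request; return descriptors held."""
    for _ in range(requests):
        client.increment("cc.requests.v3")  # hypothetical metric name
    return client.open_descriptors()


if __name__ == "__main__":
    leaky, shared = LeakyMetricsClient(), SharedMetricsClient()
    print("soft fd limit:", fd_soft_limit())
    print("leaky client descriptors:", simulate(leaky, 50))
    print("shared client descriptors:", simulate(shared, 50))
    leaky.close()
    shared.close()
```

With the leaky pattern, descriptor usage grows linearly with request count, so a busy process eventually hits its limit; with a shared socket, usage stays constant regardless of traffic.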

The broken CAPI release was part of a larger cf-deployment release (1.32.0), which contained a UAA security fix. CloudOps intended to add an operations manifest file to exclude the CAPI update, but didn't.
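One possible guard against this kind of omission is an automated pre-deploy check that rejects a manifest containing a release version already known to be bad. The sketch below is illustrative only and is not a description of CloudOps tooling: the function name, the denylist structure, and the UAA version string are assumptions. It relies only on the fact that a deployment manifest lists releases by name and version.

```python
def find_blocked_releases(releases, denylist):
    """Return (name, version) pairs from `releases` that appear in `denylist`.

    `releases` is a list of dicts with "name" and "version" keys.
    `denylist` maps a release name to a set of known-bad version strings.
    """
    blocked = []
    for release in releases:
        name = release.get("name")
        version = str(release.get("version"))
        if version in denylist.get(name, set()):
            blocked.append((name, version))
    return blocked


# Known-bad versions; the CAPI entry comes from this incident.
KNOWN_BAD = {"capi": {"1.56.0"}}

# Hypothetical excerpt of a manifest's releases section.
manifest_releases = [
    {"name": "capi", "version": "1.56.0"},
    {"name": "uaa", "version": "0.0.0-example"},  # placeholder, not a real version
]

if __name__ == "__main__":
    problems = find_blocked_releases(manifest_releases, KNOWN_BAD)
    if problems:
        for name, version in problems:
            print(f"refusing to deploy: {name} {version} is known bad")
    else:
        print("no known-bad releases found")
```

A check like this would fail the deploy whenever the excluding operations file is missing, rather than depending on someone remembering to add it.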

Impact

Users were unable to push apps to PWS during the outage or otherwise interact with the Cloud Controller. Existing apps were unaffected.

As the file descriptor exhaustion caused Cloud Controller API to become a "bad actor", the error messages were legion:

  • Failed to perform blobstore operation after three retries.
  • Stats server temporarily unavailable.
  • The UAA service is currently unavailable
  • An unknown error occurred.

Resolution

CloudOps deployed the previous, known-good version of CAPI (1.55.0). This required a manual database rollback because down migrations aren't supported.

Posted 2 days ago. May 18, 2018 - 13:55 PDT

Resolved
This incident has been resolved.
Posted 8 days ago. May 12, 2018 - 12:27 PDT
Monitoring
Our fix appears to have worked; we'll continue to monitor. If you continue to see any issues, please contact support@run.pivotal.io.
Posted 9 days ago. May 11, 2018 - 23:01 PDT
Identified
We are rolling out a fix that will incur a slight amount of API downtime. You may experience HTTP 500 for certain requests during the update.
Posted 9 days ago. May 11, 2018 - 22:35 PDT
Update
We've put a temporary mitigation in place while we work on repairing the API.
Posted 9 days ago. May 11, 2018 - 20:43 PDT
Update
We are continuing to investigate this issue.
Posted 9 days ago. May 11, 2018 - 20:10 PDT
Investigating
We are currently investigating an issue with the PWS API. We are observing push failures.
Posted 9 days ago. May 11, 2018 - 20:08 PDT
This incident affected: Pivotal Web Services API.