Inability to push new apps
Incident Report for Pivotal Web Services
Postmortem

On Monday, April 23, from 18:07 to 19:03, PWS experienced cf delete failures. The cause was a deploy that switched Cloud Controller to Bits Service without the corresponding BOSH DNS records for Bits Service.

The chart below shows a sharp increase in errors during the failed deployment and subsequent rollback:

[Chart: screenshot taken 2018-04-23 at 3:38:14 PM]

Impact

Users were unable to delete their apps. Delete requests hung for several minutes, in some cases more than twenty, before failing with an error message similar to the following:

Deleting app my-app in org my-org / space my-space as my-user...
FAILED
Server error, status code: 502, error code: 0, message: 

Root Cause

The missing DNS record left the Cloud Controller unable to reach Bits Service, and therefore its blobstore. As a result, it could not perform operations that require access to buildpacks, droplets, appstashes, packages, and the buildpack cache.

This raises the question, "Why was cf delete affected but not cf push?"

The answer rests on a number of factors:

  • cf push is handled by the API instance
  • cf delete is offloaded by the API instance to one of the four Cloud Controller (CC) worker instances
  • The deploy updated all of the CC worker instances to use Bits Service
  • The deploy did not update all of the API instances to use Bits Service. More precisely, only one API instance, the canary, was updated. The canary update failed, which halted the bosh deploy.
  • The remaining API instances were not updated to use Bits Service, and thus were not affected when handling cf push requests.
  • The CC workers were updated, were unable to reach the blobstore, and were unable to honor cf delete requests.
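
The factors above can be illustrated with a small model of a canary-style rollout. This is a sketch under stated assumptions, not a description of how BOSH is implemented: the `Instance` class, the function names, the instance names, and the API instance count of three are all hypothetical, and the worker canary is simply assumed to have passed its health check because the report states that all workers were updated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    name: str
    uses_bits_service: bool = False  # True once the deploy has updated it


def canary_deploy(group: List[Instance], canary_ok: Callable[[Instance], bool]) -> bool:
    """Update the first instance as a canary; continue only if it is healthy."""
    canary, rest = group[0], group[1:]
    canary.uses_bits_service = True
    if not canary_ok(canary):
        return False  # deploy halts; the rest keep the old configuration
    for inst in rest:
        inst.uses_bits_service = True
    return True


def can_reach_blobstore(inst: Instance, bits_dns_present: bool) -> bool:
    """Updated instances need the Bits Service DNS record; others do not."""
    return bits_dns_present or not inst.uses_bits_service


workers = [Instance(f"worker/{i}") for i in range(4)]  # four workers, per the report
apis = [Instance(f"api/{i}") for i in range(3)]        # API count is hypothetical

# Assumption: the worker canary passed its check; the API canary did not.
workers_done = canary_deploy(workers, canary_ok=lambda inst: True)
apis_done = canary_deploy(apis, canary_ok=lambda inst: False)

# With the DNS record missing, check who can still reach the blobstore.
delete_ok = [can_reach_blobstore(w, bits_dns_present=False) for w in workers]
push_ok = [can_reach_blobstore(a, bits_dns_present=False) for a in apis[1:]]

print("cf delete served by workers:", delete_ok)
print("cf push served by non-canary API instances:", push_ok)
```

In this model every worker fails to reach the blobstore, while the API instances that were never updated continue to work, which matches the observed split between cf delete and cf push.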

Resolution

We updated the BOSH DNS records to include bits-service.service.cf.internal. We then redeployed PWS with the Bits Service job.
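
A pre-deploy check along the following lines could catch a missing record before instances are switched over. This is a minimal sketch, not part of the PWS tooling: the function names, the stub resolver, and the second hostname are hypothetical, and only bits-service.service.cf.internal comes from this report. The default resolver is Python's `socket.getaddrinfo`; the stub is used here so the example runs without network access.

```python
import socket
from typing import Callable, Iterable, List


def resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the resolver finds at least one address for hostname."""
    try:
        return bool(resolver(hostname, None))
    except socket.gaierror:
        return False


def missing_records(hostnames: Iterable[str],
                    resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the hostnames that do not resolve, in input order."""
    return [h for h in hostnames if not resolves(h, resolver)]


def stub_resolver(hostname, port):
    """Stand-in for DNS that knows a single hypothetical name."""
    known = {"cloud-controller.example.internal": "10.0.0.5"}
    if hostname not in known:
        raise socket.gaierror(-2, "Name or service not known")
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (known[hostname], 0))]


required = [
    "bits-service.service.cf.internal",   # from the report
    "cloud-controller.example.internal",  # hypothetical
]
print(missing_records(required, resolver=stub_resolver))
```

Run from a host that uses the deployment's DNS, with the default resolver instead of the stub, a non-empty result would be a signal to stop before deploying.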

Users need to manually delete any apps that they were unable to delete during the outage.

Posted May 01, 2018 - 14:18 PDT

Resolved
Clients should be able to push new apps. We have rolled back the new feature and are investigating the failure.
Posted Apr 23, 2018 - 12:09 PDT
Identified
We are seeing failures with pushing new apps to PWS. We believe this is due to a new feature we enabled and are rolling the feature back.
Posted Apr 23, 2018 - 11:45 PDT