On Monday, April 23, from 18:07 to 19:03, PWS experienced failures with cf delete, caused by a deploy that lacked the corresponding BOSH DNS records for Bits Service.
The chart below shows a sharp increase in errors during the failed deployment and subsequent rollback:
Users were unable to delete their apps. They experienced delays ranging from several minutes to more than twenty minutes, followed by an error message similar to the following:
Deleting app my-app in org my-org / space my-space as my-user...
FAILED
Server error, status code: 502, error code: 0, message:
Because of the missing DNS record, the Cloud Controller could not reach Bits Service to access its blobstore. It was therefore unable to perform operations requiring access to buildpacks, droplets, appstashes, packages and the buildpack cache.
Which raises the question, "Why was cf delete affected but not cf push?"
The answer rests on a number of factors:
cf push is handled by the API instance
cf delete is offloaded by the API instance to one of the four Cloud Controller (CC) worker instances
bosh deploy to come to a screeching halt.
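The split between work handled inline by the API instance and work offloaded to workers can be sketched as a toy model. This is not Cloud Controller's actual code: the class names, the job strings, and the assumption that only the workers fail to reach the blobstore are all hypothetical, chosen purely to show how a fault on the worker path can surface for one command and not the other.

```python
from collections import deque


class BlobstoreUnreachable(Exception):
    """Raised when a component cannot reach its blobstore backend."""


class Worker:
    """Hypothetical stand-in for a Cloud Controller worker instance."""

    def __init__(self, blobstore_reachable):
        self.blobstore_reachable = blobstore_reachable

    def run(self, job):
        # Assumption: every offloaded job needs blobstore access.
        if not self.blobstore_reachable:
            raise BlobstoreUnreachable(f"cannot run {job!r}")
        return f"{job} done"


class Api:
    """Hypothetical stand-in for the API instance."""

    def __init__(self, workers):
        self.workers = workers
        self.queue = deque()

    def push(self, app):
        # Handled inline by the API instance itself.
        return f"push {app} done"

    def delete(self, app):
        # Offloaded: the job is queued for a worker to pick up later.
        self.queue.append(f"delete {app}")

    def drain(self):
        # Hand queued jobs to the workers in round-robin order.
        results = []
        turn = 0
        while self.queue:
            job = self.queue.popleft()
            worker = self.workers[turn % len(self.workers)]
            turn += 1
            try:
                results.append(worker.run(job))
            except BlobstoreUnreachable as err:
                results.append(f"FAILED: {err}")
        return results
```

In this model, with four workers constructed as `Worker(blobstore_reachable=False)`, a call to `push` returns normally while a queued `delete` is reported as failed once `drain` runs.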
We updated the BOSH DNS records to include bits-service.service.cf.internal. We then redeployed PWS with the Bits Service job.
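A check along the following lines could catch a missing record before a deploy proceeds. It is a minimal sketch and not part of the PWS tooling: the function name `unresolved_hosts` and its injectable `resolve` parameter are hypothetical, and it assumes the check runs on a machine that uses the deployment's DNS, since a BOSH DNS name would not be expected to resolve elsewhere.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution.

    `resolve` defaults to socket.getaddrinfo; it is injectable
    (a hypothetical convenience) so the check can be exercised
    without real DNS.
    """
    missing = []
    for host in hostnames:
        try:
            # Port is None because only name resolution matters here.
            resolve(host, None)
        except socket.gaierror:
            missing.append(host)
    return missing
```

Calling `unresolved_hosts(["bits-service.service.cf.internal"])` returns an empty list when the name resolves and a list containing the name when it does not, so a deploy script could refuse to continue on a non-empty result.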
Users need to manually delete any apps that they were unable to delete during the outage.