Degraded Pivotal Web Services API
Incident Report for Pivotal Web Services
Postmortem

Abstract

From roughly 18:13 UTC until 20:40 UTC on 2018/04/11, transaction time on the Cloud Controller (CC) was elevated. This caused timeouts on some requests to the CC API. The CC issue was resolved at 20:40 UTC, but over roughly the next 90 minutes (until 22:15), Apps Manager was stopped several times while changes were made to avoid placing undue pressure on the Cloud Controller.

Root Cause

A change to Apps Manager introduced long-running queries to the Cloud Controller database (CCDB). These queries increased load on the CCDB and caused transaction time on the Cloud Controller to increase substantially, which in turn caused requests to the Cloud Controller API (CAPI) to time out.

[Figure: capi_latency]

Resolution

The Apps Manager team disabled the updated feature. CloudOps and CAPI killed the remaining long-running queries on the Cloud Controller database. The load on the CCDB began dropping at 20:39 UTC and had returned to normal by 20:44 UTC. Cloud Controller transaction time had returned to normal by 20:40 UTC.
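This report does not say how the remaining long-running queries were identified, and it does not name the database engine. As a minimal sketch only, the selection step could look something like the following, assuming each running query is available as a record with an ID, an owning account, and an elapsed time. The RunningQuery type, the find_long_running function, the 60-second threshold, and the sample data are all hypothetical and are not taken from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningQuery:
    """One entry from a database's list of running queries (hypothetical shape)."""
    query_id: int
    user: str
    seconds_running: float
    statement: str


def find_long_running(queries, threshold_seconds=60.0, user=None):
    """Return queries running longer than the threshold, longest first.

    If `user` is given, only that account's queries are considered, so
    unrelated workloads are not selected by mistake.
    """
    selected = [
        q for q in queries
        if q.seconds_running > threshold_seconds
        and (user is None or q.user == user)
    ]
    return sorted(selected, key=lambda q: q.seconds_running, reverse=True)


# Illustrative data only; these values are assumptions, not incident data.
sample = [
    RunningQuery(101, "svc_a", 412.0, "SELECT ..."),
    RunningQuery(102, "svc_b", 0.3, "SELECT ..."),
    RunningQuery(103, "svc_a", 95.5, "SELECT ..."),
]

if __name__ == "__main__":
    for q in find_long_running(sample, threshold_seconds=60.0):
        print(q.query_id, q.user, q.seconds_running)
```

Sorting longest first means the most expensive candidates are reviewed before anything is terminated. The termination itself is deliberately left out, since it depends on the database engine in use.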

For the next 90 minutes, the Apps Manager team experimented with fixes, ultimately disabling events proxying and turning polling off.

Impact

  • Customers were unable to access Apps Manager.
  • API response latency spiked for all CC API requests.
  • A subset of requests to the Cloud Controller API timed out.
  • Access to Apps Manager was intermittent while the issue was being resolved.

Support Requests

At 20:02 UTC, a customer opened a ticket because they were unable to log into Apps Manager.

Posted May 14, 2018 - 10:06 PDT

Resolved
This incident has been resolved.
Posted Apr 11, 2018 - 15:31 PDT
Investigating
Pivotal Web Services API is experiencing degraded performance, which is causing an outage for Apps Manager. Investigation underway.
Posted Apr 11, 2018 - 13:40 PDT