Inability to Launch On-Demand Broker Services
Incident Report for Pivotal Web Services

Abstract

From approximately Saturday 2018/04/07 04:08 UTC until Monday 2018/04/09 14:09 UTC, the RDS instance that provided the database for the PWS CredHub servers had no free disk space, causing any service requiring access to CredHub to fail. This failure affected the following:

  • PWS BOSH deploys using CredHub. This automatically includes any On-Demand Brokers that attempted to create service instances during the outage.
  • Concourse tasks/resources requiring access to CredHub.

Root Cause

The CredHub RDS instance's disk space of 100GB was exhausted. Investigation showed that audit logging, specifically the tables event_audit_record and request_audit_record, consumes approximately 1.7 GB per day.

The image below displays the steady, nay, inexorable decline of free space on the RDS instance:

[Image: screen shot 2018-04-09 at 3 56 46 pm]

Resolution

As a temporary solution, the database size was increased from 100GB to 150GB. This extra space will be exhausted within a month.
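The "within a month" estimate can be checked with a quick calculation. The sketch below is illustrative only: it assumes the audit tables keep growing at the reported 1.7 GB per day, that nothing else consumes or frees space, and that only the 50GB added by the resize is available as headroom (the original 100GB being full). The function name days_until_full is hypothetical and not part of any PWS or CredHub tooling.

```python
def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Estimate days until free_gb is consumed at a constant growth rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth_gb_per_day must be positive")
    return free_gb / growth_gb_per_day


# Figures taken from the report.
AUDIT_GROWTH_GB_PER_DAY = 1.7   # event_audit_record + request_audit_record
OLD_SIZE_GB = 100
NEW_SIZE_GB = 150

# Assumption: the disk was full at resize time, so only the added space is free.
added_gb = NEW_SIZE_GB - OLD_SIZE_GB

days = days_until_full(added_gb, AUDIT_GROWTH_GB_PER_DAY)
print(f"Added {added_gb} GB lasts about {days:.1f} days")  # about 29.4 days
```

At a constant 1.7 GB per day, the added 50GB lasts roughly 29 days, which is consistent with the statement above.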

Aside from ensuring that all databases are monitored for available disk space, we do not yet have a long-term solution; we are exploring options with the CredHub team.

Impact

  • The dedicated-mysql-broker-admin deployment attempted and failed to create a deployment 17 times during the outage, reporting the following error: Config Server failed to generate value for '/path/to/password' with type 'password'. HTTP Code '500', Error: 'An application error occurred. Please contact your CredHub administrator.'
  • Pivotal teams experienced Concourse jobs erroring; this only affected PWS Concourse tasks and resources which contained CredHub variables. A typical error follows: Expected to find variables: production_s3_publish_aws_key_id production_s3_publish_aws_secret_key
  • CloudOps team lost Push SLI (Service Level Indicators) metrics. During the outage, CloudOps was partially blind to PWS's health (pushing new apps), but app availability continued to be monitored.

Customers/Accounts Affected

We are unaware of any customer accounts affected.

Support Requests

We are unaware of any support requests opened as a result of this outage.

Posted Apr 10, 2018 - 20:35 PDT

Resolved
From approximately Saturday 2018/04/07 04:08 UTC until Monday 2018/04/09 14:09 UTC, PWS was unable to launch On-Demand Broker Services, which affected users attempting to launch such services. This did not affect applications already consuming On-Demand Services. Please contact support@run.pivotal.io if you are experiencing difficulty launching On-Demand Broker Services.
Posted Apr 09, 2018 - 14:17 PDT