From approximately Saturday 2018/04/07 04:08 UTC until Monday 2018/04/09 14:09 UTC, the RDS instance that provided the database for the PWS CredHub servers had exhausted its disk space, causing any service requiring access to CredHub to fail. This failure affected the following:
The CredHub RDS instance's 100GB of disk space was exhausted. Investigation showed that audit logging, specifically the request_audit_record table, consumes approximately 1.7 GB per day.
The image below displays the steady, nay, inexorable decline of free space on the RDS instance:
As a temporary solution, the database size was increased from 100GB to 150GB. At roughly 1.7 GB per day, the extra 50GB will be exhausted within a month.
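The one-month estimate follows directly from the figures above. A minimal sketch of the arithmetic, assuming growth stays linear at the observed rate of about 1.7 GB per day (the function name is illustrative and not part of any tooling used during the incident):

```python
def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Days until free space runs out, assuming constant linear growth."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / growth_gb_per_day


# Extra space gained by resizing the database from 100GB to 150GB.
extra_gb = 150 - 100

# Observed audit-log growth of roughly 1.7 GB per day.
print(round(days_until_full(extra_gb, 1.7), 1))  # prints 29.4
```

If audit logging grows faster than the observed rate, the extra space runs out sooner than this estimate.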
Aside from ensuring all databases are monitored for available disk space, we do not yet have a long-term solution; we are exploring solutions with the CredHub team.
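As one illustration of what disk-space monitoring could check, the sketch below estimates the fill rate from two free-space readings and flags a database when the projected time to exhaustion drops below a threshold. The readings, the function names, and the 30-day threshold are assumptions chosen for the example; they do not describe the monitoring actually deployed.

```python
from datetime import datetime, timedelta
from typing import Optional


def projected_days_left(
    t0: datetime, free0_gb: float, t1: datetime, free1_gb: float
) -> Optional[float]:
    """Project days until the disk is full from two free-space readings.

    Returns None when free space is flat or growing between the readings.
    """
    elapsed_days = (t1 - t0).total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("t1 must be later than t0")
    consumed_per_day = (free0_gb - free1_gb) / elapsed_days
    if consumed_per_day <= 0:
        return None
    return free1_gb / consumed_per_day


def should_alert(days_left: Optional[float], threshold_days: float = 30.0) -> bool:
    """Alert when the projected time to exhaustion is under the threshold."""
    return days_left is not None and days_left < threshold_days


# Hypothetical readings: 50 GB free, then a week of ~1.7 GB/day growth.
start = datetime(2018, 4, 10)
later = start + timedelta(days=7)
days_left = projected_days_left(start, 50.0, later, 50.0 - 7 * 1.7)
print(should_alert(days_left))  # prints True
```

Alerting on projected time to exhaustion rather than on a fixed free-space floor gives earlier warning for a steadily growing table, at the cost of assuming the recent growth rate continues.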
dedicated-mysql-broker-admindeployment attempted and failed to create a deployment 17 times during the outage.
Config Server failed to generate value for '/path/to/password' with type 'password'. HTTP Code '500', Error: 'An application error occurred. Please contact your CredHub administrator.'
A typical error follows:
Expected to find variables: production_s3_publish_aws_key_id
- The CloudOps team lost its Push SLI (Service Level Indicator) metrics. During the outage, CloudOps was partially blind to PWS's health (pushing new apps), but app availability continued to be monitored.
We are unaware of any affected customer accounts.
We are unaware of any support requests opened as a result of this outage.