Applications may become unresponsive to incoming requests due to a health check issue
Incident Report for Pivotal Web Services
Resolved
We have updated Garden and Diego with improved health checks. This issue is now resolved.
Posted 4 months ago. May 04, 2017 - 11:32 PDT
Monitoring
We believe that version 1.3.0 of Garden introduced code that prevented health checks from properly monitoring application instances. On Tuesday, April 25th, we rolled back to Garden version 1.2.0 while a permanent solution is being developed. We have not seen a return of the health check problems.
Posted 4 months ago. Apr 26, 2017 - 18:11 PDT
Identified
We are continuing to investigate the cause of the issue with the application health check processes. We have identified that it affects a very limited number of application instances, and we have put mitigation steps in place that limit the impact of the issue. However, if you determine that your application has been impacted by this issue, please restart your application and contact support@run.pivotal.io.
Posted 4 months ago. Apr 14, 2017 - 17:08 PDT
Update
We have identified that certain health check processes hang in such a way that, if the monitored application becomes unresponsive, the failure is not detected. Under these circumstances the unresponsive application will neither service requests nor be restarted. The gorouter of Pivotal Cloud Foundry will detect these non-serviced requests and reroute them to other instances if any are available. Therefore, we recommend that applications run at least two instances so that they continue to service requests. To determine whether your application is impacted, inspect its CPU utilization via the CLI or Apps Manager. Applications affected by this issue will show non-uniform CPU utilization across their instances: the impacted instance will show much lower CPU utilization than the normally functioning instances. We apologize for any inconvenience that this may have caused. If you have any questions, please contact support@run.pivotal.io.
Posted 4 months ago. Apr 12, 2017 - 17:03 PDT
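The CPU comparison described in the update above can be sketched in a few lines of Python. This is only an illustration, not a tool we provide: the function name `find_suspect_instances`, the `ratio` threshold of 0.25, and the sample values are all assumptions, and the per-instance CPU percentages are expected to be copied by hand from the CLI or Apps Manager.

```python
from statistics import median


def find_suspect_instances(cpu_percent, ratio=0.25):
    """Return indices of instances whose CPU is far below that of their peers.

    cpu_percent: per-instance CPU utilization values, in instance order.
    ratio: hypothetical threshold; an instance is flagged when its CPU is
           below ratio * (median CPU of the other instances).
    """
    values = list(cpu_percent)
    if len(values) < 2:
        return []  # a single instance has no peers to compare against

    suspects = []
    for index, value in enumerate(values):
        peers = values[:index] + values[index + 1:]
        baseline = median(peers)
        # Flag only when the peers are doing measurable work and this one is not.
        if baseline > 0 and value < ratio * baseline:
            suspects.append(index)
    return suspects


# Illustrative values only: instance 2 is nearly idle while its peers are busy.
suspects = find_suspect_instances([12.0, 11.5, 0.3])
```

A flagged instance is only a candidate for closer inspection. Low CPU can have other causes, such as uneven traffic, and an application with a single instance cannot be checked this way.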
Investigating
We are observing that some applications are becoming unresponsive after a crash, due to a suspected issue with application health checks. This occurs for applications that are deployed with a single instance. If you encounter applications that have become unresponsive, the mitigation is to restart the applications or scale up the number of instances to two or more. We apologize for any inconvenience that this may have caused. If you have any questions, please contact support@run.pivotal.io.
Posted 4 months ago. Apr 12, 2017 - 15:45 PDT