On November 4, 2019, Apigee Integrated Developer Portals experienced a 75-minute outage. We sincerely apologize, and we are taking immediate steps to improve the platform’s performance and availability.
On November 4, from 16:09 to 17:24 US/Pacific, 100% of connections to Integrated Developer Portals timed out or were rejected with Gateway Timeout errors. From 17:24 to 18:04 US/Pacific, all Integrated Developer Portals were placed in read-only mode: the sites were reachable and viewable, but changes to portals could not be made. Users of Drupal Developer Portals were not affected.
An unexpected spike in requests, roughly two orders of magnitude above baseline traffic, bypassed our Denial of Service filtering and exhausted the database thread pool. As a result, portal compute instances marked themselves as DEAD in the monitoring dashboards.
This event lasted much longer than we consider acceptable. Although the incident was identified immediately at its onset, we did not formulate a clear path to mitigation for over 45 minutes. Our engineers have identified several areas for improvement to prevent a recurrence. We are tightening our DoS protections to prevent another service disruption. We are also changing the way our servers handle database transactions so that we do not exhaust the thread pool in the future. In addition to these technical changes, we have improved our internal communications processes, as well as our logging and monitoring, so that we can identify the root cause of an incident more quickly in the future.
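For readers who want a concrete picture of this failure mode, the sketch below shows one common way to keep a traffic spike from exhausting a fixed database thread pool: cap the number of concurrent database operations and reject excess requests immediately instead of letting them queue until they time out. It is an illustration only and does not describe the actual Apigee implementation; the names `BoundedDbGate`, `PoolSaturated`, and `handle_request`, along with the 503 response, are hypothetical.

```python
import threading
from contextlib import contextmanager


class PoolSaturated(Exception):
    """Raised when no database slot is free (hypothetical error type)."""


class BoundedDbGate:
    """Caps concurrent database work and rejects excess requests at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: fail fast instead of queueing behind a full pool.
        if not self._slots.acquire(blocking=False):
            raise PoolSaturated("database concurrency limit reached")
        try:
            yield
        finally:
            # Always free the slot, even if the query raised.
            self._slots.release()


def handle_request(gate, query):
    """Hypothetical handler: returns a (status_code, body) pair."""
    try:
        with gate.slot():
            return 200, query()
    except PoolSaturated:
        # Shed load with a quick 503 rather than a slow gateway timeout.
        return 503, "busy, retry later"
```

With a gate of this kind, requests beyond the configured limit receive a fast error that clients can retry, while requests already holding a slot continue to complete normally.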
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.