Degraded performance on the public access nodes
Incident Report for Flow
Postmortem

Hello,

On Thursday last week (April 18), there was an issue with the Access API. Clients began experiencing very high response times for most API requests on both the gRPC and REST interfaces. For many requests, the response time increased by more than 4x.

The issue was subsequently identified and resolved. It did not affect the core protocol and was isolated to the public access nodes.

Root Cause

The issue was caused by a feature recently rolled out to the public access nodes. The feature optimized the GetTransactionResult API call by letting the access node serve requests from transaction results stored on its local disk, instead of forwarding each request to an execution node as it did previously. Under load, however, this appears to have caused resource contention, and API calls started to back up. To make matters worse, the access nodes then fell behind on syncing collections, which caused other API calls to fail or be delayed. The issue is documented in detail here: https://github.com/onflow/flow-go/issues/5747.

Solution

The feature was disabled, and the response time for all API calls improved almost immediately.
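
As a rough illustration of the change and its rollback, the sketch below routes a transaction result lookup either to local storage or to an execution node, depending on a flag. It is not the flow-go implementation: the class, the flag name and both callables are hypothetical, and the real code paths are described in the linked issue.

```python
from typing import Callable, Optional


class TransactionResultReader:
    """Hypothetical router for GetTransactionResult-style lookups."""

    def __init__(
        self,
        read_local: Callable[[str], Optional[str]],
        fetch_from_execution_node: Callable[[str], str],
        local_reads_enabled: bool,
    ) -> None:
        # Both callables are stand-ins (assumptions), not real flow-go APIs.
        self._read_local = read_local
        self._fetch_remote = fetch_from_execution_node
        self.local_reads_enabled = local_reads_enabled

    def get_transaction_result(self, tx_id: str) -> Optional[str]:
        if self.local_reads_enabled:
            # New behavior: serve the result from the node's local disk.
            return self._read_local(tx_id)
        # Previous behavior: forward the request to an execution node.
        return self._fetch_remote(tx_id)
```

In this sketch, disabling the feature amounts to setting local_reads_enabled to False, which sends every lookup back down the previous path without a code change.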

Next Steps

We have identified several action items to help prevent such an incident in the future.

  1. Better alerting around the response time - This is tricky, since response time can vary widely with the type of API call, network conditions, and other factors. However, we and our node partners have some ideas for alerting when response times stay exceptionally high for a sustained period, as they did this time.
  2. Set up better channels of communication with the access node operators.
  3. Improve the feature by working through the takeaways mentioned in the issue.
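
One possible shape for the alerting in item 1 is sketched below. It keeps a separate baseline per API call, since response times differ by call type, and fires only when every sample in a recent window exceeds a multiple of that baseline. All names, the window size and the 4x factor are illustrative assumptions, not the alerting that was or will be deployed.

```python
from collections import deque
from typing import Deque, Dict


class SustainedLatencyAlert:
    """Fires when an API call stays far above its own baseline for a full window."""

    def __init__(self, baselines_ms: Dict[str, float], factor: float = 4.0, window: int = 5) -> None:
        # baselines_ms: assumed normal response time per API call, in milliseconds.
        self._baselines_ms = baselines_ms
        self._factor = factor
        self._window = window
        # One fixed-size history per API call; the oldest sample drops off automatically.
        self._recent: Dict[str, Deque[float]] = {
            name: deque(maxlen=window) for name in baselines_ms
        }

    def observe(self, api_call: str, latency_ms: float) -> bool:
        """Record one latency sample; return True if the alert should fire."""
        samples = self._recent[api_call]
        samples.append(latency_ms)
        threshold = self._baselines_ms[api_call] * self._factor
        # Require a full window so that a single slow request does not page anyone.
        return len(samples) == self._window and all(s > threshold for s in samples)
```

Each sample passed to observe would typically be an aggregate, such as a per-minute percentile, rather than a single request. Requiring the whole window to exceed the threshold trades detection speed for fewer false alarms from short spikes.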
Posted Apr 24, 2024 - 23:15 UTC

Resolved
The incident has been resolved.
Posted Apr 18, 2024 - 22:11 UTC
Investigating
Clients are experiencing high response times for requests submitted to the public access nodes.
Posted Apr 18, 2024 - 16:57 UTC
This incident affected: Flow mainnet Access APIs (GRPC API, GRPC Web API, REST API).