Error Handling
Handling errors in a distributed system is a complex topic. Ideally, we must understand every possible source of failure and determine the appropriate action for each. This section explains the known failure modes.
Request Execution Errors
Below is a diagram that explains request execution at a high level:
Requests resulting in host errors are automatically retried on other hosts. If
no other hosts are present in the load balancing plan, a Cassandra::Errors::NoHostsAvailable
is raised that contains a map from each host to the error seen on that host
during the request.
Additionally, if an empty load balancing plan is returned by the load balancing policy, the request will not be attempted on any hosts.
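The host failover described above can be illustrated with a small pure-Ruby simulation. This is only a sketch and not the driver's implementation: SketchHostError, SketchNoHostsAvailable and execute_with_plan are hypothetical stand-ins, and raising on an empty plan (with an empty error map) is an assumption made for the example.

```ruby
# Stand-in error types for this sketch; they are NOT the driver's classes.
class SketchHostError < StandardError; end

class SketchNoHostsAvailable < StandardError
  # { host => error } for every host that was tried
  attr_reader :errors

  def initialize(errors)
    @errors = errors
    super("no hosts available (#{errors.size} host(s) tried)")
  end
end

# Tries the request on each host of the plan, in order. A host error moves
# on to the next host. When the plan is exhausted, or was empty to begin
# with, the collected errors are raised together.
def execute_with_plan(plan)
  errors = {}
  plan.each do |host|
    begin
      return yield(host)
    rescue SketchHostError => e
      errors[host] = e
    end
  end
  raise SketchNoHostsAvailable.new(errors)
end

# The first host fails, so the request is answered by the second one.
result = execute_with_plan(%w[10.0.0.1 10.0.0.2]) do |host|
  raise SketchHostError, "connection refused" if host == "10.0.0.1"
  "rows from #{host}"
end
puts result
```

Keeping the per-host errors in a map, rather than only the last one, lets the caller see why each host in the plan was rejected.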
Whenever a cluster error occurs, the retry policy is used to decide whether to re-raise the error, retry the request at a different consistency, or ignore the error and return an empty result.
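The three possible outcomes of a retry decision can be sketched in plain Ruby as follows. Nothing here is the driver's API: SketchClusterError, execute_with_retry_policy and the DOWNGRADE_ONCE policy are hypothetical names, and modelling a policy as a lambda that returns an action (plus an optional consistency) is an assumption made to keep the example short.

```ruby
# Stand-in for a cluster-level error; NOT one of the driver's classes.
class SketchClusterError < StandardError; end

# Hypothetical policy: retry once at consistency :one, then re-raise.
# A policy returns [:retry, consistency], [:ignore] or [:reraise].
DOWNGRADE_ONCE = lambda do |_error, _consistency, retries|
  retries.zero? ? [:retry, :one] : [:reraise]
end

# Runs the block and consults the policy whenever a cluster error occurs.
def execute_with_retry_policy(policy, consistency)
  retries = 0
  begin
    yield(consistency)
  rescue SketchClusterError => e
    action, new_consistency = policy.call(e, consistency, retries)
    case action
    when :retry
      consistency = new_consistency
      retries += 1
      retry # run the request again at the new consistency
    when :ignore
      []    # swallow the error and return an empty result
    else
      raise # re-raise the original error to the application
    end
  end
end

# The request fails at :quorum and succeeds after being retried at :one.
rows = execute_with_retry_policy(DOWNGRADE_ONCE, :quorum) do |consistency|
  raise SketchClusterError, "not enough replicas" if consistency == :quorum
  ["row read at #{consistency}"]
end
puts rows.inspect
```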
Finally, all other request errors, such as validation errors, are returned to the application without retries.
Below are the top-level error classes defined in the Ruby Driver, classified by host, cluster and request types:
Connection Heartbeat
In addition to handling request execution errors and timeouts, the Ruby Driver periodically sends a heartbeat on each open connection to detect network outages and prevent stale connections from accumulating.
By default, the heartbeat interval is a conservative 30 seconds and the idle timeout is 1 minute; both values can be changed when constructing a cluster.
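As an illustration, the settings could be assembled as shown below. The option names heartbeat_interval and idle_timeout are assumed to match the cluster options of the driver version in use and should be checked against its documentation. The helper heartbeat_options and its sanity check are hypothetical additions for this sketch, and the values are arbitrary.

```ruby
# Hypothetical helper: builds the options hash for cluster construction.
def heartbeat_options(interval:, idle_timeout:)
  # Sanity check added for this sketch only: a connection should have time
  # to receive at least one heartbeat before it can be considered idle.
  unless idle_timeout > interval
    raise ArgumentError, "idle_timeout must be greater than the heartbeat interval"
  end

  { heartbeat_interval: interval, idle_timeout: idle_timeout }
end

# Heartbeat every 10 seconds; treat 30 seconds of silence as stale.
options = heartbeat_options(interval: 10, idle_timeout: 30)
puts options.inspect

# With the driver installed and a cluster reachable, the hash would be
# passed along when constructing the cluster:
#   require 'cassandra'
#   cluster = Cassandra.cluster(options)
```

Shorter values detect outages sooner at the cost of more heartbeat traffic on idle connections.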
Upon detecting a stale connection, the Ruby Driver automatically closes it and fails all outstanding requests with a host-level error, which forces them to be retried on other hosts as part of normal request execution.
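The stale connection handling described above can be modelled with a toy connection object. This is a simplified sketch, not how the driver tracks connections internally: SketchConnection, SketchHostError and their methods are hypothetical, and timestamps are passed in explicitly as plain seconds so the behaviour is easy to follow.

```ruby
# Stand-in for a host-level error; NOT one of the driver's classes.
class SketchHostError < StandardError; end

class SketchConnection
  attr_reader :closed

  def initialize(idle_timeout:, now:)
    @idle_timeout = idle_timeout
    @last_activity = now
    @pending = []
    @closed = false
  end

  # Any response, including a heartbeat reply, counts as activity.
  def record_activity(now)
    @last_activity = now
  end

  # Remembers a request that is still waiting for its response.
  def track(request_id)
    @pending << request_id
  end

  # Closes the connection once it has been silent for the idle timeout and
  # returns { request_id => error } for every request failed as a result.
  def check(now)
    idle_for = now - @last_activity
    return {} if @closed || idle_for < @idle_timeout

    @closed = true
    failed = @pending.to_h do |id|
      [id, SketchHostError.new("connection idle for #{idle_for}s")]
    end
    @pending.clear
    failed
  end
end

conn = SketchConnection.new(idle_timeout: 60, now: 0)
conn.track(:request_a)
conn.check(30)          # still within the idle timeout, nothing fails
failed = conn.check(61) # silent for too long: closed, :request_a fails
puts failed.keys.inspect
```

Because the failures are host-level errors, each failed request would then go through the same failover path as any other host error.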