Posted to commits@cassandra.apache.org by "Kevin Gallardo (Jira)" <ji...@apache.org> on 2020/04/03 19:11:00 UTC

[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries

    [ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074827#comment-17074827 ] 

Kevin Gallardo edited comment on CASSANDRA-15642 at 4/3/20, 7:10 PM:
---------------------------------------------------------------------

[~benedict] thanks for having a look at the ticket!

bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information

I would argue that, right now, a user wouldn't even be aware that the information is not reliable; it was only by digging into the server code that I realized this is happening. The error message tells you "received X errors and X responses", and there is no external indication to the user/client that the information returned is not complete/reliable.

Additionally, given these findings, I don't know how the client would be able to properly act on the information, quickly or not, if it is not reliable. I am not sure that giving partial info quickly is better than giving cohesive info.

bq. It is possible to simply report consistent information

I am not sure how that would be possible without waiting for all the responses to come back or time out, but I'm happy to have it explained if I am missing something.


> Inconsistent failure messages on distributed queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-15642
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15642
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination
>            Reporter: Kevin Gallardo
>            Priority: Normal
>
> As a follow-up to some exploration I have done for CASSANDRA-15543, I noticed the following behavior in both {{ReadCallback}} and {{AbstractWriteResponseHandler}}:
>  - wait for responses
>  - when the required number of responses has come back: unblock the wait
>  - when a single failure happens: unblock the wait
>  - when unblocked, check whether the failure counter is > 0 and, if so, return an error message based on the {{failures}} map that has been filled
> Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure.
> In case of a Write/ReadFailure, the user will get back an error looking like the following:
> "Failure: Received X responses, and Y failures"
> (if this behavior I describe is incorrect, please correct me)
> This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate.
> (note: I am not entirely sure of the behavior in case of timeouts for now)
> For example, for a request at CL = QUORUM = 3, a failed response may come back first, then a successful one, then another failure. If the exception is thrown fast enough, the error message could say
>  "Failure: Received 0 response, and 1 failure at CL = 3"
> Which:
> 1. doesn't make a lot of sense because the CL doesn't match the number of results in the message, so you end up thinking "what happened with the rest of the required CL?"
> 2. the information is incorrect. We did receive a successful response, only it came after the initial failure.
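
To make the scenario above concrete, here is a minimal sketch of a fail-fast handler. It is not Cassandra's actual code: {{FailFastHandler}}, its method names, and the message format are hypothetical stand-ins, and the replica callbacks are invoked sequentially to make the ordering deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FailFastSketch {
    // Hypothetical stand-in for a coordinator-side callback; NOT Cassandra's real class.
    static final class FailFastHandler {
        private final int blockFor;                       // responses required by the CL
        private final AtomicInteger responses = new AtomicInteger();
        private final AtomicInteger failures = new AtomicInteger();
        private String outcome;                           // fixed once the wait is unblocked

        FailFastHandler(int blockFor) { this.blockFor = blockFor; }

        void onResponse() {
            if (responses.incrementAndGet() >= blockFor) unblock();
        }

        void onFailure() {
            failures.incrementAndGet();
            unblock();                                    // a single failure unblocks the wait
        }

        // Takes a snapshot of the counters at the moment the wait is released.
        private synchronized void unblock() {
            if (outcome != null) return;                  // already unblocked, later events are ignored
            int r = responses.get();
            int f = failures.get();
            outcome = f > 0
                ? String.format("Failure: Received %d responses, and %d failures at CL = %d", r, f, blockFor)
                : "Success";
        }

        synchronized String outcome() { return outcome; }
        int responsesSeen() { return responses.get(); }
        int failuresSeen() { return failures.get(); }
    }

    public static void main(String[] args) {
        FailFastHandler handler = new FailFastHandler(3);
        handler.onFailure();   // first replica fails: the error message is built here
        handler.onResponse();  // second replica succeeds, but too late to be reported
        handler.onFailure();   // third replica fails, also too late
        System.out.println(handler.outcome());
        System.out.println("Actual totals: " + handler.responsesSeen()
                + " responses, " + handler.failuresSeen() + " failures");
    }
}
```

In this sketch the reported message counts 0 responses and 1 failure, while the counters end at 1 response and 2 failures, which is the mismatch described in the two points above.
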
> From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. The only information users should extract from it is that at least 1 node has failed.
> For a big improvement in usability, the {{ReadCallback}} and {{AbstractWriteResponseHandler}} could instead wait for all responses to come back before unblocking the wait, or let it time out. This way, users would be able to have some trust in the information returned to them.
> Additionally, an error that happens first prevents a timeout from happening because the request fails immediately, and so it potentially hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we would also be able to tell whether failures happened *before* that timeout, and so have a more complete diagnostic, whereas currently you can't detect both errors at the same time.
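
The proposal could look roughly like the sketch below. This is only an illustration under assumptions, not a patch: {{WaitForAllHandler}}, the {{contacted}} replica count, and the message format are hypothetical. The wait is released once enough successes arrive or once every contacted replica has answered, and otherwise it times out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class WaitForAllSketch {
    // Hypothetical handler illustrating the proposal; NOT Cassandra's real class.
    static final class WaitForAllHandler {
        private final int blockFor;    // responses required by the CL
        private final int contacted;   // replicas the request was sent to (assumed known)
        private final AtomicInteger responses = new AtomicInteger();
        private final AtomicInteger failures = new AtomicInteger();
        private final CompletableFuture<Void> done = new CompletableFuture<>();

        WaitForAllHandler(int blockFor, int contacted) {
            this.blockFor = blockFor;
            this.contacted = contacted;
        }

        void onResponse() {
            int r = responses.incrementAndGet();
            // Success can still unblock early; otherwise wait until everyone has answered.
            if (r >= blockFor || r + failures.get() >= contacted) done.complete(null);
        }

        void onFailure() {
            int f = failures.incrementAndGet();
            // A failure alone no longer unblocks the wait.
            if (f + responses.get() >= contacted) done.complete(null);
        }

        String await(long timeout, TimeUnit unit) throws InterruptedException {
            boolean timedOut = false;
            try {
                done.get(timeout, unit);
            } catch (TimeoutException e) {
                timedOut = true;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e); // unexpected: the future only completes normally
            }
            int r = responses.get();
            int f = failures.get();
            if (r >= blockFor) return "Success";
            return String.format("%s: Received %d responses, and %d failures at CL = %d",
                    timedOut ? "Timeout" : "Failure", r, f, blockFor);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WaitForAllHandler all = new WaitForAllHandler(3, 3);
        all.onFailure();
        all.onResponse();
        all.onFailure();       // last replica answers: the wait is released with full counts
        System.out.println(all.await(50, TimeUnit.MILLISECONDS));

        WaitForAllHandler slow = new WaitForAllHandler(3, 3);
        slow.onFailure();
        slow.onResponse();     // the third replica never answers, so the wait times out
        System.out.println(slow.await(50, TimeUnit.MILLISECONDS));
    }
}
```

In the second case the timeout message still carries the failure seen before the timeout, which is the more complete diagnostic described above. The cost is that a failing request is reported later than with the fail-fast approach.
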


