Posted to commits@cassandra.apache.org by "Peter Schuller (Created) (JIRA)" <ji...@apache.org> on 2011/10/02 14:06:34 UTC

[jira] [Created] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

a node whose TCP connection is not up should be considered down for the purpose of reads and writes
---------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-3294
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller


Cassandra fails to handle the simplest of cases intelligently - a process gets killed and the TCP connection dies. I cannot see a good reason to wait for a bunch of RPC timeouts and thousands of hung requests to realize that we shouldn't be sending messages to a node when the only possible means of communication is confirmed down. This is why one has to "disablegossip and wait for a while" to restart a node on a busy cluster (especially without CASSANDRA-2540, but that only helps under certain circumstances).

A more generalized approach, whereby one factors in e.g. the number of currently outstanding RPC requests to a node, would likely take care of this case as well. But until such a thing exists and works well, it seems prudent to handle this very common and controlled form of "failure" better.

Are there difficulties I'm not seeing?

I can see that one may want to distinguish considering something "really down" (and e.g. failing a repair because of it) from what I'm talking about, so maybe there are different concepts being conflated (say, "currently unreachable" rather than "down"). But in the specific case of sending reads/writes to a node we *know* we cannot talk to, the current behavior seems unnecessarily detrimental.
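
As a minimal sketch of the routing change being asked for here, using an invented ConnectionRegistry rather than any actual Cassandra API, the coordinator could simply drop replicas whose outbound connection is known to be down before dispatching a request:

{code}
// Minimal sketch of the idea, NOT actual Cassandra code: before dispatching a
// read/write, skip replicas whose outbound TCP connection is known to be down.
// ConnectionRegistry and isEstablished() are hypothetical names.
import java.net.InetAddress;
import java.util.List;
import java.util.stream.Collectors;

class ReachabilityFilter
{
    interface ConnectionRegistry
    {
        boolean isEstablished(InetAddress endpoint); // hypothetical accessor
    }

    static List<InetAddress> reachableOnly(List<InetAddress> replicas, ConnectionRegistry connections)
    {
        // A replica without a live connection is skipped for request routing,
        // even if gossip has not (yet) marked it down.
        return replicas.stream()
                       .filter(connections::isEstablished)
                       .collect(Collectors.toList());
    }
}
{code}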


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217235#comment-13217235 ] 

Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

bq. How about we assign a probability of being alive to each of the nodes in the ring

This sounds like reinventing the existing failure detector to me.
                

        

[jira] [Assigned] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Peter Schuller (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Schuller reassigned CASSANDRA-3294:
-----------------------------------------

    Assignee: Peter Schuller
    

        

[jira] [Issue Comment Edited] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Peter Schuller (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217415#comment-13217415 ] 

Peter Schuller edited comment on CASSANDRA-3294 at 2/27/12 7:21 PM:
--------------------------------------------------------------------

{quote}
This sounds like reinventing the existing failure detector to me.
{quote}

Except we don't use it that way at all (see CASSANDRA-3927). Even if we did, though, I personally think it's totally the wrong solution to this problem, since we have the *perfect* measurement - whether the TCP connection is up.

It's fine if we have other information that actively indicates we shouldn't send messages to a node (whether it's the FD or the fact that we have 500 000 messages queued to it), but if we *know* the TCP connection is down, we should just not send messages to it, period. The only caveat is that we'd of course have to make sure TCP connections are in fact proactively kept up under all circumstances (I'd have to look at the code to figure out what issues there are, if any, in detail).

{quote}
The main idea of the algorithm I mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability of being alive, as determined by live traffic going to them, instead of passively relying on the failure detector.
{quote}

I have an unfiled ticket suggesting we make the proximity sorting probabilistic, to avoid the binary "either we get traffic or we don't" (or "either we get data or we get digest") situation. That would certainly help. As would least-requests-outstanding.
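
As a rough sketch of what probabilistic proximity sorting could look like (the per-endpoint scores and every name below are illustrative assumptions, not existing code):

{code}
// Illustrative only: choose the data-read replica at random, weighted by a
// per-endpoint score (assumed positive), instead of always picking the single
// "closest" node.
import java.net.InetAddress;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class WeightedReplicaChoice
{
    static InetAddress pickDataReplica(List<InetAddress> replicas, Map<InetAddress, Double> score)
    {
        if (replicas.isEmpty())
            throw new IllegalArgumentException("no replicas to choose from");

        double total = replicas.stream().mapToDouble(ep -> score.getOrDefault(ep, 1.0)).sum();
        double roll = ThreadLocalRandom.current().nextDouble(total);
        for (InetAddress ep : replicas)
        {
            roll -= score.getOrDefault(ep, 1.0);
            if (roll <= 0)
                return ep;
        }
        return replicas.get(replicas.size() - 1); // guard against rounding
    }
}
{code}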

You could certainly make this ticket irrelevant by making the general case well-supported enough that there is no reason to special-case this. It was originally filed because we had none of that (and we still don't), and the TCP connection being actively reset by the other side seemed like a very easy case to handle.

{quote}
After reading CASSANDRA-3722 it seems we can implement the required logic at the snitch level, taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue the work/discussion there. What do you think, Brandon, Peter?
{quote}

I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate "X is happening"), but towards the end I start agreeing with the ticket more.

In any case, feel free to close if you want. If I ever get to actually implementing this (if at that point there is no other mechanism to remove the need) I'll just re-file or re-open with a patch. We don't need to track this if others aren't interested.
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119054#comment-13119054 ] 

Jonathan Ellis commented on CASSANDRA-3294:
-------------------------------------------

What do you suggest?  TCP connection death isn't synonymous with process death.
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Pavel Yaskevich (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217398#comment-13217398 ] 

Pavel Yaskevich commented on CASSANDRA-3294:
--------------------------------------------

After reading CASSANDRA-3722 it seems we can implement the required logic at the snitch level, taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue the work/discussion there. What do you think, Brandon, Peter?
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119083#comment-13119083 ] 

Peter Schuller commented on CASSANDRA-3294:
-------------------------------------------

Right. What I am saying is that for the purpose of picking nodes to send reads/writes to, it doesn't make sense to pick nodes that we know for a fact we cannot communicate with. As I indicated, I'm afraid this may not translate directly to gossip up/down state, which may have other implications. But, in the most extreme case, if I'm connecting to a coordinator and submitting a read at CL.ONE, I don't want the coordinator to opt to send it to a node it knows for a fact it is currently unable to communicate with, causing RPC timeouts until rpc_timeout seconds have passed (more or less) and the dynamic snitch has figured things out.

Although... the more I think about it, maybe it's better to just go for more actively weighting outstanding RPC requests instead.

Basically, from what I've observed in high-throughput (lots of queries) clusters, most day-to-day "hiccups" that spill over to applications tend to be one of:

* Process got stopped without disablegossip+wait, or got killed, etc.
* Process had a temporary hiccup (e.g., GC pause, networking glitch, sudden burst of streaming from another node causing a short duration of disk bottlenecking)
* Process is legitimately overloaded for a while, e.g. due to disk I/O, and despite dynamic snitching there is an annoyingly significant impact on the cluster, likely resulting from the periodic dynamic snitch reset.

All of these, in addition to many other "soft" failure modes that add up to "node is slow", should be helped quite a lot by significantly weighting the outstanding request count when picking nodes. I'm not necessarily suggesting least-used flat out, and I'm aware that one could introduce foot-shooting here, and that there are some performance vs. responsiveness concerns.

In the end, the goal is often not really to maximize throughput, but rather to keep a decent latency consistently. Not necessarily the lowest possible average latency, but avoiding extreme outliers. Particularly when application code is not good at dealing with concurrency spikes (e.g., thread limits, process limits, whatever).

If we get a read and know that we have 50 requests outstanding to the node that is closest according to the snitch, but 0 outstanding to the others, we shouldn't be adding another request onto that...

Thoughts? Am I trying to make the problem simpler than it is?
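
For concreteness, a hedged sketch of the least-outstanding-requests selection described above; the outstanding counters and the snitch-sorted replica list are assumed inputs, not existing Cassandra APIs:

{code}
// Sketch only: prefer the replica with the fewest in-flight requests,
// falling back to snitch proximity order (closest-first) to break ties.
import java.net.InetAddress;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class LeastOutstandingSelector
{
    static InetAddress pick(List<InetAddress> proximityOrdered, Map<InetAddress, Integer> outstanding)
    {
        return proximityOrdered.stream()
                .min(Comparator.comparingInt((InetAddress ep) -> outstanding.getOrDefault(ep, 0))
                               .thenComparingInt(proximityOrdered::indexOf))
                .orElseThrow(() -> new IllegalStateException("no live replicas"));
    }
}
{code}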

                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Pavel Yaskevich (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217259#comment-13217259 ] 

Pavel Yaskevich commented on CASSANDRA-3294:
--------------------------------------------

The main idea of the algorithm I mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability of being alive, as determined by live traffic going to them, instead of passively relying on the failure detector.
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Pavel Yaskevich (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217196#comment-13217196 ] 

Pavel Yaskevich commented on CASSANDRA-3294:
--------------------------------------------

How about we assign a probability of being alive to each of the nodes in the ring (starting from a uniform distribution); with each failure (e.g. an RPC/Gossiper communication error) we would decrease the node's alive probability by a constant factor, and increase it by another constant factor when communication succeeds. That would allow us to compute the endpoint with the highest alive probability (and sort all the others) for the sub-group returned by SS.getLiveNaturalEndpoints(String, RingPosition). What do you think?
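
A tiny sketch of that multiplicative update, with the factors and names chosen purely for illustration:

{code}
// Illustrative sketch of the proposal above: every endpoint starts with an
// "alive" probability of 1.0; failures shrink it by a constant factor and
// successes grow it back (capped at 1.0). Endpoints could then be sorted by
// this value when choosing where to send requests.
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class AliveProbability
{
    private static final double FAILURE_FACTOR = 0.7;  // shrink on RPC/gossip error
    private static final double SUCCESS_FACTOR = 1.2;  // grow on successful exchange
    private final ConcurrentMap<InetAddress, Double> alive = new ConcurrentHashMap<>();

    void recordFailure(InetAddress ep)
    {
        alive.merge(ep, FAILURE_FACTOR, (p, ignored) -> p * FAILURE_FACTOR);
    }

    void recordSuccess(InetAddress ep)
    {
        alive.merge(ep, 1.0, (p, ignored) -> Math.min(1.0, p * SUCCESS_FACTOR));
    }

    double get(InetAddress ep)
    {
        return alive.getOrDefault(ep, 1.0); // uniform/optimistic starting point
    }
}
{code}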
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217261#comment-13217261 ] 

Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

I see.  We can do that by sorting on the current phi scores, but we'd need to respect the badness threshold for those doing replica pinning.  Sounds like we're starting to bump up against CASSANDRA-3722 here.
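
Loosely, and with all names below invented rather than taken from the snitch code, that could look something like this:

{code}
// Rough sketch: order replicas by failure-detector phi (lower = healthier),
// but keep the pinned first replica unless an alternative beats it by more
// than a badness threshold, so replica pinning is still respected.
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

class PhiOrdering
{
    static List<InetAddress> order(List<InetAddress> pinnedOrder,
                                   ToDoubleFunction<InetAddress> phi,
                                   double badnessThreshold)
    {
        List<InetAddress> byPhi = new ArrayList<>(pinnedOrder);
        byPhi.sort(Comparator.comparingDouble(phi));

        InetAddress pinned = pinnedOrder.get(0);
        InetAddress best = byPhi.get(0);
        // Only abandon the pinned replica when the best alternative is
        // sufficiently better; otherwise keep the original order.
        if (best.equals(pinned)
            || phi.applyAsDouble(pinned) <= phi.applyAsDouble(best) * (1 + badnessThreshold))
            return pinnedOrder;
        return byPhi;
    }
}
{code}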
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Stu Hood (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119104#comment-13119104 ] 

Stu Hood commented on CASSANDRA-3294:
-------------------------------------

Narrowing the window before node A notices that node B is slow/dead is a good thing, but it is not possible to remove this window entirely. Instead, speculatively requesting the data from extra nodes should be an option (if not the default). CASSANDRA-2540 touches on this issue, but I made the mistake of conflating data vs. digest with speculation: even if we never remove digest reads, we should consider adding speculative reads.
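
A sketch of the speculative-read idea; the executor wiring and names are illustrative, not a proposed implementation:

{code}
// Sketch only: send the read to the preferred replica, and if no response
// arrives within a short delay, also send it to the next replica and take
// whichever answers first.
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class SpeculativeRead<T>
{
    private final ExecutorService executor = Executors.newCachedThreadPool();

    T read(Supplier<T> primary, Supplier<T> fallback, long speculateAfterMillis)
            throws InterruptedException, ExecutionException
    {
        CompletionService<T> completion = new ExecutorCompletionService<>(executor);
        completion.submit(primary::get);

        Future<T> first = completion.poll(speculateAfterMillis, TimeUnit.MILLISECONDS);
        if (first != null)
            return first.get();

        // Primary is slow: speculate against the fallback replica and use
        // whichever of the two requests completes first.
        completion.submit(fallback::get);
        return completion.take().get();
    }
}
{code}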
                

        


[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217437#comment-13217437 ] 

Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

bq. I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate "X is happening"), but towards the end I start agreeing with the ticket more.

I agree; the original premise there was jumping the gun a bit with a solution, but I think we ultimately end up in very similar places.
                

        

[jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217243#comment-13217243 ] 

Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

One thing that occurs to me here is that the FD is sort of a one-way device: we can send it hints that something is alive, but we can't send it hints that something is dead.  Thus, the only way a node can be marked down is by its phi accruing over time as heartbeats stop arriving.  If we added the ability to negatively affect the phi directly (TCP connection isn't present, or has been refused, etc.), this could speed up failure detection considerably.
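
Sketching that negative-hint idea with invented names (not the actual IFailureDetector interface):

{code}
// Hedged sketch: alongside the heartbeat-driven report(ep), let
// connection-level events push an endpoint toward being convicted
// immediately. The interface and forceConviction call are hypothetical.
import java.net.InetAddress;

interface HintableFailureDetector
{
    void report(InetAddress ep);          // positive hint: heartbeat/gossip seen
    void forceConviction(InetAddress ep); // negative hint: e.g. TCP connection refused/reset
    boolean isAlive(InetAddress ep);
}

class ConnectionObserver
{
    private final HintableFailureDetector fd;

    ConnectionObserver(HintableFailureDetector fd)
    {
        this.fd = fd;
    }

    void onConnectionClosed(InetAddress peer, boolean refused)
    {
        // A refused or reset connection is hard evidence, so convict right
        // away instead of waiting for heartbeats to age out.
        if (refused)
            fd.forceConviction(peer);
    }
}
{code}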
                