Posted to commits@cassandra.apache.org by "Vijay (Created) (JIRA)" <ji...@apache.org> on 2011/11/26 02:54:39 UTC

[jira] [Created] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

TimeoutException when there is a firewall issue.
------------------------------------------------

                 Key: CASSANDRA-3533
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 1.0.4
            Reporter: Vijay
            Priority: Minor


When one node in the cluster is not able to talk to the other DC/RAC because of a firewall or network-related issue (StorageProxy calls fail), and the nodes are NOT marked down because at least one node in the cluster can still talk to the other DC/RAC, we get a TimeoutException instead of an UnavailableException.

The problems with this:
1) It is hard to monitor/identify these errors.
2) It is hard to differentiate from the client whether the node is bad or the query is bad.
3) When this issue happens we have to wait for at least the RPC timeout to know that the query won't succeed.

Possible Solution: when marking a node down we might want to check whether the node is actually alive by trying to communicate with it? Then we can be sure that the node is actually alive.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424048#comment-13424048 ] 

Brandon Williams commented on CASSANDRA-3533:
---------------------------------------------

I'll also note that yes, everything can be ok and UE will be thrown (the connection just hasn't established yet, but will on OTC's next attempt) but penalizing the client ~100ms to find out instead of just failing out and letting them try another coordinator seems like an improvement.
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.1.4
>
>         Attachments: 3533.txt

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157403#comment-13157403 ] 

Jonathan Ellis commented on CASSANDRA-3533:
-------------------------------------------

Alex Feinberg from Voldemort says:

bq. We had this situation by accident in production, when nodes in the other datacenter were firewalled off from clients in one cluster.  The way we deal with it is that our failure detector is local to each client and has a thread which keeps pinging each node it initially marked down to see if it has come back up. This is ThresholdFailureDetector, which inherits from AsyncRecoveryFailureDetector.
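
For illustration, the client-local approach described in the quote above could be sketched roughly as follows. This is not Voldemort's code: the class name, the ping callable and the interval are all hypothetical, and the sketch only shows the shape of "a background thread keeps pinging nodes that were marked down".

```python
import threading


class AsyncRecoveryDetector:
    """Hypothetical client-local detector; down nodes are re-pinged in the background."""

    def __init__(self, ping, interval=1.0):
        self._ping = ping            # assumed callable: node -> bool (True if it answered)
        self._interval = interval    # seconds between recovery sweeps
        self._down = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None

    def mark_down(self, node):
        with self._lock:
            self._down.add(node)

    def is_available(self, node):
        with self._lock:
            return node not in self._down

    def recover_once(self):
        """One sweep: ping every node currently marked down, revive those that answer."""
        with self._lock:
            candidates = list(self._down)
        for node in candidates:
            if self._ping(node):
                with self._lock:
                    self._down.discard(node)

    def start(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._interval):
                self.recover_once()

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

The point of the design is that the decision to bring a node back is made from the client's own vantage point, so a node that is firewalled off from this client stays down here even if other parties can reach it.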

                

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Vijay (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157587#comment-13157587 ] 

Vijay commented on CASSANDRA-3533:
----------------------------------

That's great, but in Cassandra we have the DSnitch, which can mark nodes down too. Will it make sense for us to poll just before we mark the node up (to double-check)?
I am not sure about the additional time we will need to confirm; is that reasonable?
                

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425487#comment-13425487 ] 

Brandon Williams commented on CASSANDRA-3533:
---------------------------------------------

To answer my own question, the problem is that a coordinator can be completely isolated from a given replica set and be forced to time out for any requests to it.

If we do poll via some kind of no-op message or something similar, it seems the main wrinkle will be knowing how long to wait before giving up on a reply.  We could estimate from the FD, except in cases where we have no data at all, which is most likely to be the common case.
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: 3533.txt

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157394#comment-13157394 ] 

Jonathan Ellis commented on CASSANDRA-3533:
-------------------------------------------

I'd be curious if any of the other Dynamo-derived systems (Voldemort, Riak, ?) attempt to deal with this.  It's not clear to me how we should try to handle incomplete network graphs (A can talk to B and to C, but C can't talk to B).
                

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425514#comment-13425514 ] 

Jonathan Ellis commented on CASSANDRA-3533:
-------------------------------------------

bq. the problem is that a coordinator can be completely isolated from a given replica set and be forced to time out for any requests to it

Right, and attempted requests do consume MessagingService resources until they time out.  So everyone's happier if we can avoid requests to unreachable nodes.

bq. the main wrinkle will be knowing how long to wait before giving up on a reply

As long as we're not marking it live until we get one, I don't think it matters.  Wait indefinitely (asynchronously) for all I care.
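
A minimal sketch of that idea, assuming a hypothetical asynchronous send_echo(node, on_reply) transport; none of these names come from the Cassandra code base. The node is only recorded as live when the reply callback fires, so there is no timeout to choose.

```python
class LivenessTracker:
    """Hypothetical tracker: a node becomes live only after it answers an echo."""

    def __init__(self, send_echo):
        self._send_echo = send_echo   # assumed callable: (node, on_reply) -> None
        self._live = set()
        self._pending = set()

    def on_gossip_says_alive(self, node):
        # Gossip alone is not trusted; ask the node directly, without blocking.
        if node in self._live or node in self._pending:
            return
        self._pending.add(node)
        self._send_echo(node, lambda: self._on_echo_reply(node))

    def _on_echo_reply(self, node):
        # Ignore replies for nodes that were convicted while the echo was in flight.
        if node in self._pending:
            self._pending.discard(node)
            self._live.add(node)

    def on_convicted(self, node):
        self._live.discard(node)
        self._pending.discard(node)

    def is_live(self, node):
        return node in self._live
```

If the reply never arrives (for example because a firewall drops the traffic), the node simply stays in the pending set and is never handed to the coordinator as a live target.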
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: 3533.txt

[jira] [Updated] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3533:
--------------------------------------

    Fix Version/s:     (was: 1.2.0)
                   1.3
    
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: 3533.txt

[jira] [Updated] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3533:
--------------------------------------

    Fix Version/s:     (was: 1.1.4)
                   1.2

bq. the connection just hasn't established yet, but will on OTC's next attempt

Is there anything forcing a next attempt though, besides gossip (1/N chance per round)?
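
To put a number on the "1/N chance per round" remark, here is a small worked calculation. It assumes, purely for illustration, that each gossip round contacts exactly one peer chosen uniformly at random from N candidates; real gossip target selection may differ.

```python
def prob_contacted_within(rounds, n_peers):
    """Chance that a specific peer is picked at least once in `rounds` rounds."""
    return 1.0 - (1.0 - 1.0 / n_peers) ** rounds


def expected_rounds_until_contact(n_peers):
    """Mean of the geometric distribution with success probability 1/N."""
    return float(n_peers)
```

Under that assumption, in a 100-node cluster a given peer has roughly a 63% chance of being contacted within 100 rounds, so relying on gossip alone to trigger the next connection attempt can take a while.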

bq.  Furthermore, in the case of natural, temporary partitions of this kind, there are some things we still want to retry instead of failing fast, like streaming

But you still have things like GC-based "flapping" that can cause FD to mark a node down over-pessimistically.  So I don't think I buy that this is an argument for not making FD more robust -- since we already have to deal with "FD is too pessimistic" for this case.

(Fundamentally though I don't think we'll get much mileage out of trying to second-guess FD, so I'd rather make FD as accurate as we can.  And I suspect that "StorageProxy uses FD-supplemented-by-X and the rest of the system uses normal FD" is going to cause weirdness.)

bq. we need to report new nodes in handleMajorStateChange that sends onJoin events, which cause the initial connection

I must be missing this -- as near as I can tell, a connection will be established when we try to send a message, and I don't see the "send message immediately on alive" [or on join] code.

bq. both of which are scary to put in a minor release

Agreed, retargeting to 1.2.
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: 3533.txt

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425410#comment-13425410 ] 

Brandon Williams commented on CASSANDRA-3533:
---------------------------------------------

bq. Is there anything forcing a next attempt though, besides gossip (1/N chance per round)?

Hmm, actually, no, I was mistaken there.

bq. But you still have things like GC-based "flapping" that can cause FD to mark a node down over-pessimistically. So I don't think I buy that this is an argument for not making FD more robust – since we already have to deal with "FD is too pessimistic" for this case.

I actually don't think, at least for this example, being overly pessimistic is an issue.  On a healthy network (0.3ms ping) it takes 18-19s for the FD to mark a host down with the default phi.  If the GC flapping is so bad it can't get a gossip change out in that time, the node probably _should_ be marked down.
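
As a rough sanity check of the 18-19s figure, assume an exponential model of heartbeat inter-arrival times, where phi = t / (mean * ln 10), a mean heartbeat interval of about 1s, and a conviction threshold of 8. These values are assumptions made for this calculation, not numbers stated in this thread.

```python
import math


def phi(seconds_since_last_heartbeat, mean_interval):
    """Suspicion level under an exponential inter-arrival model."""
    return seconds_since_last_heartbeat / (mean_interval * math.log(10))


def time_to_convict(threshold, mean_interval):
    """Silence (in seconds) needed before phi reaches the threshold."""
    return threshold * mean_interval * math.log(10)
```

With those assumptions, time_to_convict(8, 1.0) comes out at about 18.4 seconds, which is in line with the figure quoted above.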

bq. (Fundamentally though I don't think we'll get much mileage out of trying to second-guess FD, so I'd rather make FD as accurate as we can. And I suspect that "StorageProxy uses FD-supplemented-by-X and the rest of the system uses normal FD" is going to cause weirdness.)

You're probably right.  Let's take a step back and examine what we're trying to solve.  Node X can talk to Y and Y can talk to Z, but X and Z are partitioned and can't communicate; surrogate gossip traffic via Y nevertheless makes them both think they can.  The fallout from this is that they'll keep attempting to send messages (and thus connect) to each other.  In practice though, from a client perspective:

* writes will get ack'd by whichever replicas respond the fastest.  Assuming RF=3 and X being the coordinator, the fact that it wrote a local copy and Y responded is enough for everything but ALL.

* reads will get attempted against Z from X, and will have to time out.
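
The scenario above can be modelled in a few lines. This is a toy model, not Cassandra code: the link set, the function names and the consistency-level table are made up for illustration, with RF=3 and X acting as coordinator.

```python
NODES = ("X", "Y", "Z")
# Direct connectivity; the X-Z link is missing (firewalled).
LINKS = {frozenset(("X", "Y")), frozenset(("Y", "Z"))}
REQUIRED_ACKS = {"ONE": 1, "QUORUM": 2, "ALL": 3}  # for RF=3


def can_talk(a, b):
    return a == b or frozenset((a, b)) in LINKS


def believed_alive(observer):
    """Gossip state is relayed, so liveness spreads along any path of links."""
    seen, frontier = {observer}, [observer]
    while frontier:
        current = frontier.pop()
        for node in NODES:
            if node not in seen and can_talk(current, node):
                seen.add(node)
                frontier.append(node)
    return seen


def write_succeeds(coordinator, level):
    """A write is acked only by replicas the coordinator can reach directly."""
    acks = sum(1 for replica in NODES if can_talk(coordinator, replica))
    return acks >= REQUIRED_ACKS[level]
```

In this model X believes Z is alive (via Y) although it cannot reach it, writes coordinated by X succeed at ONE and QUORUM but not at ALL, and any read that X routes to Z has nothing to do but wait for the timeout.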

Now let's look at the read scenario in a post-1.2 world.  The dsnitch, after CASSANDRA-3722, will penalize Z in X's eyes much faster than pre-1.2 (and thus prevent dogpiling requests while waiting for the rpc timeout) and quit trying to use it (at least until the reset interval, when the process begins again).  But this is really no different than if Z _does_ suddenly die at such a level that the network route is a black hole (like force-suspending the JVM, which is how the dsnitch change was tested, and it worked well).

So I suppose my question is, what is the problem here we still need to solve?
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: 3533.txt

[jira] [Updated] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-3533:
----------------------------------------

    Attachment: 3533.txt

bq. That's great, but in Cassandra we have the DSnitch, which can mark nodes down too.

We can't actually mark a node down with the dsnitch; we can just choose not to use it (for a while).

bq. Will it make sense for us to poll just before we mark the node up (to double-check)?

I looked at doing this, and frankly integrating a check here is pretty scary and messy.  For instance we need to report new nodes in handleMajorStateChange that sends onJoin events, which _cause_ the initial connection, so to poll we'd have to change that, or make an extra connection, neither of which is very desirable to put in Gossiper and both of which are scary to put in a minor release, in my opinion.  Furthermore, in the case of natural, _temporary_ partitions of this kind, there are some things we still want to retry instead of failing fast, like streaming.

Instead, in the attached patch, I took a different, more coordinator-based approach: it requires that the FD report the node as alive and that there be a live outbound connection to the destination before a read/write is attempted; otherwise UE is thrown.
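
In outline, the check described above might look like the sketch below. It is not the attached patch: fd_alive and has_live_connection are hypothetical callables standing in for the failure detector and the outbound connection state, and UnavailableError stands in for UE.

```python
class UnavailableError(Exception):
    """Stand-in for UE: not enough usable replicas to attempt the request."""


def usable_replicas(replicas, required, fd_alive, has_live_connection):
    """Return replicas that are both FD-alive and directly connected, or fail fast."""
    usable = [
        replica
        for replica in replicas
        if fd_alive(replica) and has_live_connection(replica)
    ]
    if len(usable) < required:
        # Fail immediately instead of sending requests that can only time out.
        raise UnavailableError(
            "need %d replicas, only %d usable" % (required, len(usable))
        )
    return usable
```

The trade-off is that a replica whose connection simply has not been established yet is also treated as unusable, so the client may see an occasional UE and retry against another coordinator.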
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.1.4
>
>         Attachments: 3533.txt

[jira] [Updated] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3533:
--------------------------------------

    Affects Version/s:     (was: 1.0.4)
        Fix Version/s: 1.1.1
             Assignee: Vijay
    
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Vijay
>            Priority: Minor
>             Fix For: 1.1.1

[jira] [Assigned] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-3533:
-----------------------------------------

    Assignee: Brandon Williams  (was: Vijay)
    
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.1.3

[jira] [Commented] (CASSANDRA-3533) TimeoutException when there is a firewall issue.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260591#comment-13260591 ] 

Jonathan Ellis commented on CASSANDRA-3533:
-------------------------------------------

bq. Will it make sense for us to poll just before we mark the node up (to double-check)?

Sounds reasonable to me.  An extra round trip should be negligible.
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Priority: Minor
>             Fix For: 1.1.1