Posted to commits@cassandra.apache.org by "Blake Eggleston (JIRA)" <ji...@apache.org> on 2015/05/06 00:18:00 UTC
[jira] [Updated] (CASSANDRA-6702) Upgrading node uses the wrong port in gossiping

     [ https://issues.apache.org/jira/browse/CASSANDRA-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Blake Eggleston updated CASSANDRA-6702:
---------------------------------------
    Attachment: C6702-2.0.txt
                C6702-1.2.txt

I've attached a patch against 1.2, and one against 2.0 for when the fix is merged forward.

After reading through the connection code in IncomingTcpConnection (ITC) and OutboundTcpConnection (OTC) again, and failing to trigger it in an upgrading cluster, I don't think this code can ever be reached: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/net/IncomingTcpConnection.java#L126.

To reach that if statement, ITC would need to read the maxVersion from OTC on line 111. However, OTC performs the same check here: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L361, and disconnects before it can send its version to the ITC node on line 377.
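
The ordering argument above can be sketched as a toy model. Everything in it is hypothetical: HandshakeSketch, versionMismatch, and the integer versions are illustrative names rather than Cassandra's code, and the model assumes both sides evaluate the same predicate on the same two values.

```java
// Toy model of the reachability argument; not Cassandra's actual handshake code.
final class HandshakeSketch {
    // Hypothetical stand-in for the check that both connection classes perform.
    // Assumption: it depends only on the two versions being compared.
    static boolean versionMismatch(int targetVersion, int maxTargetVersion) {
        return targetVersion > maxTargetVersion;
    }

    // Returns true if the receiving side would ever execute its mismatch branch.
    static boolean receiverBranchReached(int targetVersion, int maxTargetVersion) {
        // Sender side: runs the check first and disconnects on a mismatch,
        // so it never sends its own version to the receiver.
        if (versionMismatch(targetVersion, maxTargetVersion)) {
            return false;
        }
        // Receiver side: only now reads the sender's version and runs the same
        // check, which the sender has already evaluated as false.
        return versionMismatch(targetVersion, maxTargetVersion);
    }
}
```

Under those assumptions, receiverBranchReached returns false for every pair of versions, which is the sense in which the block looks unreachable.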

I've replaced that block with an assertion in the patch, and updated the ReconnectableSnitchHelper to only skip reconnecting during an upgrade if it's talking to a node with a version < 1.2, which will fix this issue when upgrading from 1.2. The 2.0 patch removes the check in ReconnectableSnitchHelper altogether, since 2.0 doesn't support talking to pre-1.2 nodes.
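
As a rough illustration of the version rule described above (not the patch itself), the decision could look like the sketch below. ReconnectSketch, its method names, and the version ordinals are hypothetical, and the sketch ignores any other conditions the real helper applies before reconnecting.

```java
// Sketch of the reconnect decision described above; all names are hypothetical.
final class ReconnectSketch {
    // Illustrative ordinals only; not Cassandra's real messaging version constants.
    static final int VERSION_11 = 1;
    static final int VERSION_12 = 2;
    static final int VERSION_20 = 3;

    // 1.2 patch, as described: skip reconnecting only when the peer predates 1.2.
    static boolean shouldReconnect12(int peerVersion) {
        return peerVersion >= VERSION_12;
    }

    // 2.0 patch, as described: the version check is removed, because 2.0
    // does not talk to pre-1.2 nodes in the first place.
    static boolean shouldReconnect20(int peerVersion) {
        return true;
    }
}
```

With these placeholder values, shouldReconnect12(VERSION_11) is false and shouldReconnect12(VERSION_12) is true.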

> Upgrading node uses the wrong port in gossiping
> -----------------------------------------------
>
>                 Key: CASSANDRA-6702
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6702
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.1.7, AWS, Ec2MultiRegionSnitch
>            Reporter: Minh Do
>            Assignee: Blake Eggleston
>            Priority: Minor
>             Fix For: 2.0.x
>
>         Attachments: C6702-1.2.txt, C6702-2.0.txt
>
>
> When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and the non-encrypted port.
> For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner:
>    Same region: private IP and non-encrypted port 
>    and
>    Different region: public IP and encrypted port
> Because there is one node like this (or 2 nodes in a 12-node cluster whose nodes are split equally across 2 AWS regions), we have to modify the Security Group to allow the new traffic.
> Without modifying the SG, the 95th and 99th percentile latencies for both reads and writes in the cluster are very bad (due to RPC timeouts). On closer inspection, the upgraded node (the 1.2 node) contributes to all of the high latencies whenever it acts as a coordinator node.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)