Posted to user@cassandra.apache.org by Polytron Feng <li...@gmail.com> on 2013/06/19 10:34:36 UTC
Rolling upgrade from 1.1.12 to 1.2.5 visibility issue
Hi,
We are trying to do a rolling upgrade from 1.0.12 to 1.2.5, but we found that the
1.2.5 node cannot see the other old nodes.
Therefore, we tried upgrading to 1.1.12 first, and that worked.
However, we still saw the same issue when rolling upgrade from 1.1.12 to
1.2.5.
This looks like the issue fixed in
https://issues.apache.org/jira/browse/CASSANDRA-5332, but we still saw it in
1.2.5.
Environment:
OS: CentOS 6
JDK: 6u31
Cluster: 3 nodes for testing, in EC2
Snitch: Ec2MultiRegionSnitch
NetworkTopologyStrategy: strategy_options = { ap-southeast:3 }
We have 3 nodes, and we upgraded 122.248.xxx.xxx to 1.2.5 first; the other 2
nodes are still on 1.1.12.
When we restarted the upgraded node, it saw the other 2 old nodes as
UP in the log.
However, after a few seconds, these 2 nodes were marked as DOWN.
This is the ring info from the 1.2.5 node, 122.248.xxx.xxx:
Note: Ownership information does not include topology; for complete
information, specify a keyspace
Datacenter: ap-southeast
==========
Address          Rack  Status  State   Load      Owns    Token
                                                         113427455640312821154458202477256070486
122.248.xxx.xxx  1b    Up      Normal  69.74 GB  33.33%  1
54.251.xxx.xxx   1b    Down    Normal  69.77 GB  33.33%  56713727820156410577229101238628035243
54.254.xxx.xxx   1b    Down    Normal  70.28 GB  33.33%  113427455640312821154458202477256070486
However, the old 1.1.12 nodes can see the new node:
Note: Ownership information does not include topology, please specify a
keyspace.
Address          DC            Rack  Status  State   Load      Owns    Token
                                                                       113427455640312821154458202477256070486
122.248.xxx.xxx  ap-southeast  1b    Up      Normal  69.74 GB  33.33%  1
54.251.xxx.xxx   ap-southeast  1b    Up      Normal  69.77 GB  33.33%  56713727820156410577229101238628035243
54.254.xxx.xxx   ap-southeast  1b    Up      Normal  70.28 GB  33.33%  113427455640312821154458202477256070486
We enabled TRACE log level to check the gossip-related logs. The log below,
from the 1.2.5 node, shows that the other 2 nodes are UP in the beginning.
They seem to complete the SYN/ACK/ACK2 handshake cycle.
TRACE [GossipStage:1] 2013-06-19 07:44:43,047 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,047 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /54.254.xxx.xxx:1371617084:10967 /54.251.xxx.xxx:1371625851:2055
TRACE [GossipStage:1] 2013-06-19 07:44:43,048 Gossiper.java (line 945) requestAll for /54.254.xxx.xxx
.....
TRACE [GossipStage:1] 2013-06-19 07:44:43,080 GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,080 MessagingService.java (line 601) /122.248.216.142 sending GOSSIP_DIGEST_ACK to 19@/54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,080 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,080 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /54.254.xxx.xxx:1371617084:10978 /54.251.xxx.xxx:1371625851:2066
.....
TRACE [GossipStage:1] 2013-06-19 07:44:43,083 GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,083 MessagingService.java (line 601) /122.248.216.142 sending GOSSIP_DIGEST_ACK to 22@/54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,084 GossipDigestAck2VerbHandler.java (line 38) Received a GossipDigestAck2Message from /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,084 FailureDetector.java (line 168) reporting /54.254.xxx.xxx
INFO [GossipStage:1] 2013-06-19 07:44:43,093 Gossiper.java (line 805) Node /54.254.xxx.xxx is now part of the cluster
TRACE [GossipStage:1] 2013-06-19 07:44:43,094 Gossiper.java (line 808) Adding endpoint state for /54.254.xxx.xxx
DEBUG [GossipStage:1] 2013-06-19 07:44:43,094 MessagingService.java (line 377) Resetting pool for /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:43,094 Gossiper.java (line 764) marking as alive /54.254.xxx.xxx
DEBUG [GossipStage:1] 2013-06-19 07:44:43,095 Gossiper.java (line 770) removing expire time for endpoint : /54.254.xxx.xxx
INFO [GossipStage:1] 2013-06-19 07:44:43,095 Gossiper.java (line 771) InetAddress /54.254.xxx.xxx is now UP
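For anyone less familiar with gossip, the SYN/ACK/ACK2 cycle in the log above can be sketched roughly like this. This is a simplified Python sketch of the digest exchange, not Cassandra's actual classes, and the address, generation, and state values are illustrative stand-ins for the redacted ones in this thread:

```python
def handle_syn(local_state, syn_digests):
    """Receiver of a SYN: for each (endpoint, generation, version) digest,
    either request newer state or ship back our newer state in the ACK."""
    request, delta = [], {}
    for endpoint, generation, version in syn_digests:
        gen, ver, state = local_state.get(endpoint, (0, 0, {}))
        if (generation, version) > (gen, ver):
            # Remote is ahead: ask for everything newer than what we hold.
            request.append((endpoint, gen, ver))
        elif (gen, ver) > (generation, version):
            # We are ahead: include our newer state in the ACK.
            delta[endpoint] = (gen, ver, state)
    return request, delta          # the GossipDigestAck payload

def handle_ack(local_state, request, delta):
    """Sender of the SYN: apply the delta, then answer requests with ACK2."""
    local_state.update(delta)
    return {ep: local_state[ep] for ep, _, _ in request if ep in local_state}

# One round between node A (initiator) and node B; B holds a newer version.
a = {"/54.254.0.1": (1371617084, 10967, {"STATUS": "NORMAL"})}
b = {"/54.254.0.1": (1371617084, 10978, {"STATUS": "NORMAL"})}

syn = [(ep, gen, ver) for ep, (gen, ver, _) in a.items()]   # A -> B (SYN)
request, delta = handle_syn(b, syn)                          # B -> A (ACK)
ack2 = handle_ack(a, request, delta)                         # A -> B (ACK2)
b.update(ack2)
```

Here B's heartbeat version (10978) is ahead of A's view (10967), so the ACK carries B's state back to A and the ACK2 has nothing left to send; each completed round is also what feeds the failure detector on both sides.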
After a few seconds, the log shows that the old nodes did not respond with ACK2
and were marked as DOWN:
TRACE [GossipStage:1] 2013-06-19 07:44:52,121 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:52,123 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /54.254.xxx.xxx:1371617084:10991 /54.251.xxx.xxx:1371625851:2078
TRACE [GossipStage:1] 2013-06-19 07:44:52,124 GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to /54.254.xxx.xxx
TRACE [GossipStage:1] 2013-06-19 07:44:52,124 MessagingService.java (line 601) /122.248.xxx.xxx sending GOSSIP_DIGEST_ACK to 52@/54.254.xxx.xxx
TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 116) My heartbeat is now 25
TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 350) Gossip Digests are : /54.251.xxx.xxx:1371625851:2069 /122.248.xxx.xxx:1371627882:25 /54.254.xxx.xxx:1371617084:10980
TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 Gossiper.java (line 505) Sending a GossipDigestSyn to /54.254.xxx.xxx ...
TRACE [GossipTasks:1] 2013-06-19 07:44:52,358 MessagingService.java (line 601) /122.248.xxx.xxx sending GOSSIP_DIGEST_SYN to 54@/54.254.xxx.xxx
TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 Gossiper.java (line 154) Performing status check ...
TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 189) PHI for /54.254.xxx.xxx : 8.05616263930532
TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 193) notifying listeners that /54.254.xxx.xxx is down
TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 194) intervals: 500.0 mean: 500.0
TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 Gossiper.java (line 781) marking as down /54.254.xxx.xxx
INFO [GossipTasks:1] 2013-06-19 07:44:52,360 Gossiper.java (line 785) InetAddress /54.254.xxx.xxx is now DOWN
DEBUG [GossipTasks:1] 2013-06-19 07:44:52,360 MessagingService.java (line 377) Resetting pool for /54.254.xxx.xxx
We checked that the old nodes try to send SYN to the 1.2.5 node but do not
receive any GossipDigestAckMessage from it.
Any ideas on how to solve this issue?
Thank you!
Re: Rolling upgrade from 1.1.12 to 1.2.5 visibility issue
Posted by Polytron Feng <li...@gmail.com>.
Hi aaron,
Thank you for your reply. We tried increasing the PHI threshold but still hit
the same issue. We used Ec2Snitch and PropertyFileSnitch instead, and both work
without this problem. It seems to happen only with the Ec2MultiRegionSnitch
configuration. Although we can work around this problem with PropertyFileSnitch,
we hit another bug, the EOFException in
https://issues.apache.org/jira/browse/CASSANDRA-5476. We will stay on 1.1.12
first and wait for the fix for issue 5476.
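For the record, the PropertyFileSnitch workaround amounts to setting endpoint_snitch: PropertyFileSnitch in cassandra.yaml and giving every node a matching topology file. The entries below are only a sketch: the DC/rack values mirror this thread's ring output, and the xxx addresses are the redacted ones, so treat them as illustrative.

```properties
# conf/cassandra-topology.properties on every node, format IP=DC:rack;
# the DC name must match the keyspace's strategy_options (ap-southeast here).
122.248.xxx.xxx=ap-southeast:1b
54.251.xxx.xxx=ap-southeast:1b
54.254.xxx.xxx=ap-southeast:1b
default=ap-southeast:1b
```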
Thank you!
On Thu, Jun 20, 2013 at 5:49 PM, aaron morton <aa...@thelastpickle.com> wrote:
Re: Rolling upgrade from 1.1.12 to 1.2.5 visibility issue
Posted by aaron morton <aa...@thelastpickle.com>.
I once had something like this; looking at your logs I do not think it's the same thing, but here is a post on it: http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition/
It's a little different in 1.2, but the GossipDigestAckVerbHandler (and ACK2) should be calling Gossiper.instance.notifyFailureDetector, which will result in the FailureDetector being called. This will keep the remote node marked as up. It looks like this is happening.
> TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java (line 189) PHI for /54.254.xxx.xxx : 8.05616263930532
The default phi_convict_threshold is 8, so this node thinks the other is just sick enough to be marked as down.
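To make that concrete: in rough terms, the accrual detector's phi is the time since the last heartbeat divided by the mean inter-arrival interval, scaled by log10(e). A sketch, assuming that model; the 500 ms mean comes from the "intervals: 500.0 mean: 500.0" line, and the roughly 9.3 s gap comes from the timestamps in the TRACE output:

```python
import math

def phi(ms_since_last_heartbeat, mean_interval_ms):
    # Accrual failure detector with an exponential inter-arrival model:
    # phi = -log10(P(silence at least this long)) = (t / mean) * log10(e).
    return (ms_since_last_heartbeat / mean_interval_ms) * math.log10(math.e)

# Last report from /54.254.xxx.xxx at 07:44:43,084, status check at
# 07:44:52,359 -> about 9275 ms of silence, with a 500.0 ms mean interval.
p = phi(9275, 500.0)
print(round(p, 3))  # ~8.056, just over the default threshold of 8
```

Plugging in the numbers from the log reproduces the PHI value 8.05616263930532 almost exactly, which is why the node is convicted the moment the check runs.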
As a workaround, try increasing phi_convict_threshold to 12. I'm not sure why the 1.2 node thinks this, and not sure if anything has changed.
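In cassandra.yaml on the 1.2.5 node, that would look like the following (just the suggested workaround value, not a tested recommendation):

```yaml
# cassandra.yaml: raise the accrual failure detector's conviction threshold
# so gossip jitter during the mixed-version upgrade does not convict peers.
phi_convict_threshold: 12
```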
I used to think there was a way to dump the phi values for nodes, but I cannot find it. If you call dumpInterArrivalTimes on the org.apache.cassandra.net:type=FailureDetector MBean it will dump a file in the temp dir called "failuredetector-*" with the arrival times for messages from the other nodes. That may help.
Cheers
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com
On 19/06/2013, at 8:34 PM, Polytron Feng <li...@gmail.com> wrote: