
Posted to user@cassandra.apache.org by Bernardino Mota <be...@knowledgeworks.pt> on 2016/01/21 14:09:52 UTC

Nodes fail to reconnect after several hours of network failure.

Using Cassandra 2.2.4 on Ubuntu.

We have a cluster with two nodes that, for several hours, failed to connect to each other due to network problems. The database continued to be used on one of the nodes, with writes being stored in the hints file as expected.

But now that the network is OK again and the machines can communicate, we see that each node reports the other as DOWN and does not replicate.

When the network came back up, we started to see this in the log files: "Convicting /192.168.1.102 with status NORMAL - alive false"

It seems each node evicts the other and then fails to reconnect.

Is there some configuration that we might be missing? Any help would be much appreciated.

 

- NODE 192.168.1.10 - "nodetool status"

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
DN  192.168.1.102  12.02 MB   256          ?       ff906882-8224-40ac-8cdb-98f5e725814d  rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  192.168.1.10   41.87 MB   256          ?       51650afd-84dd-4e25-a6f0-13627858d5dc  rack1



- NODE 192.168.1.102 - "nodetool status"

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  192.168.1.102  12.4 MB    256          ?       ff906882-8224-40ac-8cdb-98f5e725814d  rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
DN  192.168.1.10   26.31 MB   256          ?       51650afd-84dd-4e25-a6f0-13627858d5dc  rack1
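
For anyone who wants to watch for this condition from a script, here is a minimal sketch that pulls the down nodes out of "nodetool status" output. The function name down_nodes is made up for illustration and is not part of nodetool; the sketch only assumes that data rows start with a two-letter Status/State code (UN, DN, ...) followed by the address, as in the listings above.

```shell
# Hypothetical helper (not part of nodetool): list the addresses that
# "nodetool status" reports as down. Reads the status output on stdin.
# Data rows start with a two-letter Status/State code; a leading D means Down.
down_nodes() {
    awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}

# Usage on a live node (assumes nodetool is on the PATH):
#   nodetool status | down_nodes
```

Running it on both nodes and comparing the results shows quickly whether the two views of the cluster disagree, as they do here.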

Re: Nodes fail to reconnect after several hours of network failure.

Posted by Mark Curtis <ma...@datastax.com>.

It's worth checking the connectivity on each node to see whether the
connections are established:

For example:

# netstat -ant | awk 'NR==2;/7001/'
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 172.31.10.93:7001       0.0.0.0:*               LISTEN
tcp        0      0 172.31.10.93:56771      172.31.10.93:7001       ESTABLISHED
tcp        0      0 172.31.10.93:7001       54.183.204.110:42231    ESTABLISHED
tcp        0      0 172.31.10.93:52031      54.183.204.110:7001     ESTABLISHED
tcp        0      0 172.31.10.93:50759      54.183.204.110:7001     ESTABLISHED
tcp        0      0 172.31.10.93:38986      172.31.10.93:7001       ESTABLISHED
tcp        0      0 172.31.10.93:7001       172.31.10.93:42408      ESTABLISHED
tcp        0      0 172.31.10.93:7001       172.31.10.93:38986      ESTABLISHED
tcp        0      0 172.31.10.93:42408      172.31.10.93:7001       ESTABLISHED
tcp        0      0 172.31.10.93:7001       172.31.10.93:56771      ESTABLISHED
tcp        0      0 172.31.10.93:7001       54.183.204.110:37491    ESTABLISHED

Note that I'm using 7001 here because my cluster uses SSL, but you can use
7000 for the standard gossip port.
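
The same check can be turned into a count, which is easier to compare across nodes. This is only a sketch: the function name established_on_port is hypothetical, and it assumes the usual "netstat -ant" column layout shown above (local address in column 4, foreign address in column 5, state in column 6).

```shell
# Hypothetical helper: count ESTABLISHED TCP connections that involve a given
# port, reading "netstat -ant" output on stdin.
# $1 is the port to look for (7000, or 7001 when SSL is in use).
established_on_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" {
            # Columns 4 and 5 are the local and foreign address as ip:port;
            # the port is whatever follows the last colon.
            n = split($4, l, ":"); m = split($5, f, ":")
            if (l[n] == port || f[m] == port) count++
        }
        END { print count + 0 }'
}

# Usage on a live node:
#   netstat -ant | established_on_port 7000
```

A result of 0 on a node while the other node is reachable would suggest that no inter-node connection has been re-established.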


Thanks


Mark

On 21 January 2016 at 14:08, Bernardino Mota <
bernardino.mota@knowledgeworks.pt> wrote:

> In the logs nothing strange but “nodetool gossipinfo” seems OK
>
>  ./nodetool gossipinfo
> /192.168.1.10
>   generation:1453316804
>   heartbeat:206518
>   STATUS:18:NORMAL,-1003341236369672970
>   LOAD:206420:4.3533596E7
>   SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
>   DC:8:DC2
>   RACK:10:rack1
>   RELEASE_VERSION:4:2.2.4
>   INTERNAL_IP:6:192.168.1.10
>   RPC_ADDRESS:3:127.0.0.1
>   SEVERITY:206517:0.0
>   NET_VERSION:1:9
>   HOST_ID:2:51650afd-84dd-4e25-a6f0-13627858d5dc
>   RPC_READY:49:true
>   TOKENS:17:<hidden>
> /192.168.1.102
>   generation:1453316986
>   heartbeat:84622
>   STATUS:28:NORMAL,-1085177681742913545
>   LOAD:84535:1.2606418E7
>   SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
>   DC:8:DC1
>   RACK:10:rack1
>   RELEASE_VERSION:4:2.2.4
>   INTERNAL_IP:6:10.0.2.10
>   RPC_ADDRESS:3:127.0.0.1
>   SEVERITY:84624:0.0
>   NET_VERSION:1:9
>   HOST_ID:2:ff906882-8224-40ac-8cdb-98f5e725814d
>   RPC_READY:98:true
>   TOKENS:27:<hidden>

Re: Nodes fail to reconnect after several hours of network failure.

Posted by Bernardino Mota <be...@knowledgeworks.pt>.

There is nothing strange in the logs, and “nodetool gossipinfo” seems OK:

 ./nodetool gossipinfo
/192.168.1.10
  generation:1453316804
  heartbeat:206518
  STATUS:18:NORMAL,-1003341236369672970
  LOAD:206420:4.3533596E7
  SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
  DC:8:DC2
  RACK:10:rack1
  RELEASE_VERSION:4:2.2.4
  INTERNAL_IP:6:192.168.1.10
  RPC_ADDRESS:3:127.0.0.1
  SEVERITY:206517:0.0
  NET_VERSION:1:9
  HOST_ID:2:51650afd-84dd-4e25-a6f0-13627858d5dc
  RPC_READY:49:true
  TOKENS:17:<hidden>
/192.168.1.102
  generation:1453316986
  heartbeat:84622
  STATUS:28:NORMAL,-1085177681742913545
  LOAD:84535:1.2606418E7
  SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
  DC:8:DC1
  RACK:10:rack1
  RELEASE_VERSION:4:2.2.4
  INTERNAL_IP:6:10.0.2.10
  RPC_ADDRESS:3:127.0.0.1
  SEVERITY:84624:0.0
  NET_VERSION:1:9
  HOST_ID:2:ff906882-8224-40ac-8cdb-98f5e725814d
  RPC_READY:98:true
  TOKENS:27:<hidden>
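
One detail in this output is that the INTERNAL_IP reported for /192.168.1.102 is 10.0.2.10, which differs from the endpoint address. Whether that is related to the reconnect problem is not established in this thread. As a sketch only, a comparison like this can be scripted; the function name internal_ip_mismatches is hypothetical, and it assumes IPv4 addresses and endpoint lines of the form "/address" as printed above.

```shell
# Hypothetical helper: read "nodetool gossipinfo" output on stdin and print
# every endpoint whose INTERNAL_IP value differs from the endpoint address.
# Assumes IPv4 addresses and endpoint lines that start with "/".
internal_ip_mismatches() {
    awk '
        /^\// { endpoint = substr($1, 2); next }   # strip the leading slash
        $1 ~ /^INTERNAL_IP:/ {
            n = split($1, parts, ":")               # NAME:version:value
            if (parts[n] != endpoint)
                print endpoint, "INTERNAL_IP=" parts[n]
        }'
}

# Usage on a live node (assumes nodetool is on the PATH):
#   nodetool gossipinfo | internal_ip_mismatches
```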
  
 



Re: Nodes fail to reconnect after several hours of network failure.

Posted by Adil <ad...@gmail.com>.

Hi,
Do you see any messages related to gossip info?

