Posted to user@cassandra.apache.org by Bernardino Mota <be...@knowledgeworks.pt> on 2016/01/21 14:09:52 UTC
Nodes fail to reconnect after several hours of network failure.
Using Cassandra 2.2.4 on Ubuntu.
We have a two-node cluster whose nodes were unable to connect to each other for several hours because of network problems. The database continued to be used on one of the nodes, with writes being stored as hints, as expected.
But now that the network is OK again and the machines can communicate, each node reports the other as DOWN and does not replicate.
When the network came back up we started to see this in the log files: "Convicting /192.168.1.102 with status NORMAL - alive false"
It seems the nodes convict each other and then fail to reconnect.
Is there some configuration that we might be missing? Any help would be much appreciated.
- NODE 192.168.1.10 - "nodetool status"
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.1.102 12.02 MB 256 ? ff906882-8224-40ac-8cdb-98f5e725814d rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 41.87 MB 256 ? 51650afd-84dd-4e25-a6f0-13627858d5dc rack1
- NODE 192.168.1.102 - "nodetool status"
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.102 12.4 MB 256 ? ff906882-8224-40ac-8cdb-98f5e725814d rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.1.10 26.31 MB 256 ? 51650afd-84dd-4e25-a6f0-13627858d5dc rack1
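When comparing the two views, it can help to extract the down nodes mechanically from each node's output. The following is only a minimal sketch in Python: the function names `parse_status` and `down_nodes` are made up for illustration, and the parser assumes the row layout shown above (a two-letter status/state code followed by the address).

```python
import re

# A node row starts with a status letter (U/D) and a state letter (N/L/J/M),
# followed by whitespace and the node address.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)\s")


def parse_status(text):
    """Return (datacenter, address, code) tuples from 'nodetool status' text."""
    nodes = []
    datacenter = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            continue
        match = NODE_ROW.match(line)
        if match:
            code, address = match.groups()
            nodes.append((datacenter, address, code))
    return nodes


def down_nodes(text):
    """Return the addresses whose status letter is D (down)."""
    return [addr for _, addr, code in parse_status(text) if code.startswith("D")]


# Output copied from the view of node 192.168.1.10 above.
SAMPLE = """\
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.1.102 12.02 MB 256 ? ff906882-8224-40ac-8cdb-98f5e725814d rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 41.87 MB 256 ? 51650afd-84dd-4e25-a6f0-13627858d5dc rack1
"""

if __name__ == "__main__":
    print(down_nodes(SAMPLE))
```

Running the same function on the output from each node makes the symmetric disagreement explicit: each side lists the other as down.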
Re: Nodes fail to reconnect after several hours of network failure.
Posted by Mark Curtis <ma...@datastax.com>.
It's worth checking the connectivity on each node to see whether the
connections are established. For example:
# netstat -ant | awk 'NR==2;/7001/'
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 172.31.10.93:7001 0.0.0.0:* LISTEN
tcp 0 0 172.31.10.93:56771 172.31.10.93:7001 ESTABLISHED
tcp 0 0 172.31.10.93:7001 54.183.204.110:42231 ESTABLISHED
tcp 0 0 172.31.10.93:52031 54.183.204.110:7001 ESTABLISHED
tcp 0 0 172.31.10.93:50759 54.183.204.110:7001 ESTABLISHED
tcp 0 0 172.31.10.93:38986 172.31.10.93:7001 ESTABLISHED
tcp 0 0 172.31.10.93:7001 172.31.10.93:42408 ESTABLISHED
tcp 0 0 172.31.10.93:7001 172.31.10.93:38986 ESTABLISHED
tcp 0 0 172.31.10.93:42408 172.31.10.93:7001 ESTABLISHED
tcp 0 0 172.31.10.93:7001 172.31.10.93:56771 ESTABLISHED
tcp 0 0 172.31.10.93:7001 54.183.204.110:37491 ESTABLISHED
Note that I'm using 7001 here because my cluster uses SSL; you can use 7000
for the standard (non-SSL) gossip port.
Thanks
Mark
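The netstat check above can also be scripted, for instance to confirm from each node that the other node's address shows up as an established peer. This is a rough Python sketch rather than a tested tool: the function name `established_peers` is hypothetical, the input is assumed to be IPv4 `netstat -ant` text with one connection per line, and the sample rows are taken from the output above.

```python
def established_peers(netstat_text, port=7000):
    """Return the remote IPs that have an ESTABLISHED TCP connection
    in which either end uses the given port."""
    peers = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        _, _, local_port = local.rpartition(":")
        remote_ip, _, remote_port = remote.rpartition(":")
        if str(port) in (local_port, remote_port):
            peers.add(remote_ip)
    return peers


SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 172.31.10.93:7001 0.0.0.0:* LISTEN
tcp 0 0 172.31.10.93:7001 54.183.204.110:42231 ESTABLISHED
tcp 0 0 172.31.10.93:52031 54.183.204.110:7001 ESTABLISHED
tcp 0 0 172.31.10.93:38986 172.31.10.93:7001 ESTABLISHED
"""

if __name__ == "__main__":
    # 7001 matches the SSL port used in the example; pass 7000 otherwise.
    print(sorted(established_peers(SAMPLE, port=7001)))
```

If the other node's address is missing from the result on either side, the inter-node connection has presumably not been re-established, whatever the network layer reports.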
Re: Nodes fail to reconnect after several hours of network failure.
Posted by Bernardino Mota <be...@knowledgeworks.pt>.
There is nothing strange in the logs, and "nodetool gossipinfo" seems OK:
./nodetool gossipinfo
/192.168.1.10
generation:1453316804
heartbeat:206518
STATUS:18:NORMAL,-1003341236369672970
LOAD:206420:4.3533596E7
SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
DC:8:DC2
RACK:10:rack1
RELEASE_VERSION:4:2.2.4
INTERNAL_IP:6:192.168.1.10
RPC_ADDRESS:3:127.0.0.1
SEVERITY:206517:0.0
NET_VERSION:1:9
HOST_ID:2:51650afd-84dd-4e25-a6f0-13627858d5dc
RPC_READY:49:true
TOKENS:17:<hidden>
/192.168.1.102
generation:1453316986
heartbeat:84622
STATUS:28:NORMAL,-1085177681742913545
LOAD:84535:1.2606418E7
SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
DC:8:DC1
RACK:10:rack1
RELEASE_VERSION:4:2.2.4
INTERNAL_IP:6:10.0.2.10
RPC_ADDRESS:3:127.0.0.1
SEVERITY:84624:0.0
NET_VERSION:1:9
HOST_ID:2:ff906882-8224-40ac-8cdb-98f5e725814d
RPC_READY:98:true
TOKENS:27:<hidden>
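For comparing this output across nodes, the text can be parsed into one dictionary per endpoint. The Python sketch below is an illustration only: `parse_gossipinfo` and `internal_ip_mismatches` are hypothetical names, and the parser assumes the `NAME:version:value` layout shown above. As one possible use, it lists endpoints whose advertised INTERNAL_IP differs from the endpoint address, which is the case for /192.168.1.102 in this output; nothing in this thread establishes whether that difference is related to the problem.

```python
def parse_gossipinfo(text):
    """Parse 'nodetool gossipinfo' text into {endpoint: {field: value}}."""
    endpoints = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            # A new endpoint section, e.g. "/192.168.1.10".
            current = endpoints.setdefault(line[1:], {})
            continue
        if current is None or ":" not in line:
            continue
        key, _, rest = line.partition(":")
        if key in ("generation", "heartbeat"):
            current[key] = int(rest)
        else:
            # Application states look like NAME:version:value; keep the value.
            _, _, value = rest.partition(":")
            current[key] = value
    return endpoints


def internal_ip_mismatches(endpoints):
    """Return {endpoint: INTERNAL_IP} where the two addresses differ."""
    return {
        endpoint: state["INTERNAL_IP"]
        for endpoint, state in endpoints.items()
        if state.get("INTERNAL_IP") not in (None, endpoint)
    }


# A subset of the fields from the output above.
SAMPLE = """\
/192.168.1.10
generation:1453316804
heartbeat:206518
STATUS:18:NORMAL,-1003341236369672970
DC:8:DC2
INTERNAL_IP:6:192.168.1.10
/192.168.1.102
generation:1453316986
heartbeat:84622
STATUS:28:NORMAL,-1085177681742913545
DC:8:DC1
INTERNAL_IP:6:10.0.2.10
"""

if __name__ == "__main__":
    info = parse_gossipinfo(SAMPLE)
    print(info["192.168.1.102"]["STATUS"])
    print(internal_ip_mismatches(info))
```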
> On 21 Jan 2016, at 13:17, Adil <ad...@gmail.com> wrote:
>
> Hi,
> do you see any message related to gossip info?
Re: Nodes fail to reconnect after several hours of network failure.
Posted by Adil <ad...@gmail.com>.
Hi,
do you see any message related to gossip info?