Posted to user@cassandra.apache.org by Adil <ad...@gmail.com> on 2016/01/12 15:56:20 UTC

electricity outage problem

Hi,

we have two DCs with 5 nodes in each cluster. Yesterday an electricity
outage brought all the nodes down. We restarted the clusters, but when we
run nodetool status on DC1 some nodes show as DN, and the strange thing is
that running the command from different nodes in DC1 does not report the
same nodes as down. We have noticed this message in the log: "received an
invalid gossip generation for peer". Does anyone know how to resolve this
problem? Should we purge the gossip state?

thanks

Adil

Re: electricity outage problem

Posted by Adil <ad...@gmail.com>.
our case is not about accepting connections: some nodes receive a gossip
generation number greater than the local one. I looked at the system tables
peers and local and couldn't find where the local generation is stored.

2016-01-15 17:54 GMT+01:00 daemeon reiydelle <da...@gmail.com>:


Re: electricity outage problem

Posted by daemeon reiydelle <da...@gmail.com>.
Nodes need a delay of about 60-90 seconds before they can start accepting
connections as a seed node. A seed node also needs time to accept a node
starting up and syncing to other nodes (on 10 gigabit the maximum is only
1 or 2 new nodes at a time; on 1 gigabit it can handle at least 3-4 new
nodes connecting). In a large cluster (500 nodes) I have seen this weird
condition where nodetool status shows overlapping subsets of nodes, and the
problem does not go away even after an hour on a 10 gigabit network.



*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Jan 15, 2016 at 9:17 AM, Adil <ad...@gmail.com> wrote:


Re: electricity outage problem

Posted by Adil <ad...@gmail.com>.
Hi,
we did a full restart of the cluster, but nodetool status is still giving
inconsistent info from different nodes: some nodes appear UP from one node
but DOWN from another, and the log still shows the message "received an
invalid gossip generation for peer /x.x.x.x".
The Cassandra version is 2.1.2. We want to execute the purge operation as
explained here:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_gossip_purge.html
but we can't find the peers folder. Should we do it via CQL by deleting the
content of system.peers? And should we do it on all nodes?

thanks


2016-01-12 17:42 GMT+01:00 Jack Krupansky <ja...@gmail.com>:


Re: electricity outage problem

Posted by Jack Krupansky <ja...@gmail.com>.
Sometimes you may have to clear out the saved Gossip state:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_gossip_purge.html

Note the instruction about bringing up the seed nodes first. Normally seed
nodes are only relevant when initially joining a node to a cluster (and
then the Gossip state will be persisted locally), but if you clear the
persisted Gossip state the seed nodes will again be needed to find the rest
of the cluster.

I'm not sure whether a power outage is the same as stopping and restarting
an instance (AWS) in terms of whether the restarted instance retains its
current public IP address.
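For a rough idea, the usual shape of that purge is below. This is a hedged
sketch, not the exact documented procedure: the service commands and the
cassandra-env.sh path are assumptions for a typical package install, so
check the steps against the linked DataStax page for your version.

```shell
# Hedged sketch of clearing saved gossip state on one node (verify the
# exact steps and paths against the DataStax page linked above).
sudo service cassandra stop

# Ask Cassandra to ignore the persisted ring state on the next start,
# so the node rediscovers the cluster through its seed nodes:
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"' \
  | sudo tee -a /etc/cassandra/cassandra-env.sh

# Start seed nodes first, then the remaining nodes:
sudo service cassandra start

# Once nodetool status agrees across the cluster, remove the added
# JVM_OPTS line so later restarts load the ring state normally.
```

The key point, as the doc says, is bringing the seeds up first so the other
nodes have something authoritative to re-learn the ring from.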



-- Jack Krupansky

On Tue, Jan 12, 2016 at 10:02 AM, daemeon reiydelle <da...@gmail.com>
wrote:


Re: electricity outage problem

Posted by daemeon reiydelle <da...@gmail.com>.
This happens when there is insufficient time for nodes coming up to join a
network. It takes a few seconds for a node to come up, e.g. your seed node.
If you tell a node to join a cluster you can get this scenario because of
high network utilization as well. I wait 90 seconds after the first (i.e.
my first seed) node comes up to start the next one. Any nodes that are
seeds need some 60 seconds, so the additional 30 seconds is a buffer.
Additional nodes each wait 60 seconds before joining (although this is a
parallel tree for large clusters).
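The timing above can be sketched as a simple staggered start script. The
host names and the start command are placeholders for your environment; the
90- and 60-second waits are the values described above.

```shell
#!/bin/sh
# Staggered cluster start: seeds first, then the rest, one at a time.
SEEDS="seed1 seed2"
OTHERS="node3 node4 node5"

for h in $SEEDS; do
  ssh "$h" 'sudo service cassandra start'
  sleep 90   # let each seed come up and begin accepting gossip
done

for h in $OTHERS; do
  ssh "$h" 'sudo service cassandra start'
  sleep 60   # stagger joins so the seeds are not overwhelmed
done
```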






On Tue, Jan 12, 2016 at 6:56 AM, Adil <ad...@gmail.com> wrote:
