Posted to user@cassandra.apache.org by Andrew Jorgensen <an...@andrewjorgensen.com> on 2017/05/16 14:07:30 UTC

Re: Non-zero nodes are marked as down after restarting cassandra process

Thanks for the info!

When you say "overall stability problems due to some bugs", can you
elaborate on if those were bugs in cassandra that were fixed due to an
upgrade or bugs in your own code and how you used cassandra. If the latter
would  it be possible to highlight what the most impactful fix was from the
usage side.

As far as I can tell there are no dropped messages; there are some pending
compactions and a few Native-Transport-Requests in the "All time blocked"
column.
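
(For reference, I'm reading that from something like the following; the
exact column names vary a bit between versions:

nodetool tpstats
nodetool compactionstats

tpstats lists the per-pool Active/Pending/Blocked/"All time blocked" counts
plus a dropped-message summary at the bottom, and compactionstats shows the
number of pending compactions.)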

Thanks!

Andrew Jorgensen
@ajorgensen

On Wed, Mar 1, 2017 at 12:58 PM, benjamin roth <br...@gmail.com> wrote:

> You should always drain nodes before stopping the daemon whenever
> possible. This avoids commitlog replay on startup, which can take a while.
> But according to your description, commitlog replay does not seem to be
> the cause.
>
> I once had a similar effect: some nodes appeared down for some other nodes
> and up for others. At that time the cluster had overall stability problems
> due to some bugs. Once those bugs were fixed, I haven't seen this effect
> any more.
>
> If that happens again to you, you could check your logs or "nodetool
> tpstats" for dropped messages, watch out for suspicious network-related
> logs and the load of your nodes in general.
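>
> For the network-related logs, a rough starting point (assuming the default
> packaged log location, /var/log/cassandra/system.log) would be something
> like:
>
> grep -i dropped /var/log/cassandra/system.log
> grep -iE 'gossip|outboundtcpconnection|handshak' /var/log/cassandra/system.log
>
> together with "nodetool status" and the OS load on each node.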
>
> 2017-03-01 17:36 GMT+01:00 Ben Dalling <b....@locp.co.uk>:
>
>> Hi Andrew,
>>
>> We were having problems with gossip TCP connections being held open and
>> changed our SOP for stopping cassandra to:
>>
>> nodetool disablegossip
>> nodetool drain
>> service cassandra stop
>>
>> This seemed to shut down gossip cleanly (the nodetool drain is advisable
>> as well) and meant that the node rejoined the cluster fine after issuing
>> "service cassandra start".
>>
>> *Ben*
>>
>> On 1 March 2017 at 16:29, Andrew Jorgensen <an...@andrewjorgensen.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a cassandra cluster running on cassandra 3.0.3 and am seeing some
>>> strange behavior that I cannot explain when restarting cassandra nodes. The
>>> cluster is currently set up in a single datacenter and consists of 55 nodes.
>>> I am currently in the process of restarting nodes in the cluster but have
>>> noticed that after restarting the cassandra process with `service cassandra
>>> stop; service cassandra start`, when the node comes back and I run `nodetool
>>> status` there is usually a non-zero number of nodes in the rest of the
>>> cluster that are marked as DN. If I go to another node in the cluster,
>>> from its perspective all nodes, including the restarted one, are marked as
>>> UN. It seems to take ~15 to 20 minutes before the restarted node is updated
>>> to show all nodes as UN. During those 15 minutes, writes and reads to the
>>> cluster appear to be degraded and do not recover unless I stop the
>>> cassandra process again or wait for all nodes to be marked as UN. The
>>> cluster also has 3 seed nodes, which are up and available the whole time
>>> during this process.
>>>
>>> I have also tried running `nodetool gossipinfo` on the restarted node, and
>>> according to the output all nodes have a status of NORMAL. Has anyone seen
>>> this before, and is there anything I can do to fix or reduce the impact of
>>> restarting a cassandra node?
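>>>
>>> (Concretely, that check is something like `nodetool gossipinfo | grep
>>> STATUS` on the restarted node, which reports STATUS:NORMAL for every
>>> endpoint even while `nodetool status` on the same node still shows some of
>>> them as DN.)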
>>>
>>> Thanks,
>>> Andrew Jorgensen
>>> @ajorgensen
>>>
>>
>>
>

Re: Non-zero nodes are marked as down after restarting cassandra process

Posted by Jeff Jirsa <jj...@apache.org>.

On 2017-05-16 07:07 (-0700), Andrew Jorgensen <an...@andrewjorgensen.com> wrote: 
> Thanks for the info!
> 
> When you say "overall stability problems due to some bugs", can you
> elaborate on whether those were bugs in cassandra that were fixed by an
> upgrade, or bugs in your own code and in how you used cassandra? If the
> latter, would it be possible to highlight what the most impactful fix was
> on the usage side?

For what it's worth, there have been HUNDREDS of bugs fixed in 3.0 since the 3.0.3 release you are running, many of which are fairly important - while upgrading is unlikely to fix the behavior you describe, moving to the latest 3.0 release is probably a good idea.

Anecdotally, the behavior you describe is similar to a condition I saw once at a previous employer on a very different (much older) version of cassandra, and it was accompanied by a few thousand bytes sitting in a tcp send queue long after I'd have expected the connection to be closed. I never really investigated, but if you see it happen again, capturing the output of 'netstat -n' and 'lsof' on the servers involved would help in understanding what's going on (open a jira and upload the output).
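
Something along these lines is usually enough (exact flags vary by OS; 7000 is the default inter-node storage/gossip port, or 7001 if you run it over SSL):

netstat -n | grep :7000
lsof -n -P -i :7000

Connections involving the restarted node with bytes stuck in the Send-Q column long after the restart would be exactly the kind of thing worth attaching to that jira.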



