Posted to user@cassandra.apache.org by Joel Samuelsson <sa...@gmail.com> on 2016/02/23 15:08:47 UTC

Nodes go down periodically

Our nodes go down periodically, around 1-2 times each day. Downtime is from
<1 second to 30 or so seconds.

INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP

I find nothing odd in the logs around the same time. I logged a ping with
timestamp and checked during the same time and saw nothing weird (ping is
less than 2ms at all times).

Does anyone have any suggestions as to why this might happen?

Best regards,
Joel
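
To put numbers on the outages described above, the DOWN/UP pairs can be extracted from system.log. The following is a minimal sketch, not something posted in the thread: it assumes each log entry sits on a single line (as it does in the log file itself, before mail wrapping), and the function name `downtimes` is made up for illustration.

```python
import re
from datetime import datetime

# Matches Gossiper state changes in the format quoted in this message.
EVENT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) Gossiper\.java \(line \d+\) "
    r"InetAddress (/\S+) is now (UP|DOWN)"
)

def downtimes(lines):
    """Yield (address, down_time, seconds_down) for each DOWN followed by an UP."""
    went_down = {}
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        address, state = match.group(2), match.group(3)
        if state == "DOWN":
            went_down[address] = stamp
        elif address in went_down:
            start = went_down.pop(address)
            yield address, start, (stamp - start).total_seconds()

# The two log entries from this message, joined onto one line each.
sample = [
    "INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) "
    "InetAddress /109.74.13.67 is now DOWN",
    "INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java "
    "(line 978) InetAddress /109.74.13.67 is now UP",
]

for address, start, seconds in downtimes(sample):
    print(f"{address} down at {start} for {seconds:.3f} s")
```

Feeding the whole log file to `downtimes` (for example via `open(path)`) would list every outage with its duration, which makes it easier to compare them against other events on the host.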

Re: Nodes go down periodically

Posted by Joel Samuelsson <sa...@gmail.com>.

"Is it only one node at a time that goes down, and at widely dispersed
times?"
It is a two node cluster so both nodes consider the other node down at the
same time.

These are the times the latest few days:
INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
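
To see whether the DOWN events are regularly spaced, the gaps between them can be computed. This is a minimal sketch using only the timestamps listed above; the helper name `gaps_in_hours` is made up for illustration.

```python
from datetime import datetime

# Timestamps of the DOWN events listed in this message.
down_events = [
    "2016-02-19 05:06:21,087",
    "2016-02-19 14:33:38,424",
    "2016-02-20 07:21:25,626",
    "2016-02-20 11:34:46,766",
    "2016-02-21 08:00:07,518",
    "2016-02-21 10:36:58,788",
    "2016-02-22 07:10:40,304",
    "2016-02-22 10:05:14,896",
    "2016-02-23 08:59:05,392",
    "2016-02-23 12:22:59,562",
]

def gaps_in_hours(stamps):
    """Return the time between consecutive events, in hours."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]

for stamp, gap in zip(down_events[1:], gaps_in_hours(down_events)):
    print(f"{stamp}  {gap:5.1f} h since the previous DOWN")
```

If the gaps or the times of day cluster, that would point towards something scheduled on the hosts rather than random network loss.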


2016-02-23 18:01 GMT+01:00 daemeon reiydelle <da...@gmail.com>:

> If you can, do a few short runs of the Cassandra stress test (maybe 10m
> records each; delete the default schema between executions) against your
> production cluster (replication=3, force quorum to 3). Look for max latency
> in the tens of SECONDS. If your devops team is running a monitoring tool
> that looks at the network, look for timeouts/retries/errors/lost packets,
> etc. during the run. Worst case, you need to run netstat against the
> relevant NIC, e.g. every 10 seconds on the Cassandra stress node, and look
> for jumps in these counts. If monitoring is enabled, look at the monitor's
> results for ALL of your nodes. At least one is having some issues.

Re: Nodes go down periodically

Posted by daemeon reiydelle <da...@gmail.com>.

If you can, do a few short runs of the Cassandra stress test (maybe 10m
records each; delete the default schema between executions) against your
production cluster (replication=3, force quorum to 3). Look for max latency
in the tens of SECONDS. If your devops team is running a monitoring tool
that looks at the network, look for timeouts/retries/errors/lost packets,
etc. during the run. Worst case, you need to run netstat against the
relevant NIC, e.g. every 10 seconds on the Cassandra stress node, and look
for jumps in these counts. If monitoring is enabled, look at the monitor's
results for ALL of your nodes. At least one is having some issues.


*.......*



*Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872*
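
The periodic counter check suggested above can also be scripted instead of running netstat by hand. The following is a rough sketch under stated assumptions, not a tool mentioned in the thread: it assumes a Linux host, where /proc/net/snmp exposes the TCP counters behind `netstat -s`, and the function names and the choice of counters to watch are made up for illustration.

```python
import time

def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp content as a dict of ints."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]  # first row holds names, second row values
    return {name: int(value) for name, value in zip(header[1:], values[1:])}

def jumps(before, after, names=("RetransSegs", "InErrs", "AttemptFails")):
    """Return the counters in `names` that changed, with their increase."""
    return {n: after[n] - before[n] for n in names if after[n] != before[n]}

def watch(path="/proc/net/snmp", interval=10, rounds=6):
    """Print a timestamped line whenever a watched counter moves."""
    with open(path) as handle:
        previous = parse_tcp_counters(handle.read())
    for _ in range(rounds):
        time.sleep(interval)
        with open(path) as handle:
            current = parse_tcp_counters(handle.read())
        changed = jumps(previous, current)
        if changed:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), changed)
        previous = current
```

Calling `watch(interval=10, rounds=360)` on each node would cover an hour in 10-second steps; the timestamps can then be lined up with the Gossiper DOWN entries.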

On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> The reality of modern distributed systems is that connectivity between
> nodes is never guaranteed and distributed software must be able to cope
> with occasional absence of connectivity. GC and network connectivity are
> the two issues that a lot of us are most familiar with. There may be others
> - but most technical problems on a node would be clearly logged on that
> node. If you see a lapse of connectivity no more than once or twice a day,
> consider yourselves lucky.
>
> Is it only one node at a time that goes down, and at widely dispersed
> times?
>
> How many nodes?
>
> -- Jack Krupansky

Re: Nodes go down periodically

Posted by Jack Krupansky <ja...@gmail.com>.

The reality of modern distributed systems is that connectivity between
nodes is never guaranteed and distributed software must be able to cope
with occasional absence of connectivity. GC and network connectivity are
the two issues that a lot of us are most familiar with. There may be others
- but most technical problems on a node would be clearly logged on that
node. If you see a lapse of connectivity no more than once or twice a day,
consider yourselves lucky.

Is it only one node at a time that goes down, and at widely dispersed times?

How many nodes?

-- Jack Krupansky

On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <samuelsson.joel@gmail.com> wrote:

> Hi,
>
> Version is 2.0.17.
> Yes, these are VMs in the cloud though I'm fairly certain they are on a
> LAN rather than WAN. They are both in the same data centre physically. The
> phi_convict_threshold is set to default. I'd rather find the root cause of
> the problem than just hiding it by not convicting a node if it isn't
> responding though. If pings are <2 ms without a single ping missed in
> several days, I highly doubt that network is the reason for the downtime.
>
> Best regards,
> Joel

Re: Nodes go down periodically

Posted by Joel Samuelsson <sa...@gmail.com>.

Hi,

Version is 2.0.17.
Yes, these are VMs in the cloud though I'm fairly certain they are on a LAN
rather than WAN. They are both in the same data centre physically. The
phi_convict_threshold is set to default. I'd rather find the root cause of
the problem than just hiding it by not convicting a node if it isn't
responding though. If pings are <2 ms without a single ping missed in
several days, I highly doubt that network is the reason for the downtime.

Best regards,
Joel

2016-02-23 16:39 GMT+01:00 <SE...@homedepot.com>:

> You didn’t mention version, but I saw this kind of thing very often in the
> 1.1 line. Often this is connected to network flakiness. Are these VMs? In
> the cloud? Connected over a WAN? You mention that ping seems fine. Take a
> look at the phi_convict_threshold in cassandra.yaml. You may need to
> increase it to reduce the UP/DOWN flapping behavior.
>
> Sean Durity

RE: Nodes go down periodically

Posted by SE...@homedepot.com.

You didn’t mention version, but I saw this kind of thing very often in the 1.1 line. Often this is connected to network flakiness. Are these VMs? In the cloud? Connected over a WAN? You mention that ping seems fine. Take a look at the phi_convict_threshold in cassandra.yaml. You may need to increase it to reduce the UP/DOWN flapping behavior.

Sean Durity
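
For a feel of what raising phi_convict_threshold changes, here is a simplified model of the phi calculation. It is a sketch under assumptions, not Cassandra's actual code: it assumes heartbeat gaps are modelled as exponentially distributed around a mean interval, that gossip runs about once per second, and that the default threshold is 8. The function names are made up for illustration.

```python
import math

# With exponentially distributed gaps, phi = -log10(P(gap > t)) = t / (mean * ln 10).
PHI_FACTOR = 1.0 / math.log(10.0)

def phi(seconds_since_last_heartbeat, mean_interval):
    """Suspicion level after a given silence, for a given mean heartbeat gap."""
    return PHI_FACTOR * seconds_since_last_heartbeat / mean_interval

def silence_needed(threshold, mean_interval):
    """Seconds without a heartbeat before phi reaches the threshold."""
    return threshold * mean_interval * math.log(10.0)

# Assumed values: 1 s mean gap; 8 as the default threshold, 12 as a raised one.
for threshold in (8, 12):
    seconds = silence_needed(threshold, mean_interval=1.0)
    print(f"threshold {threshold}: convicted after about {seconds:.1f} s of silence")
```

Under this model a node is only convicted after many seconds without a heartbeat being processed, so a higher threshold hides longer stalls rather than removing them.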

From: Joel Samuelsson [mailto:samuelsson.joel@gmail.com]
Sent: Tuesday, February 23, 2016 9:41 AM
To: user@cassandra.apache.org
Subject: Re: Nodes go down periodically

Hi,

Thanks for your reply.

I have debug logging on and see no GC pauses that are that long. GC pauses are all well below 1s and 99 times out of 100 below 100ms.
Do I need to enable GC log options to see the pauses?
I see plenty of these lines:
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections
as well as a few CMS GC log lines.

Best regards,
Joel

________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

Re: Nodes go down periodically

Posted by Joel Samuelsson <sa...@gmail.com>.

Hi,

Thanks for your reply.

I have debug logging on and see no GC pauses that are that long. GC pauses
are all well below 1s and 99 times out of 100 below 100ms.
Do I need to enable GC log options to see the pauses?
I see plenty of these lines:
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections
as well as a few CMS GC log lines.

Best regards,
Joel
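
Rather than eyeballing the GCInspector lines, the reported durations can be pulled out and sorted. This is a minimal sketch assuming the message format shown above; the function names and the 200 ms cut-off are made up for illustration. Note that these lines report collection time only, so other JVM stop-the-world pauses would not show up here.

```python
import re

# Matches the tail of the GCInspector message quoted in this mail.
GC_LINE = re.compile(r"GC for (\w+): (\d+) ms for (\d+) collections")

def gc_pauses(lines):
    """Return (collector, milliseconds, collections) for every GCInspector line."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match:
            pauses.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return pauses

def over(pauses, threshold_ms=200):
    """Return the pauses longer than threshold_ms, longest first."""
    return sorted((p for p in pauses if p[1] > threshold_ms), key=lambda p: -p[1])

# The line quoted in this message.
sample = [
    "DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) "
    "GC for ParNew: 24 ms for 1 collections",
]

pauses = gc_pauses(sample)
print("longest:", max(pauses, key=lambda p: p[1]))
print("over 200 ms:", over(pauses))
```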

2016-02-23 15:14 GMT+01:00 Hannu Kröger <hk...@gmail.com>:

> Hi,
>
> Those are probably GC pauses. Memory tuning is probably needed. Check the
> parameters that you already have customised if they make sense.
>
> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>
> Hannu

Re: Nodes go down periodically

Posted by Hannu Kröger <hk...@gmail.com>.

Hi,

Those are probably GC pauses. Memory tuning is probably needed. Check the parameters that you already have customised if they make sense.

http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html

Hannu


> On 23 Feb 2016, at 16:08, Joel Samuelsson <sa...@gmail.com> wrote:
> 
> Our nodes go down periodically, around 1-2 times each day. Downtime is from <1 second to 30 or so seconds.
> 
> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
> INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP
> 
> I find nothing odd in the logs around the same time. I logged a ping with timestamp and checked during the same time and saw nothing weird (ping is less than 2ms at all times).
> 
> Does anyone have any suggestions as to why this might happen?
> 
> Best regards,
> Joel