You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Vasileios Vlachos <va...@gmail.com> on 2016/03/24 12:45:12 UTC

StatusLogger output

Hello,

Environment:
- Cassandra 2.0.17, 8 nodes, 4 per DC
- Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)

Every node seems to be dropping messages (anywhere from 10 to 300) twice a
day. I don't know it this has always been the case, but has definitely been
going for the past month or so. Whenever that happens we get
StatusLogger.java output in the log, which is the state of the node at the
time it dropped messages. This output contains information
similar/identical to nodetool tpstats, but further from that, information
regarding system CF follows as can be seen here: http://ur1.ca/ooan6

How can we use this information to find out what the problem was? I am
specifically referring to the information regarding the system CF. I had a
look in the system tables but I cannot draw anything from that. The output
in the log seems to contain two values (comma separated). What are these
numbers?

I wasn't able to find anything on the web/DataStax docs. Any help would be
greatly appreciated!

Thanks,
Vasilis

Re: StatusLogger output

Posted by Vasileios Vlachos <va...@gmail.com>.

Anyone else any idea on how to interpret StatusLogger output? As Sean said,
this may not help in determining the problem, but it would definitely help
my general understanding.

Thanks,
Bill

On Thu, Mar 24, 2016 at 5:24 PM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Thanks for your help Sean,
>
> The reason StatusLogger messages appear in the logs is usually, as you
> said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
> our case dropped messages are always (so far) due to internal timeouts, not
> due to cross node timeouts (like the sample output in the link I provided
> earlier). I have seen StatusLogger output during low traffic times and I
> cannot say that we seem to have more logs during high-traffic hours.
>
> We use Nagios for monitoring and have several checks for cassandra (we use
> the JMX console for each node). However, most graphs are averaged out. I
> can see some spikes at the times, however, these spikes only go around
> 20-30% of the load we get during high-traffic times. The only time we have
> seen nodes marked down in the logs is when there is some severe cross-DC
> VPN issue, which is not something that happens often and does not correlate
> with StatusLogger output either.
>
> Regarding GC, we only see up to 10 GC pauses per day in the logs (I ofc
> mean over 200ms which is the threshold for logging GC events by default).
> We are actually experimenting with GC these days on one of the nodes, but I
> cannot say this has made things worse/better.
>
> I was hoping that by understanding the StatusLogger output better I'd be
> able to investigate further. We monitor metrics like hints, pending
> tasks, reads/writes per CF, read/write latency/CF, compactions,
> connections/node. If there is anything from the JMX console that you would
> suggest I should be monitoring, please let me know. I was thinking
> compactions may be the reason for this (so, I/O could be the bottleneck),
> but looking at the graphs I can see that when a node compacts its CPU usage
> would only max at around 20-30% and would only add 2-5ms of read/write
> latency per CF (if any).
>
> Thanks,
> Vasilis
>
> On Thu, Mar 24, 2016 at 3:31 PM, <SE...@homedepot.com> wrote:
>
>> I am not sure the status logger output helps determine the problem.
>> However, the dropped mutations and the status logger output is what I see
>> when there is too high of a load on one or more Cassandra nodes. It could
>> be long GC pauses, something reading too much data (a large row or a
>> multi-partition query), or just too many requests for the number of nodes
>> you have. Are you using OpsCenter to monitor the rings? Do you have read or
>> write spikes at the time? Any GC messages in the log. Any nodes going down
>> at the time?
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Vasileios Vlachos [mailto:vasileiosvlachos@gmail.com]
>> *Sent:* Thursday, March 24, 2016 8:13 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: StatusLogger output
>>
>>
>>
>> Just to clarify, I can see line 29 which seems to explain the format
>> (first number ops, second is data), however I don't know they actually
>> mean.
>>
>>
>>
>> Thanks,
>>
>> Vasilis
>>
>>
>>
>> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
>> vasileiosvlachos@gmail.com> wrote:
>>
>> Hello,
>>
>>
>>
>> Environment:
>>
>> - Cassandra 2.0.17, 8 nodes, 4 per DC
>>
>> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>>
>>
>>
>> Every node seems to be dropping messages (anywhere from 10 to 300) twice
>> a day. I don't know it this has always been the case, but has definitely
>> been going for the past month or so. Whenever that happens we get
>> StatusLogger.java output in the log, which is the state of the node at
>> the time it dropped messages. This output contains information
>> similar/identical to nodetool tpstats, but further from that,
>> information regarding system CF follows as can be seen here:
>> http://ur1.ca/ooan6
>>
>>
>>
>> How can we use this information to find out what the problem was? I am
>> specifically referring to the information regarding the system CF. I had a
>> look in the system tables but I cannot draw anything from that. The output
>> in the log seems to contain two values (comma separated). What are these
>> numbers?
>>
>>
>>
>> I wasn't able to find anything on the web/DataStax docs. Any help would
>> be greatly appreciated!
>>
>>
>>
>> Thanks,
>>
>> Vasilis
>>
>>
>>
>> ------------------------------
>>
>> The information in this Internet Email is confidential and may be legally
>> privileged. It is intended solely for the addressee. Access to this Email
>> by anyone else is unauthorized. If you are not the intended recipient, any
>> disclosure, copying, distribution or any action taken or omitted to be
>> taken in reliance on it, is prohibited and may be unlawful. When addressed
>> to our clients any opinions or advice contained in this Email are subject
>> to the terms and conditions expressed in any applicable governing The Home
>> Depot terms of business or client engagement letter. The Home Depot
>> disclaims all responsibility and liability for the accuracy and content of
>> this attachment and for any damages or losses arising from any
>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>> items of a destructive nature, which may be contained in this attachment
>> and shall not be liable for direct, indirect, consequential or special
>> damages in connection with this e-mail message or its attachment.
>>
>
>

Re: StatusLogger output

Posted by Vasileios Vlachos <va...@gmail.com>.

Thanks for your help Sean,

The reason StatusLogger messages appear in the logs is usually, as you
said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
our case dropped messages are always (so far) due to internal timeouts, not
due to cross node timeouts (like the sample output in the link I provided
earlier). I have seen StatusLogger output during low traffic times and I
cannot say that we seem to have more logs during high-traffic hours.

We use Nagios for monitoring and have several checks for cassandra (we use
the JMX console for each node). However, most graphs are averaged out. I
can see some spikes at the times, however, these spikes only go around
20-30% of the load we get during high-traffic times. The only time we have
seen nodes marked down in the logs is when there is some severe cross-DC
VPN issue, which is not something that happens often and does not correlate
with StatusLogger output either.

Regarding GC, we only see up to 10 GC pauses per day in the logs (I ofc
mean over 200ms which is the threshold for logging GC events by default).
We are actually experimenting with GC these days on one of the nodes, but I
cannot say this has made things worse/better.

I was hoping that by understanding the StatusLogger output better I'd be
able to investigate further. We monitor metrics like hints, pending tasks,
reads/writes per CF, read/write latency/CF, compactions, connections/node.
If there is anything from the JMX console that you would suggest I should
be monitoring, please let me know. I was thinking compactions may be the
reason for this (so, I/O could be the bottleneck), but looking at the
graphs I can see that when a node compacts its CPU usage would only max at
around 20-30% and would only add 2-5ms of read/write latency per CF (if
any).

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 3:31 PM, <SE...@homedepot.com> wrote:

> I am not sure the status logger output helps determine the problem.
> However, the dropped mutations and the status logger output is what I see
> when there is too high of a load on one or more Cassandra nodes. It could
> be long GC pauses, something reading too much data (a large row or a
> multi-partition query), or just too many requests for the number of nodes
> you have. Are you using OpsCenter to monitor the rings? Do you have read or
> write spikes at the time? Any GC messages in the log. Any nodes going down
> at the time?
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Vasileios Vlachos [mailto:vasileiosvlachos@gmail.com]
> *Sent:* Thursday, March 24, 2016 8:13 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: StatusLogger output
>
>
>
> Just to clarify, I can see line 29 which seems to explain the format
> (first number ops, second is data), however I don't know they actually
> mean.
>
>
>
> Thanks,
>
> Vasilis
>
>
>
> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
> vasileiosvlachos@gmail.com> wrote:
>
> Hello,
>
>
>
> Environment:
>
> - Cassandra 2.0.17, 8 nodes, 4 per DC
>
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
>
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know it this has always been the case, but has definitely been
> going for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
>
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
>
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
>
>
> Thanks,
>
> Vasilis
>
>
>
> ------------------------------
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>

RE: StatusLogger output

Posted by SE...@homedepot.com.

I am not sure the status logger output helps determine the problem. However, the dropped mutations and the status logger output is what I see when there is too high of a load on one or more Cassandra nodes. It could be long GC pauses, something reading too much data (a large row or a multi-partition query), or just too many requests for the number of nodes you have. Are you using OpsCenter to monitor the rings? Do you have read or write spikes at the time? Any GC messages in the log. Any nodes going down at the time?

Sean Durity

From: Vasileios Vlachos [mailto:vasileiosvlachos@gmail.com]
Sent: Thursday, March 24, 2016 8:13 AM
To: user@cassandra.apache.org
Subject: Re: StatusLogger output

Just to clarify, I can see line 29 which seems to explain the format (first number ops, second is data), however I don't know they actually mean.

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <va...@gmail.com>> wrote:
Hello,

Environment:
- Cassandra 2.0.17, 8 nodes, 4 per DC
- Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)

Every node seems to be dropping messages (anywhere from 10 to 300) twice a day. I don't know it this has always been the case, but has definitely been going for the past month or so. Whenever that happens we get StatusLogger.java output in the log, which is the state of the node at the time it dropped messages. This output contains information similar/identical to nodetool tpstats, but further from that, information regarding system CF follows as can be seen here: http://ur1.ca/ooan6

How can we use this information to find out what the problem was? I am specifically referring to the information regarding the system CF. I had a look in the system tables but I cannot draw anything from that. The output in the log seems to contain two values (comma separated). What are these numbers?

I wasn't able to find anything on the web/DataStax docs. Any help would be greatly appreciated!

Thanks,
Vasilis

________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

Re: StatusLogger output

Posted by Vasileios Vlachos <va...@gmail.com>.

Just to clarify, I can see line 29 which seems to explain the format (first
number ops, second is data), however I don't know they actually mean.

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Hello,
>
> Environment:
> - Cassandra 2.0.17, 8 nodes, 4 per DC
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know it this has always been the case, but has definitely been
> going for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
> Thanks,
> Vasilis
>