Posted to user@cassandra.apache.org by Björn Hachmann <bj...@metrigo.de> on 2015/09/23 16:28:17 UTC

Huge amounts of hinted handoffs for counter table

Today I realized that one of the nodes in our Cassandra cluster (2.1.7) is
storing a lot of hints (>80GB) and I fail to see a convincing way to deal
with them.

From the system.log:
INFO  [ScheduledTasks:1] 2015-09-23 14:27:06,692 StatusLogger.java:115 - system.hints                      276,1010945
INFO  [ScheduledTasks:1] 2015-09-23 14:38:06,722 StatusLogger.java:115 - system.hints                      968,2968163
INFO  [ScheduledTasks:1] 2015-09-23 14:38:41,742 StatusLogger.java:115 - system.hints                     1317,3799471
INFO  [ScheduledTasks:1] 2015-09-23 14:49:16,775 StatusLogger.java:115 - system.hints                     1519,4399905
INFO  [ScheduledTasks:1] 2015-09-23 14:49:36,793 StatusLogger.java:115 - system.hints                     2247,6514649
INFO  [ScheduledTasks:1] 2015-09-23 14:49:41,811 StatusLogger.java:115 - system.hints                     2247,6514649
INFO  [ScheduledTasks:1] 2015-09-23 14:49:51,830 StatusLogger.java:115 - system.hints                     2368,6733293
INFO  [ScheduledTasks:1] 2015-09-23 15:00:41,885 StatusLogger.java:115 - system.hints                    283,450166810
INFO  [ScheduledTasks:1] 2015-09-23 15:12:16,919 StatusLogger.java:115 - system.hints                       232,970964
INFO  [ScheduledTasks:1] 2015-09-23 15:12:31,934 StatusLogger.java:115 - system.hints                      581,2034388
INFO  [ScheduledTasks:1] 2015-09-23 15:23:46,973 StatusLogger.java:115 - system.hints                       234,321566
INFO  [ScheduledTasks:1] 2015-09-23 15:24:01,988 StatusLogger.java:115 - system.hints                       368,935634
INFO  [ScheduledTasks:1] 2015-09-23 15:35:12,039 StatusLogger.java:115 - system.hints                       264,636164
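
To follow the trend in these figures over time, the excerpt can be parsed with a short script. The following is a minimal Python sketch and not part of the original message: the name `parse_hints_status` is hypothetical, and it assumes the two comma-separated numbers are the memtable operation count and data size that StatusLogger reports per table.

```python
import re
from datetime import datetime

# Matches a StatusLogger entry for system.hints, whether or not the mail
# client wrapped the line after the " -" separator (\s+ also spans newlines).
HINTS_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+StatusLogger\.java:\d+\s+-\s+"
    r"system\.hints\s+(\d+),(\d+)"
)


def parse_hints_status(log_text):
    """Return a list of (timestamp, ops, data) tuples for system.hints.

    Assumption: the first number is the memtable operation count and the
    second is the memtable data size.
    """
    entries = []
    for ts, ops, data in HINTS_LINE.findall(log_text):
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        entries.append((when, int(ops), int(data)))
    return entries


if __name__ == "__main__":
    # Two entries copied from the excerpt above.
    sample = (
        "INFO  [ScheduledTasks:1] 2015-09-23 14:27:06,692 StatusLogger.java:115 -\n"
        "system.hints                      276,1010945\n"
        "INFO  [ScheduledTasks:1] 2015-09-23 15:00:41,885 StatusLogger.java:115 -\n"
        "system.hints                    283,450166810\n"
    )
    for when, ops, data in parse_hints_status(sample):
        print(when.isoformat(), ops, data)
```

Run against the whole system.log, this makes spikes such as the 15:00:41 entry easy to spot.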

The state of the cluster seems stable, at least we do not have any
downtimes (sometimes the load on one of the nodes is quite high).

We had a look into the table system.hints and learnt that most hints are
for one of the nodes in our 2nd datacenter, and that most of the mutations
are increments to one of our counter tables; these increments are very
frequent.

There seem to be no other suspicious log messages in the log apart from a
few dropped events.

We have several questions:
- What could be the reason that only one of the nodes has hints for only
one target node, although every other node should sometimes be coordinator
for these queries as well?
- Is there a way to turn off hinted handoff at the table level or at the
data center level?
- What could we do to investigate the cause of this issue deeper?

Thank you!
Kind regards
Björn Hachmann

Re: Huge amounts of hinted handoffs for counter table

Posted by Venkatesh Arivazhagan <ve...@gmail.com>.

What is your replication factor and write consistency? :)

Re: Huge amounts of hinted handoffs for counter table

Posted by Björn Hachmann <bj...@metrigo.de>.

Thank you for your time!

Our replication factor is 'DC1': '2', 'DC2': '2'.
Consistency is set to LOCAL_ONE for these queries.

Indeed, timeouts might be a problem, as some of the nodes in DC2 are under
high load from time to time.
Is there some counter (e.g. via JMX) I could monitor to verify this
assumption?
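
One way to check this without a JMX client might be the dropped-message counts that `nodetool tpstats` prints. The following is a minimal Python sketch under stated assumptions: the function name `parse_dropped` is hypothetical, and the parser assumes the output contains a "Message type / Dropped" table with one row per message type (for example COUNTER_MUTATION). Verify the layout against the output of your own Cassandra version.

```python
def parse_dropped(tpstats_output):
    """Extract {message_type: dropped_count} from `nodetool tpstats` text.

    Assumption: the dropped-message table starts with a header line whose
    first two words are "Message type", followed by "<TYPE> <count>" rows.
    """
    dropped = {}
    in_section = False
    for line in tpstats_output.splitlines():
        parts = line.split()
        if parts[:2] == ["Message", "type"]:
            in_section = True  # rows after this header are dropped counts
            continue
        if in_section and len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped
```

Feeding it the captured output of `nodetool tpstats` from the loaded DC2 nodes at regular intervals would show whether the dropped count for counter mutations rises together with the hint volume.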


>
>> - What could we do to investigate the cause of this issue deeper?
>>
>
> Are the hints being successfully delivered? It sounds like not..
>

No, I do not think so. Actually, we are not really interested in this data
in DC2; we only replicate it because this table is in that keyspace for
historic reasons.
It seems we need to migrate that table to a different keyspace, doesn't it?

Kind regards
Björn



2015-09-23 22:56 GMT+02:00 Robert Coli <rc...@eventbrite.com>:

> On Wed, Sep 23, 2015 at 7:28 AM, Björn Hachmann <
> bjoern.hachmann@metrigo.de> wrote:
>
>> Today I realized that one of the nodes in our Cassandra cluster (2.1.7)
>> is storing a lot of hints (>80GB) and I fail to see a convincing way to
>> deal with them.
>> ...
>> We had a look into the table system.hints and from there we learnt that
>> most hints are for one of the nodes in our 2nd datacenter and most of the
>> mutations are increments to one of our counter tables which are very
>> frequent.
>>
>
> This is probably timeouts on the increment creating your hints.
>
>
>> We have several questions:
>> - What could be the reason that only one of the nodes has hints for only
>> one target node, although every other node should be coordinator for these
>> queries sometimes also?
>>
>
> That sounds unexpected, I don't have a good answer.
>
>
>> - Is there a way to turn off hinted handoff on a table level or on data
>> center level?
>>
>
> No.
> 
>
>> - What could we do to investigate the cause of this issue deeper?
>>
>
> Are the hints being successfully delivered? It sounds like not..
>
> =Rob
>
>

Re: Huge amounts of hinted handoffs for counter table

Posted by Robert Coli <rc...@eventbrite.com>.

On Wed, Sep 23, 2015 at 7:28 AM, Björn Hachmann <bj...@metrigo.de>
wrote:

> Today I realized that one of the nodes in our Cassandra cluster (2.1.7) is
> storing a lot of hints (>80GB) and I fail to see a convincing way to deal
> with them.
> ...
> We had a look into the table system.hints and from there we learnt that
> most hints are for one of the nodes in our 2nd datacenter and most of the
> mutations are increments to one of our counter tables which are very
> frequent.
>

This is probably timeouts on the increment creating your hints.


> We have several questions:
> - What could be the reason that only one of the nodes has hints for only
> one target node, although every other node should be coordinator for these
> queries sometimes also?
>

That sounds unexpected, I don't have a good answer.


> - Is there a way to turn off hinted handoff on a table level or on data
> center level?
>

No.


> - What could we do to investigate the cause of this issue deeper?
>

Are the hints being successfully delivered? It sounds like not..

=Rob