Posted to user@cassandra.apache.org by Sergey Olefir <so...@gmail.com> on 2013/01/22 14:03:18 UTC

node down = log explosion?

I have a Cassandra 1.1.7 cluster with 4 nodes in 2 datacenters (2+2).
Replication is configured as DC1:2,DC2:2 (i.e. every node holds the entire
data).
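
(For reference, that placement is the sort of thing declared along these lines
in cassandra-cli; the keyspace name below is purely illustrative:)

    create keyspace counter_test
        with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
        and strategy_options = {DC1:2, DC2:2};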

I am load-testing counter increments at the rate of about 10k per second.
All writes are directed to the two nodes in DC1 (the DC2 nodes are basically
backups). In total there are 100 separate clients executing 1-2 batch updates
per second.
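
(For scale, assuming the batches are roughly uniform, that works out to
something like:

    100 clients x 1-2 batches/sec   =  100-200 batches/sec
    10,000 increments/sec / 100-200 =  roughly 50-100 increments per batch)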

We wanted to test what happens if one node goes down, so we brought one node
down in DC1 (i.e. the node that was handling half of the incoming writes).

This led to a complete explosion of logs on the remaining alive node in DC1.

There are hundreds of megabytes of logs within an hour all basically saying
the same thing:
ERROR [ReplicateOnWriteStage:5653390] 2013-01-22 12:44:33,611
AbstractCassandraDaemon.java (line 135) Exception in thread
Thread[ReplicateOnWriteStage:5653390,5,main]
java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1275)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.TimeoutException
        at
org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:311)
        at
org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:585)
        at
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1271)
        ... 3 more


The logs are completely swamped with this and are thus unusable. Of course
logs should report errors, but we don't need hundreds of megabytes of this :)
Is there anything that can be done to reduce the amount of this spam? In
addition to making the logs unusable, I strongly suspect this spam makes the
server unable to accept as many increments as it otherwise could.





Re: node down = log explosion?

Posted by aaron morton <aa...@thelastpickle.com>.
> and would it really hurt anything to add something like "can't
> handle load" to the exception message?
Feel free to add a ticket with your experience. 
The event you triggered is a safety valve to stop the server failing. 

> - My total replication factor is 4 over two DCs -- I suppose you mean 3
> replicas in each DC?
Yes, 3 in each DC. 

Right now the cluster-wide QUORUM is 3 and you cannot achieve that in one DC. So if your code uses QUORUM it will fail if there is a partition between the two DCs. 

Normally people use LOCAL_QUORUM or maybe EACH_QUORUM for writes. The LOCAL_QUORUM for a DC with RF 2 is 2. So if your code uses either of those you do not have any redundancy. 

If you have RF 3 in each of the two DCs, using LOCAL_QUORUM or EACH_QUORUM means you can handle one down node in each DC, and each DC can operate independently if needed.
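
For reference, the numbers work out with the usual quorum formula
(quorum = RF/2 + 1, integer division), using the replication factors from this thread:

    cluster-wide QUORUM, total RF 2+2 = 4:  4/2 + 1 = 3  -> needs both DCs, fails on a DC partition
    LOCAL_QUORUM with RF 2 per DC:          2/2 + 1 = 2  -> both local replicas must be up
    LOCAL_QUORUM with RF 3 per DC:          3/2 + 1 = 2  -> one local replica can be down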

> - Does that mean I'll have to run at least 4 nodes in each DC? (3 for RF:3
> and one additional in case one fails)
3 nodes and RF 3 is ok. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 23/01/2013, at 8:35 PM, Sergey Olefir <so...@gmail.com> wrote:

> Thanks!
> 
> Node writing to log because it cannot handle load is much different than
> node writing to log "just because". Although the amount of logging is still
> excessive and would it really hurt anything to add something like "can't
> handle load" to the exception message?
> 
> On the subject of RF:3 -- could you please elaborate?
> - Why RF:3 is important? (vs e.g. 2)
> - My total replication factor is 4 over two DCs -- I suppose you mean 3
> replicas in each DC?
> - Does that mean I'll have to run at least 4 nodes in each DC? (3 for RF:3
> and one additional in case one fails)
> 
> (and again -- thanks Aaron! You've been helping me A LOT on this list.)
> Best regards,
> Sergey
> 
> 
> aaron morton wrote
>>> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the
>>> entire
>>> data).
>> I really recommend using RF 3. 
>> 
>> 
>> The error is the coordinator node protecting itself. 
>> 
>> Basically it cannot handle the volume of local writes + the writes for HH. 
>> The number of in flight hints is greater than…
>> 
>>    private static volatile int maxHintsInProgress = 1024 *
>> Runtime.getRuntime().availableProcessors();
>> 
>> You may be able to work around this by reducing the max_hint_window_in_ms
>> in the yaml file so that hints are no longer recorded once a node has been
>> down for more than, say, 1 minute. 
>> 
>> Anyways I would say your test showed that the current cluster does not
>> have sufficient capacity to handle the write load with one node down and
>> HH enabled at the current level. You can either add more nodes, use nodes
>> with more cores, adjust the HH settings, or reduce the throughput. 
>> 
>> 
>>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>>> for
>> 
>> perhaps the excessive logging could be handled better, please add a ticket
>> when you have time. 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 23/01/2013, at 2:12 PM, Rob Coli <rcoli@palominodb.com> wrote:
>> 
>>> On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir <solf.lists@...> wrote:
>>>> Do you have a suggestion as to what could be a better fit for counters?
>>>> Something that can also replicate across DCs and survive link breakdown
>>>> between nodes (across DCs)? (and no, I don't need 100.00% precision
>>>> (although it would be nice obviously), I just need to be "pretty close"
>>>> for
>>>> the values of "pretty")
>>> 
>>> In that case, Cassandra counters are probably fine.
>>> 
>>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>>> for
>>>> more info here, perhaps there's some configuration or something that I
>>>> just
>>>> don't know about.
>>> 
>>> Excepting on replicateOnWrite stage seems pretty unambiguous to me,
>>> and unexpected. YMMV?
>>> 
>>> =Rob
>>> 
>>> -- 
>>> =Robert Coli
>>> AIM&GTALK - rcoli@palominodb.com
>>> YAHOO - rcoli.palominob
>>> SKYPE - rcoli_palominodb
> 
> 
> 
> 
> 


Re: node down = log explosion?

Posted by Sergey Olefir <so...@gmail.com>.
Thanks!

A node writing to the log because it cannot handle load is much different from
a node writing to the log "just because". The amount of logging is still
excessive, though, and would it really hurt anything to add something like
"can't handle load" to the exception message?

On the subject of RF:3 -- could you please elaborate?
- Why is RF:3 important? (vs. e.g. 2)
- My total replication factor is 4 over two DCs -- I suppose you mean 3
replicas in each DC?
- Does that mean I'll have to run at least 4 nodes in each DC? (3 for RF:3
and one additional in case one fails)

(and again -- thanks Aaron! You've been helping me A LOT on this list.)
Best regards,
Sergey


aaron morton wrote
>> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the
>> entire
>> data).
> I really recommend using RF 3. 
> 
> 
> The error is the coordinator node protecting itself. 
> 
> Basically it cannot handle the volume of local writes + the writes for HH. 
> The number of in flight hints is greater than…
> 
>     private static volatile int maxHintsInProgress = 1024 *
> Runtime.getRuntime().availableProcessors();
> 
> You may be able to work around this by reducing the max_hint_window_in_ms
> in the yaml file so that hints are no longer recorded once a node has been
> down for more than, say, 1 minute. 
> 
> Anyways I would say your test showed that the current cluster does not
> have sufficient capacity to handle the write load with one node down and
> HH enabled at the current level. You can either add more nodes, use nodes
> with more cores, adjust the HH settings, or reduce the throughput. 
> 
> 
>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>> for
> 
> perhaps the excessive logging could be handled better, please add a ticket
> when you have time. 
> 
> Cheers
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/01/2013, at 2:12 PM, Rob Coli <rcoli@palominodb.com> wrote:
> 
>> On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir <solf.lists@...> wrote:
>>> Do you have a suggestion as to what could be a better fit for counters?
>>> Something that can also replicate across DCs and survive link breakdown
>>> between nodes (across DCs)? (and no, I don't need 100.00% precision
>>> (although it would be nice obviously), I just need to be "pretty close"
>>> for
>>> the values of "pretty")
>> 
>> In that case, Cassandra counters are probably fine.
>> 
>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>> for
>>> more info here, perhaps there's some configuration or something that I
>>> just
>>> don't know about.
>> 
>> Excepting on replicateOnWrite stage seems pretty unambiguous to me,
>> and unexpected. YMMV?
>> 
>> =Rob
>> 
>> -- 
>> =Robert Coli
>> AIM&GTALK - rcoli@palominodb.com
>> YAHOO - rcoli.palominob
>> SKYPE - rcoli_palominodb






Re: node down = log explosion?

Posted by aaron morton <aa...@thelastpickle.com>.
> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the entire
> data).
I really recommend using RF 3. 


The error is the coordinator node protecting itself. 

Basically it cannot handle the volume of local writes + the writes for HH (hinted handoff).  The number of in-flight hints is greater than…

    private static volatile int maxHintsInProgress = 1024 * Runtime.getRuntime().availableProcessors();
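
In rough terms the guard behaves like the sketch below (a paraphrase, not the
verbatim 1.1.7 StorageProxy code, and the counter name is assumed): once the
number of in-flight hints goes over that cap, the write is refused, which is
the TimeoutException you see in the logs.

    import java.util.concurrent.TimeoutException;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch only: paraphrases the overload guard, not the exact StorageProxy source.
    final class HintOverloadGuardSketch
    {
        private static final int MAX_HINTS_IN_PROGRESS =
                1024 * Runtime.getRuntime().availableProcessors();

        // Called on the write path when a replica is down and a hint would be stored.
        // Refusing the write here is what surfaces as the TimeoutException in the logs.
        static void checkOverload(AtomicInteger totalHintsInProgress) throws TimeoutException
        {
            if (totalHintsInProgress.get() > MAX_HINTS_IN_PROGRESS)
                throw new TimeoutException();
        }
    }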

You may be able to work around this by reducing the max_hint_window_in_ms in the yaml file so that hints are no longer recorded once a node has been down for more than, say, 1 minute. 
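
To illustrate, in cassandra.yaml that would look something like this (the
one-minute value is only the example above, not a recommendation; the shipped
default is measured in hours):

    # Stop collecting hints for a node once it has been unreachable this long.
    # 60000 ms = 1 minute, per the example above.
    max_hint_window_in_ms: 60000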

Anyways I would say your test showed that the current cluster does not have sufficient capacity to handle the write load with one node down and HH enabled at the current level. You can either add more nodes, use nodes with more cores, adjust the HH settings, or reduce the throughput. 


>> On the subject of bug report -- I probably will -- but I'll wait a bit for

Perhaps the excessive logging could be handled better; please add a ticket when you have time. 

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 23/01/2013, at 2:12 PM, Rob Coli <rc...@palominodb.com> wrote:

> On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir <so...@gmail.com> wrote:
>> Do you have a suggestion as to what could be a better fit for counters?
>> Something that can also replicate across DCs and survive link breakdown
>> between nodes (across DCs)? (and no, I don't need 100.00% precision
>> (although it would be nice obviously), I just need to be "pretty close" for
>> the values of "pretty")
> 
> In that case, Cassandra counters are probably fine.
> 
>> On the subject of bug report -- I probably will -- but I'll wait a bit for
>> more info here, perhaps there's some configuration or something that I just
>> don't know about.
> 
> Excepting on replicateOnWrite stage seems pretty unambiguous to me,
> and unexpected. YMMV?
> 
> =Rob
> 
> -- 
> =Robert Coli
> AIM&GTALK - rcoli@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb


Re: node down = log explosion?

Posted by Rob Coli <rc...@palominodb.com>.
On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir <so...@gmail.com> wrote:
> Do you have a suggestion as to what could be a better fit for counters?
> Something that can also replicate across DCs and survive link breakdown
> between nodes (across DCs)? (and no, I don't need 100.00% precision
> (although it would be nice obviously), I just need to be "pretty close" for
> the values of "pretty")

In that case, Cassandra counters are probably fine.

> On the subject of bug report -- I probably will -- but I'll wait a bit for
> more info here, perhaps there's some configuration or something that I just
> don't know about.

Excepting (i.e. throwing exceptions) on the replicateOnWrite stage seems pretty
unambiguous to me, and unexpected. YMMV?

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: node down = log explosion?

Posted by Sergey Olefir <so...@gmail.com>.
Do you have a suggestion as to what could be a better fit for counters?
Something that can also replicate across DCs and survive link breakdown
between nodes (across DCs)? (and no, I don't need 100.00% precision
(although it would be nice obviously), I just need to be "pretty close" for
the values of "pretty")

On the subject of bug report -- I probably will -- but I'll wait a bit for
more info here, perhaps there's some configuration or something that I just
don't know about. 


Rob Coli wrote
> On Tue, Jan 22, 2013 at 5:03 AM, Sergey Olefir <solf.lists@...> wrote:
>> I am load-testing counter increments at the rate of about 10k per second.
> 
> Do you need highly performant counters that count accurately, without
> meaningful chance of over-count? If so, Cassandra's counters are
> probably not ideal.
> 
>> We wanted to test what happens if one node goes down, so we brought one
>> node
>> down in DC1 (i.e. the node that was handling half of the incoming
>> writes).
>> ...
>> This led to a complete explosion of logs on the remaining alive node in
>> DC1.
> 
> I agree, this level of exception logging during replicateOnWrite
> (which is called every time a counter is incremented) seems like a
> bug. I would file a bug at the Apache JIRA.
> 
> =Rob
> 
> -- 
> =Robert Coli
> AIM&GTALK - rcoli@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb






Re: node down = log explosion?

Posted by Rob Coli <rc...@palominodb.com>.
On Tue, Jan 22, 2013 at 5:03 AM, Sergey Olefir <so...@gmail.com> wrote:
> I am load-testing counter increments at the rate of about 10k per second.

Do you need highly performant counters that count accurately, without
meaningful chance of over-count? If so, Cassandra's counters are
probably not ideal.

> We wanted to test what happens if one node goes down, so we brought one node
> down in DC1 (i.e. the node that was handling half of the incoming writes).
> ...
> This led to a complete explosion of logs on the remaining alive node in DC1.

I agree, this level of exception logging during replicateOnWrite
(which is called every time a counter is incremented) seems like a
bug. I would file a bug at the Apache JIRA.

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb