Posted to commits@cassandra.apache.org by "Dave Brosius (JIRA)" <ji...@apache.org> on 2013/06/23 04:32:20 UTC

[jira] [Commented] (CASSANDRA-5218) Log explosion when another cluster node is down and remaining node is overloaded.

    [ https://issues.apache.org/jira/browse/CASSANDRA-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691326#comment-13691326 ] 

Dave Brosius commented on CASSANDRA-5218:
-----------------------------------------

If we switched to using logback, it supports duplicate-message filtering out of the box via its DuplicateMessageFilter class. Logback obviously has other benefits over log4j as well, and works seamlessly with slf4j.
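
For illustration only -- a minimal programmatic sketch of wiring that filter up through slf4j/logback (the threshold and cache size below are placeholder values; the same thing can be done declaratively with a turboFilter element in logback.xml):

    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.turbo.DuplicateMessageFilter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class DuplicateFilterSketch
    {
        public static void main(String[] args)
        {
            // slf4j's factory is the logback LoggerContext when logback-classic is on the classpath
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            // drop a message once it has repeated more than a handful of times
            DuplicateMessageFilter filter = new DuplicateMessageFilter();
            filter.setAllowedRepetitions(5);   // placeholder threshold
            filter.setCacheSize(100);          // placeholder size of the recent-message cache
            filter.start();
            context.addTurboFilter(filter);

            Logger log = LoggerFactory.getLogger(DuplicateFilterSketch.class);
            for (int i = 0; i < 10000; i++)
                log.error("Exception in thread ReplicateOnWriteStage");   // only the first few reach the log
        }
    }

Note the filter keys on the raw message string, so repeated parameterized messages collapse into one entry even when their arguments differ -- which is roughly what we'd want for the log storm described below.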
                
> Log explosion when another cluster node is down and remaining node is overloaded.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5218
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.7
>            Reporter: Sergey Olefir
>
> I have a Cassandra 1.1.7 cluster with 4 nodes in 2 datacenters (2+2). Replication is configured as DC1:2,DC2:2 (i.e. every node holds all of the data).
> I am load-testing counter increments at a rate of about 10k per second. All writes are directed to the two nodes in DC1 (the DC2 nodes are basically backup). In total there are 100 separate clients executing 1-2 batch updates per second.
> We wanted to test what happens if one node goes down, so we brought one node down in DC1 (i.e. the node that was handling half of the incoming writes). 
> This led to a complete explosion of logs on the remaining alive node in DC1. 
> There are hundreds of megabytes of logs within an hour all basically saying the same thing: 
> ERROR [ReplicateOnWriteStage:5653390] 2013-01-22 12:44:33,611 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[ReplicateOnWriteStage:5653390,5,main] 
> java.lang.RuntimeException: java.util.concurrent.TimeoutException 
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1275) 
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) 
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 
>         at java.lang.Thread.run(Thread.java:662) 
> Caused by: java.util.concurrent.TimeoutException 
>         at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:311) 
>         at org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:585) 
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1271) 
>         ... 3 more 
> The logs are completely swamped with this and are thus unusable. It may also negatively impact node performance.
> According to Aaron Morton:
> {quote}The error is the coordinator node protecting itself.
> Basically it cannot handle the volume of local writes + the writes for HH.  The number of in flight hints is greater than…
>     private static volatile int maxHintsInProgress = 1024 * Runtime.getRuntime().availableProcessors();{quote}
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/node-down-log-explosion-tp7584932p7584957.html
> I think there are two issues here:
> (a) the same exception occurring for the same reason doesn't need to be spammed into the log many times per second;
> (b) the exception message ought to be clearer about the cause -- i.e. in this case some message about "overload" or "load shedding" might be appropriate.
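
A minimal sketch, with simplified names rather than the real StorageProxy code, of the coordinator-side load shedding Aaron describes in the quote above: once the number of in-flight hints passes the cap, the write path bails out with a bare TimeoutException, which is exactly the uninformative message that issue (b) complains about.

    import java.util.concurrent.TimeoutException;
    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified illustration of the throttle described above -- not the actual Cassandra source.
    public class HintThrottleSketch
    {
        // the cap quoted above: 1024 in-flight hints per available processor
        private static volatile int maxHintsInProgress = 1024 * Runtime.getRuntime().availableProcessors();
        private static final AtomicInteger hintsInProgress = new AtomicInteger();

        static void replicateOnWrite(Runnable localWrite, Runnable hintForDownReplica) throws TimeoutException
        {
            // load shedding: refuse to queue yet another hint for the dead node
            if (hintsInProgress.get() > maxHintsInProgress)
                throw new TimeoutException();   // surfaces in the log with no mention of overload

            hintsInProgress.incrementAndGet();
            try
            {
                localWrite.run();
                hintForDownReplica.run();
            }
            finally
            {
                hintsInProgress.decrementAndGet();
            }
        }
    }

With half the cluster's write traffic suddenly needing hints, that cap is hit continuously, and every rejected ReplicateOnWriteStage task logs the same stack trace -- hence the hundreds of megabytes of log per hour in the report above.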

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira