Posted to commits@cassandra.apache.org by "Dominik Keil (JIRA)" <ji...@apache.org> on 2016/02/17 22:05:18 UTC

[jira] [Commented] (CASSANDRA-10389) Repair session exception Validation failed

    [ https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151180#comment-15151180 ] 

Dominik Keil commented on CASSANDRA-10389:
------------------------------------------

I think we're seeing this issue as well. Running Cassandra 2.2.5. Haven't tried restarting all nodes but will do that now.

We're running incremental repairs (now the default, eh?). While testing them before going to production, we already found that repairing a whole keyspace creates a massive number of open file handles / "anti-compacted" sstables, even though the repair still only works on one CF at a time. This caused some problems, so we're now running repairs one CF at a time and on only one node at a time.
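
For illustration, a minimal sketch of that "one CF at a time, one node at a time" scheduling, written in Python. It assumes `nodetool` is on the PATH and that JMX on each node is reachable via `nodetool -h <host>`; the function names and the host, keyspace and table names are hypothetical placeholders, not our real setup. On 2.2 a plain `nodetool repair <keyspace> <table>` runs incrementally by default.

```python
import subprocess
from typing import Iterable, List


def repair_commands(hosts: Iterable[str], keyspace: str,
                    tables: Iterable[str]) -> List[List[str]]:
    """Build one `nodetool repair` invocation per (host, table) pair.

    The list is ordered host by host, so a node finishes all of its
    tables before the next node is touched.
    """
    tables = list(tables)  # allow generators; reused for every host
    return [
        ["nodetool", "-h", host, "repair", keyspace, table]
        for host in hosts
        for table in tables
    ]


def run_serially(commands: List[List[str]], dry_run: bool = True) -> None:
    """Run the commands strictly one after another.

    With dry_run=True the commands are only printed. Otherwise each
    command must exit with status 0 before the next one starts;
    a failure raises subprocess.CalledProcessError and stops the run.
    """
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmds = repair_commands(["node1", "node2"], "my_keyspace", ["cf_a", "cf_b"])
    run_serially(cmds)  # dry run: prints the four commands in order
```

Because each call blocks until `nodetool` returns, at most one repair is in flight at any time, which is the property that kept the number of open file handles manageable for us.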

We did not have this issue in our testing but are seeing it in production now, nevertheless. What's interesting is that the node on which the repair runs at some point suddenly thrashes its heap (i.e. full heap usage, 65%-85% GC!!!) while at the same time producing huge numbers of tiny, concurrent reads, leading to really bad read latency from disk and a lot of I/O wait.

The bad thing is: This (Cassandra) node becomes so unresponsive that it significantly impacts the performance of the whole cluster (a total of 9 machines, rf 5 / quorum for most reads/writes, rf 2 / one for less important bulk data). Neither the Java driver nor the other nodes, when acting as coordinator, manage to just leave this node alone for a while. As soon as I disable gossip on this node, the rest of the cluster is fine again.

[~slebresne]: I applaud you for your very useful comment.

> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 13 nodes. The extension was done two days ago; the repair was attempted yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace perspectiv with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 for range (-5927186132136652665,-5917344746039874798] failed with error [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] Validation failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now, that's outside of the scope of the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable with failed validation on cblade1 with nodetool scrub perspectiv stock_increment_agg:
> {quote}
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db') (345466609 bytes)
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db') (60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big was not released before the reference was garbage collected
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 - Scrub of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db') complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 - Scrub of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db') complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> After scrubbing, another repair was attempted. It did finish, but with lots of errors from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 for range (5019296454787813261,5021512586040808168] failed with error [repair #db476b51-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (5019296454787813261,5021512586040808168]] Validation failed in /10.YYY (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 for range (-3660233266780784242,-3638577078894365342] failed with error [repair #db482ea1-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] Validation failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 for range (9158857758535272856,9167427882441871745] failed with error [repair #db4a0361-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (9158857758535272856,9167427882441871745]] Validation failed in /10.YYY (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair still failed, this time with the following exception:
> {quote}
> INFO  [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair #ea123bf3-6111-11e5-b992-9f13fa8664c8] Requesting merkle trees for stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8] Session completed with the following error
> org.apache.cassandra.exceptions.RepairException: [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (355657753119264326,366309649129068298]] Validation failed in cblade1.
>         at org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64) ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399) ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158) ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)