Posted to commits@cassandra.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2016/12/06 12:19:59 UTC

[jira] [Updated] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters

     [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Stupp updated CASSANDRA-9491:
------------------------------------
    Assignee:     (was: Robert Stupp)

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool repair -pr}}), the statistics show:
> * a huge increase of live-sstable-count (approx. doubling),
> * a huge number of memtable-switches (approx. 1200 per node per minute),
> * a huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size dropping to (nearly) 0,
> * a huge number of compaction-completed-tasks (approx. 60k per minute) and compacted-bytes (approx. 25 GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
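> To watch two of these metrics while a repair runs, here is a minimal, self-contained JMX probe (plain Java; the MBean names are the 2.1-era ones, where the metric type is still {{ColumnFamily}}, and the keyspace/table names are placeholders):
> {code}
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
> 
> public class RepairMetricsProbe
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Default Cassandra JMX port is 7199; adjust host/keyspace/table.
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
>         try (JMXConnector jmxc = JMXConnectorFactory.connect(url))
>         {
>             MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
>             ObjectName liveSSTables = new ObjectName(
>                     "org.apache.cassandra.metrics:type=ColumnFamily,"
>                     + "keyspace=myks,scope=mytable,name=LiveSSTableCount");
>             ObjectName memtableSwitches = new ObjectName(
>                     "org.apache.cassandra.metrics:type=ColumnFamily,"
>                     + "keyspace=myks,scope=mytable,name=MemtableSwitchCount");
> 
>             // Gauges expose the "Value" attribute, counters expose "Count".
>             System.out.println("live sstables:     " + mbs.getAttribute(liveSSTables, "Value"));
>             System.out.println("memtable switches: " + mbs.getAttribute(memtableSwitches, "Count"));
>         }
>     }
> }
> {code}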
> The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (invoked by {{nodetool repair -pr}}), a repair is initiated on the ranges returned by {{getLocalPrimaryRanges(keyspace)}}. The scheme in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
> 	foreach columnFamily
> 	{
> 		start async RepairJob
> 		{
> 			if sequentialRepair:
> 				start SnapshotTask against each endpoint (including self)
> 				send tree requests if snapshot successful
> 			else // if parallel repair
> 				send tree requests
> 		}
> 	}
> }
> {code}
> This means that for each sequential repair, a snapshot (including all its implications: flushes, tiny sstables, follow-up compactions) is taken for every range. That means 256 snapshots per column-family per repair on each involved endpoint; with about 20 tables, that is 5120 snapshots within a very short period of time. You do not notice this volume on the file system, since the snapshot _tag_ is always the same, so all snapshots end up in the same directory.
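> To make the arithmetic explicit, a tiny self-contained sketch (the numbers are the illustrative ones from above; they are not read from a live cluster):
> {code}
> public class SnapshotCount
> {
>     public static void main(String[] args)
>     {
>         int numRanges = 256; // default num_tokens, i.e. vnode ranges per node
>         int numTables = 20;  // column families in the keyspace
> 
>         // Current flow: one SnapshotTask per (range, column family) pair.
>         int currentSnapshots = numRanges * numTables; // 5120
> 
>         // Proposed flow: one SnapshotTask per column family.
>         int proposedSnapshots = numTables; // 20
> 
>         System.out.printf("current: %d, proposed: %d%n", currentSnapshots, proposedSnapshots);
>     }
> }
> {code}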
> IMO it would be sufficient to snapshot only once per column-family. Or am I missing something?
> So the pseudo-code would basically change to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach columnFamily
> {
> 	if sequentialRepair:
> 		start SnapshotTask against each endpoint (including self)
> }
> foreach range in ranges:
> {
> 	start async RepairJob
> 	{
> 		send tree requests (if snapshot successful)
> 	}
> }
> {code}
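> The same restructuring as a self-contained Java sketch ({{SnapshotTask}} and {{RepairJob}} exist in the real code base, but all types and signatures below are simplified stand-ins, not the actual Cassandra APIs):
> {code}
> import java.util.List;
> 
> public class ProposedRepairFlow
> {
>     interface Endpoint { void snapshot(String keyspace, String table); }
>     interface Validator { void sendTreeRequests(String keyspace, String range); }
> 
>     private final List<Endpoint> endpoints;
>     private final Validator validator;
> 
>     ProposedRepairFlow(List<Endpoint> endpoints, Validator validator)
>     {
>         this.endpoints = endpoints;
>         this.validator = validator;
>     }
> 
>     void repair(String keyspace, List<String> tables, List<String> ranges, boolean sequential)
>     {
>         if (sequential)
>         {
>             // One snapshot per column family per endpoint (tables * endpoints
>             // tasks) instead of one per (range, column family) pair
>             // (ranges * tables * endpoints tasks).
>             for (String table : tables)
>                 for (Endpoint endpoint : endpoints)
>                     endpoint.snapshot(keyspace, table);
>         }
> 
>         // Validation (tree requests) still runs range by range; each range
>         // reads from the single table-level snapshot taken above.
>         for (String range : ranges)
>             validator.sendTreeRequests(keyspace, range);
>     }
> }
> {code}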
> NB: The code is similar in all versions (checked: 2.0.11, 2.0.15, 2.1, 2.2, trunk).
> EDIT: corrected target pseudo-code


