Posted to commits@cassandra.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2015/08/02 23:16:05 UTC

[jira] [Commented] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters

    [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651187#comment-14651187 ] 

Robert Stupp commented on CASSANDRA-9491:
-----------------------------------------

TL;DR: after some digging, I'm inclined to close this as 'won't fix' in favor of CASSANDRA-5220

Just "snapshot every N minutes" looked nice in the sketch in the description of this ticket. But it doesn't work. Snapshots are created per RepairSession (i.e. per range). To solve that, snapshots would need to be created per column-family (requires a protocol change). I think that solving this in 2.x would be too invasive - so tending to resolve this as 'won't fix'.

Current problems with seq repair and vnodes (assuming 256 vnodes in the examples):
* huge number of very tiny sstables (lots of flushes, see below)
* more compaction runs than necessary due to the number of tiny sstables
* heavy write I/O
** one flush + snapshot per range (vnode) per column-family
** # of flushes/snapshots = # of ranges (vnodes) * # of column-families (So 2560 flushes for 10 column-families per repair. A repair on 10 nodes with 10 tables would cause 25600 flushes)
* heavy read I/O
** merkle-tree is calculated per-table and per-range. (So 2560 merkle-tree calculations for 10 column-families per repair)
** merkle-tree is calculated by scanning the snapshotted sstables - big sstables probably contain all ranges and would be scanned for each range (so 256 times)

The current implementation works as follows (see the sketch after this list):
# One repair (as issued via _nodetool repair_) starts a RepairRunnable (parent session)
# For each individual range to repair, a RepairSession (session) is started asynchronously
## Each RepairSession iterates over the column families to repair and starts an asynchronous RepairJob
### Each RepairJob creates a snapshot with the ID equal to the ID of the RepairSession. A snapshot involves a flush.
### When snapshot tasks have completed, validation starts
### Validation scans all the snapshotted sstables to build the merkle tree
### Finally, sync (the actual repair) runs.
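
A rough Java sketch of that flow, just to illustrate why the snapshot/flush count explodes - the class and method names are simplified stand-ins, not the actual Cassandra code:

{code}
// Hypothetical sketch: one snapshot (and therefore one flush) per range *and*
// per column family, because the snapshot id is the RepairSession id.
import java.util.List;

public class CurrentRepairFlowSketch
{
    static void repairRunnable(List<String> ranges, List<String> columnFamilies)
    {
        String parentSessionId = "parent-1";
        for (String range : ranges)                      // 256 iterations with 256 vnodes
            repairSession(parentSessionId, range, columnFamilies);
    }

    static void repairSession(String parentId, String range, List<String> columnFamilies)
    {
        String sessionId = parentId + '/' + range;
        for (String cf : columnFamilies)                 // one RepairJob per column family
            repairJob(sessionId, range, cf);
    }

    static void repairJob(String sessionId, String range, String cf)
    {
        // snapshot id == session id, so every range triggers its own flush + snapshot
        System.out.printf("flush + snapshot cf=%s range=%s snapshotId=%s%n", cf, range, sessionId);
        // ... followed by validation (merkle tree over the snapshotted sstables) and sync
    }

    public static void main(String[] args)
    {
        repairRunnable(List.of("r1", "r2", "r3"), List.of("cf1", "cf2"));
    }
}
{code}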

One possible approach could be (sketched after the list below):
# One repair (as issued via _nodetool repair_) starts a RepairRunnable (parent session)
# A (new) RepairSnapshotController is used to manage snapshots of a parent-session.
# RepairRunnable issues asynchronous snapshot requests (with _optional_ flush) on all ranges to repair via RepairSnapshotController.
# For each range to repair, a RepairSession is started asynchronously
## For each column family to repair, a RepairJob is started asynchronously
### The RepairJob asks RepairSnapshotController whether the snapshot is complete and still up-to-date (in case the user wants snapshots to expire)
### Validation scans all the snapshotted sstables to build the merkle tree
### Finally, sync (the actual repair) runs.
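
To make the idea above concrete, here is a minimal sketch of how a RepairSnapshotController could look - the API (snapshot/isUpToDate) and the expiry parameter are my assumptions, not existing code:

{code}
// Hypothetical sketch: snapshots are taken once per column family for the whole
// parent session; per-range RepairJobs only check that the snapshot is still valid.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotControllerSketch
{
    static class RepairSnapshotController
    {
        private final Map<String, Long> snapshotTimes = new ConcurrentHashMap<>();

        void snapshot(String cf, boolean flush)
        {
            // flush is optional in this proposal
            snapshotTimes.put(cf, System.currentTimeMillis());
            System.out.printf("snapshot cf=%s flush=%s%n", cf, flush);
        }

        boolean isUpToDate(String cf, long maxAgeMillis)
        {
            Long takenAt = snapshotTimes.get(cf);
            return takenAt != null && System.currentTimeMillis() - takenAt <= maxAgeMillis;
        }
    }

    public static void main(String[] args)
    {
        List<String> ranges = List.of("r1", "r2", "r3");
        List<String> cfs = List.of("cf1", "cf2");

        RepairSnapshotController controller = new RepairSnapshotController();
        cfs.forEach(cf -> controller.snapshot(cf, true));    // once per column family, not per range

        for (String range : ranges)
            for (String cf : cfs)
            {
                if (!controller.isUpToDate(cf, 60_000))       // re-snapshot only if expired
                    controller.snapshot(cf, true);
                System.out.printf("validate + sync cf=%s range=%s%n", cf, range);
            }
    }
}
{code}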

But the former approach still requires scanning the sstables for each range to repair. Big sstables are likely to be scanned by every RepairJob, which leads to the following proposal (sketched after the list):

# One repair (as issued via _nodetool repair_) starts a RepairRunnable (parent session)
# For each _column family_ to repair, an asynchronous RepairSession is started.
## RepairSession issues a (new) validation-request against the involved endpoints.
### Such a new validation-request performs a snapshot with an optional flush (enabled by default).
### The validation-request-handler on the endpoints scans the whole snapshot but builds one merkle-tree per range to repair.
## RepairSession then iterates over the ranges to repair and starts an asynchronous RepairJob per range
### The RepairJob runs the synchronization (the actual repair).
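
A minimal sketch of the per-column-family variant - again with made-up names, and with a toy "token to range" lookup standing in for real sstable scanning and merkle trees:

{code}
// Hypothetical sketch: one snapshot and one sstable scan per column family,
// producing one merkle tree per range in a single pass instead of re-scanning
// the sstables once per range.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerCfValidationSketch
{
    // tokens stand in for rows; a real implementation would scan the snapshotted sstables
    static Map<String, Integer> validateColumnFamily(String cf, List<String> ranges, List<Integer> tokens)
    {
        System.out.printf("snapshot + single scan of cf=%s%n", cf);
        Map<String, Integer> treePerRange = new HashMap<>();                 // range -> (toy) merkle tree
        for (int token : tokens)
        {
            String range = ranges.get(Math.floorMod(token, ranges.size()));  // toy range lookup
            treePerRange.merge(range, token, Integer::sum);                  // stand-in for tree.add(row)
        }
        return treePerRange;
    }

    public static void main(String[] args)
    {
        List<String> ranges = List.of("r1", "r2", "r3");
        Map<String, Integer> trees = validateColumnFamily("cf1", ranges, List.of(11, 27, 35, 42));
        trees.forEach((range, tree) ->
            System.out.printf("range=%s tree=%s (one RepairJob syncs this range)%n", range, tree));
    }
}
{code}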


> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>             Fix For: 2.2.x
>
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool repair -pr}}), statistics show:
> * huge increase in live-sstable-count (approximately doubling it),
> * huge amount of memtable-switches (approx 1200 per node per minute),
> * huge number of flushes (approx 25 per node per minute)
> * memtable-data-size drops to (nearly) 0
> * huge amount of compaction-completed-tasks (60k per minute) and compacted-bytes (25GB per minute)
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (from {{nodetool repair -pr}}), a repair on the ranges from {{getLocalPrimaryRanges(keyspace)}} is initiated. I'll express the scheme in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
> 	foreach columnFamily
> 	{
> 		start async RepairJob
> 		{
> 			if sequentialRepair:
> 				start SnapshotTask against each endpoint (including self)
> 				send tree requests if snapshot successful
> 			else // if parallel repair
> 				send tree requests
> 		}
> 	}
> }
> {code}
> This means that for each sequential repair, a snapshot (including all its implications like flushes, tiny sstables and follow-up compactions) is taken for every range. That means 256 snapshots per column-family per repair on each (involved) endpoint. For about 20 tables, this could mean 5120 snapshots within a very short period of time. You do not notice that amount on the file system, since the _tag_ for the snapshot is always the same - so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column-family. Or am I missing something?
> So basically changing the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach columnFamily
> {
> 	if sequentialRepair:
> 		start SnapshotTask against each endpoint (including self)
> }
> foreach range in ranges:
> {
> 	start async RepairJob
> 	{
> 		send tree requests (if snapshot successful)
> 	}
> }
> {code}
> NB: The code's similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk)
> EDIT: corrected target pseudo-code


