Posted to commits@cassandra.apache.org by "Carl Yeksigian (JIRA)" <ji...@apache.org> on 2015/12/07 23:03:11 UTC

[jira] [Commented] (CASSANDRA-6696) Partition sstables by token range

    [ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045837#comment-15045837 ] 

Carl Yeksigian commented on CASSANDRA-6696:
-------------------------------------------

Overall, this feature looks really good. The biggest concern I have is that currently a partition's size is only limited by the total disk size; with this change, the largest partition will be limited to the space of the disk that owns its token range, which is a change we should be calling out in NEWS.

- What purpose does perDiskFlushExecutor serve for us? In the past, the number of flush executors has determined how many concurrent flushes we can run, while this seems like it would restrict that parallelism. Also, if we were using a different partitioner, we would only ever be running through a single flush executor.

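For reference, this is roughly the shape I would expect (a hypothetical sketch, not the patch's code; the directory count is a stand-in for however the data directories are enumerated, and it is typed as {{ExecutorService[]}} per the nit below):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one single-threaded flush executor per data directory.
// With a partitioner that maps everything to a single directory, every flush
// funnels through one executor, which is the parallelism concern above.
public class PerDiskFlushExecutors
{
    private final ExecutorService[] perDiskFlushExecutors;

    public PerDiskFlushExecutors(int dataDirectoryCount)
    {
        perDiskFlushExecutors = new ExecutorService[dataDirectoryCount];
        for (int i = 0; i < dataDirectoryCount; i++)
            perDiskFlushExecutors[i] = Executors.newSingleThreadExecutor();
    }
}
{code}
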
- If we have an imbalanced distribution of data, "rebalance disks" isn't going to do anything for us, so it seems misnamed; we are just formulaically reassigning the sstables based on {{token range size / # sstables}}.

- In CompactionManager.rebalanceDisks, we will throw an assertion error instead of returning null if the partitioner doesn't define splitting behavior.

- There is a TODO in CompactionStrategyManager.groupSSTablesForAntiCompaction; it doesn't look implemented, but we probably want to make sure that we keep each disk's group of sstables separate.

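Something like the following is what I have in mind for keeping the groups per disk (hypothetical sketch; {{diskIndex}} is a stand-in for however we map an sstable to its directory):

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: group sstables by the disk they live on before handing
// them to anticompaction, so each group stays within a single disk boundary.
public class GroupByDisk
{
    public static <S> Collection<List<S>> groupByDisk(Collection<S> sstables, Function<S, Integer> diskIndex)
    {
        Map<Integer, List<S>> groups = new HashMap<>();
        for (S sstable : sstables)
            groups.computeIfAbsent(diskIndex.apply(sstable), i -> new ArrayList<>()).add(sstable);
        return groups.values();
    }
}
{code}
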
- If we run a rebalance, it is possible that we'll move all of disk 1 to disk 2 before any of disk 2's sstables move somewhere else, causing out-of-space issues. It might be better to actively mix up the order of disks from which we pull sstables.

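Roughly what I mean by mixing up the order (hypothetical sketch; the per-disk move queues are stand-ins for whatever the rebalance code actually builds):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: instead of draining disk 1 completely before touching
// disk 2, interleave the moves round-robin across source disks so no single
// target disk has to absorb a whole disk's worth of sstables up front.
public class InterleavedMoves
{
    public static <T> List<T> interleave(List<Queue<T>> perDiskMoves)
    {
        List<T> ordered = new ArrayList<>();
        boolean progress = true;
        while (progress)
        {
            progress = false;
            for (Queue<T> moves : perDiskMoves)
            {
                T next = moves.poll();
                if (next != null)
                {
                    ordered.add(next);
                    progress = true;
                }
            }
        }
        return ordered;
    }
}
{code}
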
- In CompactionAwareWriter.getWriteDirectory, we aren't checking to make sure that we have enough disk space to run the compaction.

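A minimal sketch of the kind of check I mean, using plain {{java.io.File}} here; the real code would presumably go through Directories with the compaction's estimated output size:

{code:java}
import java.io.File;

// Hypothetical sketch: refuse a write location that does not have room for the
// estimated size of the compaction output, rather than finding out mid-write.
public final class WriteDirectoryCheck
{
    public static File checkSpace(File candidate, long estimatedWriteSize)
    {
        if (candidate.getUsableSpace() < estimatedWriteSize)
            throw new RuntimeException(String.format("Not enough space in %s for %d bytes",
                                                     candidate, estimatedWriteSize));
        return candidate;
    }
}
{code}
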
- Should we provide a default implementation of getMaximumToken, since this will be introduced mid-3.0 cycle, and mark it for removal at 4.0?

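For example, something like this (a simplified, hypothetical stand-in for IPartitioner, just to illustrate the idea): a default that throws keeps out-of-tree partitioners source-compatible mid-3.0, and the default gets removed at 4.0:

{code:java}
// Hypothetical, simplified stand-in for IPartitioner to illustrate the idea:
// a default method keeps third-party partitioners compiling mid-3.0, and the
// default (with its UnsupportedOperationException) is removed at 4.0.
public interface Partitioner<T extends Comparable<T>>
{
    /**
     * @return the largest token this partitioner can produce.
     * @deprecated implement this directly; the default will be removed in 4.0.
     */
    @Deprecated
    default T getMaximumToken()
    {
        throw new UnsupportedOperationException("This partitioner does not implement getMaximumToken()");
    }
}
{code}
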
- The added methods in SSTableTxnWriter don't seem to be used; they should be removed, or SSTableTxnWriter should inherit from SSTableMultiWriter.

- RangeAwareSSTableWriter should be called something else. For CASSANDRA-10540, we'll want to add a new one which splits the incoming stream by each token range; this one splits by disk assignment.

- RangeAwareSSTableWriter seems to be violating the transactional contract (there is also a TODO in commit): precommit and commit do nothing for the finished writers.

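What I'd expect is for the finished writers to be carried through the transactional methods, roughly like this (hypothetical sketch against a simplified interface, not the actual SSTableMultiWriter/Transactional API):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a range-aware writer that delegates the transactional
// calls to every writer it has already finished, not just the current one.
public class DelegatingWriter
{
    interface Transactional
    {
        Throwable commit(Throwable accumulate);
        Throwable abort(Throwable accumulate);
    }

    private final List<Transactional> finishedWriters = new ArrayList<>();
    private Transactional currentWriter;

    public Throwable commit(Throwable accumulate)
    {
        for (Transactional writer : finishedWriters)
            accumulate = writer.commit(accumulate);
        if (currentWriter != null)
            accumulate = currentWriter.commit(accumulate);
        return accumulate;
    }

    public Throwable abort(Throwable accumulate)
    {
        for (Transactional writer : finishedWriters)
            accumulate = writer.abort(accumulate);
        if (currentWriter != null)
            accumulate = currentWriter.abort(accumulate);
        return accumulate;
    }
}
{code}
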
- We should make sure that the partitioner has a splitter defined; otherwise we'll be creating a leveling where we are sending a huge number of sstables back to level 0 due to overlap.

nits:

- perDiskFlushExecutors should be initialized as an array of ExecutorService instead of JMXEnabledThreadPoolExecutor; there is also a formatting (spacing) nit there
- the "write sorted contents" log message should include the range, as we'll be writing the same memtable many times
- comment why we need maybeReload in CompactionStrategyManager.handleNotification
- comment how getCompactionStrategyIndex handles going from randomly partitioned sstables to sorting into the correct strategy - it took me a little while to realize that it should only fall into the binary search case if there is already data saved
- in CompactionAwareWriter, "panicLocation" needs a different name; maybe "defaultLocation" would be better since this is expected to happen on, for instance, system tables
- getMaximumToken needs a javadoc
- spurious line deleted in RandomPartitioner at L219


> Partition sstables by token range
> ---------------------------------
>
>                 Key: CASSANDRA-6696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>              Labels: compaction, correctness, dense-storage, jbod-aware-compaction, performance
>             Fix For: 3.2
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one and repair is run. 
> This can cause deleted data to come back in some cases. The same is true for corrupt sstables, where we delete the corrupt sstable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp was written 20 days back and successfully went to all three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 days back. 
> Since this tombstone is older than gc grace, it was purged on nodes A and B when it was compacted together with the actual data, so there is no trace of this row/column on nodes A and B.
> Now in node C, say the original data is in drive1 and the tombstone is in drive2. Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)