Posted to commits@cassandra.apache.org by "Yuki Morishita (JIRA)" <ji...@apache.org> on 2014/07/01 17:56:25 UTC

[jira] [Commented] (CASSANDRA-6455) Improve concurrency of repair process

    [ https://issues.apache.org/jira/browse/CASSANDRA-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048988#comment-14048988 ] 

Yuki Morishita commented on CASSANDRA-6455:
-------------------------------------------

Pushed latest version: https://github.com/yukim/cassandra/tree/6455-v3

bq. Seems the rebase lost CASSANDRA-3569 - we need to unregister from the FD once all validation messages have arrived.

Added.
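A minimal sketch of the bookkeeping restored above. It assumes a repair job registers with the failure detector while Merkle tree validations are outstanding, and unregisters once the last response arrives. `FailureDetectorRegistry` and `ValidationTracker` are hypothetical stand-ins, not Cassandra's actual failure-detector API:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a failure-detector listener registry (not Cassandra's real interface).
interface FailureDetectorRegistry {
    void register(Object listener);
    void unregister(Object listener);
}

// Hypothetical: tracks outstanding validation responses for one repair job.
class ValidationTracker {
    private final FailureDetectorRegistry fd;
    private final Set<String> pending;
    private boolean registered;

    ValidationTracker(FailureDetectorRegistry fd, Set<String> endpoints) {
        this.fd = fd;
        this.pending = new HashSet<>(endpoints);
    }

    // Start watching for replica failures while validations are outstanding.
    synchronized void start() {
        fd.register(this);
        registered = true;
    }

    // Records one endpoint's validation response; returns true when the last one arrives.
    synchronized boolean onValidationComplete(String endpoint) {
        if (!pending.remove(endpoint) || !pending.isEmpty())
            return false;
        if (registered) {
            fd.unregister(this); // all validations are in, so stop listening to the FD
            registered = false;
        }
        return true;
    }
}
```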

bq. We should probably cap how big X we can have in -j X - really easy to OOM the nodes involved if you put a big X in.

Right. I thought about what the cap should be and settled on 4, because we don't want to push the nodes too hard anyway. I also updated the command option description to make this clear.
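A minimal sketch of bounding the `-j X` value with the cap of 4 mentioned above. The class and method names are hypothetical, and clamping is only one possible policy; the actual option handling in the branch may instead reject values above the cap:

```java
// Hypothetical helper: bounds the user-supplied "-j X" job-thread count.
final class RepairParallelism {
    static final int MAX_JOB_THREADS = 4; // cap chosen in the comment above

    private RepairParallelism() {}

    static int effectiveJobThreads(int requested) {
        if (requested < 1)
            throw new IllegalArgumentException("-j must be at least 1, got " + requested);
        // Clamp so a large X cannot start enough concurrent jobs to OOM the nodes involved.
        return Math.min(requested, MAX_JOB_THREADS);
    }
}
```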

bq. Should we make the taskExecutor in RepairSession static?

The reason I made taskExecutor local to each RepairSession instance is so that all submitted tasks can be cancelled when the session fails. I left this as is in the latest version.
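A minimal sketch of why a per-session executor makes cancellation simple, using only `java.util.concurrent`. `SketchRepairSession` is a hypothetical, heavily simplified stand-in for RepairSession:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, simplified session: each instance owns its executor, so failing one
// session cancels only that session's tasks and leaves other sessions untouched.
class SketchRepairSession {
    private final ExecutorService taskExecutor;
    private volatile boolean failed;

    SketchRepairSession(int jobThreads) {
        this.taskExecutor = Executors.newFixedThreadPool(jobThreads);
    }

    Future<?> submit(Runnable task) {
        return taskExecutor.submit(task);
    }

    void fail() {
        failed = true;
        // shutdownNow() interrupts running tasks and returns queued ones that never started;
        // cancel those too so callers blocked on their Futures are released.
        for (Runnable queued : taskExecutor.shutdownNow()) {
            if (queued instanceof Future)
                ((Future<?>) queued).cancel(false);
        }
    }

    boolean isFailed() {
        return failed;
    }
}
```

With a static executor shared by all sessions, `shutdownNow()` would presumably take down other sessions' work as well, so cancelling one session would need per-task tracking instead.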

bq. Why do we add ourselves as a no-op StreamEventHandler in LocalSyncTask/StreamingRepairTask when creating the StreamPlan?

handleStreamEvent, as well as onSuccess/onFailure, is part of StreamEventHandler. We only need to handle success/failure, not other events like progress, so handleStreamEvent is left as a no-op.
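A minimal sketch of the pattern described above: one handler interface carries both the per-event callback and the completion callbacks, and the repair task implements the event callback as a no-op. `SketchStreamEventHandler` and `SketchSyncTask` are hypothetical simplifications; the real StreamEventHandler works with StreamEvent/StreamState objects rather than strings:

```java
// Hypothetical stand-in: one interface with both the event callback and the completion callbacks.
interface SketchStreamEventHandler {
    void handleStreamEvent(String event); // progress, session prepared/complete, ...
    void onSuccess(String streamState);
    void onFailure(Throwable t);
}

// Hypothetical repair task that only cares about the final outcome of the stream.
class SketchSyncTask implements SketchStreamEventHandler {
    private volatile String outcome = "pending";

    // Intermediate events such as progress are irrelevant here: deliberately a no-op.
    @Override
    public void handleStreamEvent(String event) {}

    @Override
    public void onSuccess(String streamState) {
        outcome = "synced";
    }

    @Override
    public void onFailure(Throwable t) {
        outcome = "failed: " + t.getMessage();
    }

    String outcome() {
        return outcome;
    }
}
```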

> Improve concurrency of repair process
> -------------------------------------
>
>                 Key: CASSANDRA-6455
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6455
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Yuki Morishita
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: 6455-3.0.txt, 6455.txt
>
>
> Currently, most of the repair tasks (taking snapshots, sending/receiving Merkle trees, computing MT differences, etc.) are done on the single-threaded AntiEntropyStage.
> This causes problems like CASSANDRA-6415 and is likely to cause unnecessary waiting.
> Also, repair is done one CF at a time. I think we can parallelize this (with concurrency configurable by the user based on the number of CFs and the load on the nodes) for faster processing.
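A minimal sketch of the proposed change: repair several column families concurrently on a bounded pool instead of one CF at a time on a single thread. `ParallelCfRepair` and the `repairOneCf` callback are hypothetical placeholders for the real per-CF repair job:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: run one repair job per CF, at most `concurrency` at a time.
final class ParallelCfRepair {
    private ParallelCfRepair() {}

    static List<String> repairAll(List<String> columnFamilies, int concurrency,
                                  Function<String, String> repairOneCf) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String cf : columnFamilies)
                futures.add(CompletableFuture.supplyAsync(() -> repairOneCf.apply(cf), pool));
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> f : futures)
                results.add(f.join()); // join() rethrows a task failure as an unchecked CompletionException
            return results; // in submission order, regardless of completion order
        } finally {
            pool.shutdown();
        }
    }
}
```

A larger pool finishes many CFs sooner, but every concurrent job adds validation and streaming load on the replicas, which is the reason for bounding the concurrency.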



--
This message was sent by Atlassian JIRA
(v6.2#6252)