Posted to issues@hbase.apache.org by "Ashu Pachauri (JIRA)" <ji...@apache.org> on 2017/06/08 06:22:18 UTC

[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

     [ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashu Pachauri updated HBASE-18192:
----------------------------------
    Attachment: HBASE-18192.branch-1.3.001.patch

> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
>         Attachments: HBASE-18192.branch-1.3.001.patch
>
>
> When a recovered queue has only one active ReplicationSourceWorkerThread, the recovered queue is completely dropped on a region server shutdown. This will happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to unhandled exceptions and only one worker finishes; this causes the recovered queue to be deleted even without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker thread to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, yet it is used as a proxy for whether the worker has finished its work. But "Should a worker exit?" and "Has the worker completed its work?" are two different questions. With only one worker, a region server shutdown makes isWorkerActive() return false, the run loop exits mid-queue, the check over workerThreads finds no other live worker, and closeRecoveredQueue() deletes the recovered queue even though it still contains unreplicated WALs.
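> A minimal self-contained sketch of the distinction (hypothetical names, not the actual patch): track "finished the queue" separately from "asked to stop", and delete the recovered queue only when every worker actually finished its work.
> {code}
> // Hypothetical sketch, not the HBASE-18192 patch: separate "asked to stop"
> // (shouldRun) from "drained its queue" (finishedQueue), and base the
> // queue-deletion decision on completion rather than on thread liveness.
> import java.util.ArrayList;
> import java.util.List;
>
> class RecoveredQueueWorker extends Thread {
>   private volatile boolean shouldRun = true;       // cleared on region server shutdown
>   private volatile boolean finishedQueue = false;  // set only when the queue is fully drained
>   private int remainingEntries;                    // stand-in for WAL entries left to ship
>
>   RecoveredQueueWorker(int entries) { this.remainingEntries = entries; }
>
>   void stopWorker() { shouldRun = false; }
>
>   boolean finishedQueue() { return finishedQueue; }
>
>   @Override
>   public void run() {
>     while (shouldRun && remainingEntries > 0) {
>       remainingEntries--;                          // "replicate" one entry
>     }
>     if (remainingEntries == 0) {
>       finishedQueue = true;                        // the work is genuinely complete
>     }
>   }
>
>   // Cleanup keys off completion, not isAlive()/isWorkerActive():
>   static boolean safeToDeleteQueue(List<RecoveredQueueWorker> workers) {
>     for (RecoveredQueueWorker w : workers) {
>       if (!w.finishedQueue()) {
>         return false;                              // a worker exited early; keep the queue
>       }
>     }
>     return true;
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     List<RecoveredQueueWorker> workers = new ArrayList<>();
>     RecoveredQueueWorker only = new RecoveredQueueWorker(1000);
>     workers.add(only);
>     only.stopWorker();   // shutdown requested before the queue is drained
>     only.start();
>     only.join();
>     // Prints "false": the worker exited but never finished, so the queue must survive.
>     System.out.println("safe to delete queue: " + safeToDeleteQueue(workers));
>   }
> }
> {code}
> The actual fix may take a different shape; the point is only that the decision to call closeRecoveredQueue() should answer "has every worker completed its work?" rather than "is any other worker still alive?".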



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)