Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2018/03/21 22:09:14 UTC
[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-18192:
--------------------------
Fix Version/s: (was: 3.0.0)
> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2, 2.0.0
>
> Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on a region server shutdown. This can happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example, HBASE-18137).
> 3. All other replication workers have died due to unhandled exceptions, and only one finishes. This causes the recovered queue to be deleted without a region server shutdown. It can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker thread to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But, in fact, "Should a worker exit?" and "Has a worker completed its work?" are two different questions.
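The distinction can be sketched with a hypothetical worker (the class and method names here are illustrative, not the actual HBase API): tracking "should I keep running?" and "did I drain my queue?" as separate flags means a worker that was asked to stop mid-queue never looks like a worker that finished.

```java
// Hypothetical sketch (not the actual HBase code): a worker that tracks
// "should exit" and "work complete" as two independent pieces of state.
class WorkerState {
    private volatile boolean active = true;        // should the worker keep running?
    private volatile boolean workComplete = false; // has the queue been fully drained?

    // Ask the worker to stop, e.g. on region server shutdown.
    void stop() { active = false; }

    boolean isActive() { return active; }

    // Replicate up to queueSize entries, stopping early if asked to exit.
    void run(int queueSize) {
        int replicated = 0;
        while (isActive() && replicated < queueSize) {
            replicated++; // replicate one entry
        }
        // Mark the work complete only if the queue was actually drained,
        // not merely because the run loop exited.
        workComplete = (replicated == queueSize);
    }

    boolean isWorkComplete() { return workComplete; }
}
```

Under a scheme like this, a recovered queue would be removed only when every worker reports isWorkComplete(), rather than when its thread is simply no longer alive.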
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)