Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2016/12/21 18:13:58 UTC

[jira] [Comment Edited] (HBASE-17341) Add a timeout during replication endpoint termination

    [ https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767723#comment-15767723 ] 

Andrew Purtell edited comment on HBASE-17341 at 12/21/16 6:13 PM:
------------------------------------------------------------------

Since Ted committed this I will pick to 0.98 now.

I missed it if there was an announcement that branch-1.3 is closed. I committed another of Vincent's replication fixes there yesterday. We should probably commit this one too now that the deed has been done. 


was (Author: apurtell):
Since Ted committed this I will pick to 0.98 now and resolve. 

I missed it if there was an announcement that branch-1.3 is closed. I committed another of Vincent's replication fixes there yesterday. We should probably commit this one too now that the deed has been done. 

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-17341.branch-1.1.v1.patch, HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch, HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from ReplicationEndpoint#stop().  Future.get() is then called with no timeout, so it can hang indefinitely if something goes wrong in the endpoint's stop().
> Hanging there has serious implications, because the blocked thread could be the ZK event thread (e.g. a watcher calls ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> blocked).  No other events in the ZK event queue would then get processed, which for HBase means that other ZK watches, such as replication watch notifications, snapshot watch notifications, and even RegionServer shutdown, would all be blocked.
> The short-term fix addressed here is simply to add a timeout to Future.get().  However, the severity of the consequences seen here suggests that a broader refactoring of the ZKWatcher usage in HBase may be in order, to protect against situations like this.
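
For illustration, the short-term fix described above can be sketched with the standard java.util.concurrent API. This is a minimal sketch and not the attached patch: the class name BoundedTerminate, the helper awaitStop(), and the 100 ms timeout are all hypothetical, and a plain CompletableFuture stands in for the Future returned by the endpoint's stop().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedTerminate {

    // Hypothetical helper: waits for the stop future, but never longer than
    // timeoutMs. Returns true only if the stop completed normally in time.
    static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            // Bounded wait instead of the unbounded Future.get().
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stop did not finish in time; give up so the calling
            // thread (possibly an event thread) is released.
            return false;
        } catch (ExecutionException e) {
            // The stop itself failed; the caller is still released.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for code further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A future that is never completed models a hung endpoint stop.
        Future<?> stuck = new CompletableFuture<Void>();
        // An already completed future models a clean stop.
        Future<?> done = CompletableFuture.completedFuture(null);

        System.out.println(awaitStop(stuck, 100)); // prints "false" after about 100 ms
        System.out.println(awaitStop(done, 100));  // prints "true"
    }
}
```

With the bounded wait, a hung stop costs the caller at most the timeout rather than blocking it forever. What the caller should do after a timeout (log, retry, abandon the endpoint) is a separate design decision that this sketch does not address.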


