Posted to issues@hbase.apache.org by "Bharath Vissapragada (Jira)" <ji...@apache.org> on 2021/05/06 00:24:00 UTC

[jira] [Updated] (HBASE-25741) Deadlock during peer cleanup with NoNodeException

     [ https://issues.apache.org/jira/browse/HBASE-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Vissapragada updated HBASE-25741:
-----------------------------------------
    Summary: Deadlock during peer cleanup with NoNodeException  (was: Replication Source still having the replication metrics for peer ID which doesn't exist.)

> Deadlock during peer cleanup with NoNodeException
> -------------------------------------------------
>
>                 Key: HBASE-25741
>                 URL: https://issues.apache.org/jira/browse/HBASE-25741
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.0
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>              Labels: regression
>             Fix For: 1.7.0
>
>
> We have observed that replication source metrics for a peer still exist on some region servers even though the peer has been removed. This happens because, when ReplicationSource encounters a NoNodeException, it calls the `peerRemoved` workflow, which should eventually terminate the source and remove it from the source manager. The problem is that the ReplicationSource thread terminates itself, so the peer removal never completes and the source's metrics are left behind forever.
>
> The flow is as follows: the replication source tries to clean WALs [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L801]; on NoNodeException it calls [peerRemoved|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L244] and terminates the source (itself), leaving the terminated source in the source manager without clearing its [metrics|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L645].
>  
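The self-termination problem described above can be illustrated with a small, self-contained Java sketch. This is not HBase code: the `Manager` class, the `peerRemoved` signature, and the join timeout are hypothetical stand-ins. It also assumes that the cleanup path waits for the worker thread to exit before clearing metrics, which the report does not confirm. The point is only that a thread can never observe its own exit, so cleanup that waits on the calling thread cannot finish. Handing the cleanup to another thread is shown as one possible remedy, not necessarily the fix adopted for this issue.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SelfTerminationSketch {

    /** Hypothetical stand-in for a source manager: the set holds peers that still have metrics. */
    static final class Manager {
        final Set<String> metrics = ConcurrentHashMap.newKeySet();

        /** Waits for the worker to exit, then clears its metrics. Returns true if cleanup ran. */
        boolean peerRemoved(String peerId, Thread worker, long joinTimeoutMs)
                throws InterruptedException {
            // If the caller IS the worker, this wait can only end by timing out.
            // The timeout exists so the sketch terminates; an untimed join would hang forever.
            worker.join(joinTimeoutMs);
            if (worker.isAlive()) {
                return false; // cleanup skipped, metrics stay behind
            }
            metrics.remove(peerId);
            return true;
        }
    }

    /** Runs one scenario and reports whether the peer's metrics were cleared. */
    static boolean metricsCleared(boolean cleanupOnWorkerThread) {
        Manager manager = new Manager();
        String peerId = "peer-1"; // hypothetical peer ID
        manager.metrics.add(peerId);

        Thread[] cleaner = new Thread[1];
        Thread worker = new Thread(() -> {
            // Imagine a NoNodeException was just caught on this thread.
            Thread self = Thread.currentThread();
            Runnable cleanup = () -> {
                try {
                    manager.peerRemoved(peerId, self, 1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            if (cleanupOnWorkerThread) {
                cleanup.run(); // worker waits on itself
            } else {
                cleaner[0] = new Thread(cleanup); // hand cleanup to another thread
                cleaner[0].start();
            }
        });

        try {
            worker.start();
            worker.join();
            if (cleaner[0] != null) {
                cleaner[0].join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !manager.metrics.contains(peerId);
    }

    public static void main(String[] args) {
        System.out.println("self-cleanup cleared metrics: " + metricsCleared(true));
        System.out.println("handed-off cleanup cleared metrics: " + metricsCleared(false));
    }
}
```

In the first scenario the worker waits on itself, the wait times out, and the metrics entry remains. In the second, the worker returns right after handing off the cleanup, so the cleaner thread sees it exit and removes the entry.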



--
This message was sent by Atlassian Jira
(v8.3.4#803005)