Posted to issues@hbase.apache.org by "churro morales (JIRA)" <ji...@apache.org> on 2016/06/28 20:04:57 UTC

[jira] [Commented] (HBASE-16138) Cannot open regions after non-graceful shutdown due to deadlock with Replication Table

    [ https://issues.apache.org/jira/browse/HBASE-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353625#comment-15353625 ] 

churro morales commented on HBASE-16138:
----------------------------------------

The cluster shutdown situation currently has problems as well.  When you shut down the cluster today, you need to make sure the Replication Table is the last thing to go down.  I think both the cluster startup and the cluster shutdown situations need more discussion and design.  Now that we have more system tables in hbase land, maybe we should have a larger discussion about design and what these tables mean to the reliability of hbase.

Side note: we already have a SystemTableWALEntryFilter that removes these entries from replication, so maybe #2 might not be so bad after all.  But with more and more system tables popping up, maybe this particular problem casts a wider net than just this feature.
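For anyone not familiar with the filter, the idea behind SystemTableWALEntryFilter is just "drop WAL entries whose table is a system table."  A simplified, self-contained sketch of that idea (plain String stands in for HBase's TableName/WAL.Entry types, and the namespace check is an assumption here, not the real API):

```java
// Simplified sketch of a WAL entry filter that drops system-table entries.
// Plain String stands in for HBase's TableName; this is NOT the real
// WALEntryFilter API, just an illustration of the filtering idea.
public class SystemTableFilterSketch {

    // Assumption for this sketch: system tables live in the "hbase"
    // namespace, e.g. "hbase:meta" or "hbase:replication".
    static boolean isSystemTable(String tableName) {
        return tableName.startsWith("hbase:");
    }

    // Mirrors the shape of WALEntryFilter#filter: returning null means
    // the entry is dropped and never shipped to the peer cluster.
    static String filter(String entryTableName) {
        return isSystemTable(entryTableName) ? null : entryTableName;
    }

    public static void main(String[] args) {
        System.out.println(filter("hbase:replication")); // dropped -> null
        System.out.println(filter("usertable"));         // kept -> usertable
    }
}
```

With a filter like this in the replication chain, a dedicated system-table WAL (option #2) would never need to push anything through the Replication Table, which is what breaks the circular dependency.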



> Cannot open regions after non-graceful shutdown due to deadlock with Replication Table
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-16138
>                 URL: https://issues.apache.org/jira/browse/HBASE-16138
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Joseph
>            Assignee: Joseph
>            Priority: Critical
>
> If we shutdown an entire HBase cluster and attempt to start it back up, we have to run the WAL pre-log roll that occurs before opening up a region. Yet this pre-log roll must record the new WAL inside of ReplicationQueues. This method call ends up blocking on TableBasedReplicationQueues.getOrBlockOnReplicationTable(), because the Replication Table is not up yet. And we cannot assign the Replication Table because we cannot open any regions. This ends up deadlocking the entire cluster whenever we lose Replication Table availability. 
> There are a few options that we can do, but none of them seem very good:
> 1. Depend on Zookeeper-based Replication until the Replication Table becomes available
> 2. Have a separate WAL for System Tables that does not perform any replication
> 3. Record the WAL log in the ReplicationQueue asynchronously (don't block opening a region on this event), which could lead to inconsistent Replication state
> Does anyone have any suggestions/ideas/feedback?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)