Posted to dev@hbase.apache.org by "churro morales (JIRA)" <ji...@apache.org> on 2015/07/21 22:07:04 UTC

[jira] [Created] (HBASE-14129) If any regionserver gets shutdown uncleanly during full cluster restart, locality looks to be lost

churro morales created HBASE-14129:
--------------------------------------

             Summary: If any regionserver gets shutdown uncleanly during full cluster restart, locality looks to be lost
                 Key: HBASE-14129
                 URL: https://issues.apache.org/jira/browse/HBASE-14129
             Project: HBase
          Issue Type: Bug
            Reporter: churro morales


We did a full cluster restart the other day, and some regionservers did not shut down cleanly.  After the restart our locality went from 99% to 5%.  Looking at the code, AssignmentManager.joinCluster() calls AssignmentManager.processDeadServersAndRegionsInTransition().
If the failover flag gets set for any reason, it seems we don't call assignAllUserRegions().  It then looks like the balancer does the work of assigning those regions.  We don't use a locality-aware balancer, so we lost our region locality.

I don't have a solid grasp of the reasoning behind these checks, but there are some potential workarounds:

1. After shutting down your cluster, move your WALs aside (replay later).  
2. Clean up your zNodes 

That seems to work, but requires a lot of manual labor.  Another solution, which I prefer, would be to add a flag to the start script: ./start-hbase.sh --clean

If we start the master with that flag, we add a check in AssignmentManager.processDeadServersAndRegionsInTransition(): if the flag is set, we call assignAllUserRegions() regardless of the failover state.
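
A minimal sketch of the proposed check, assuming I am reading the startup logic correctly.  This is not HBase code: the class StartupAssignmentSketch, the Action enum, and the cleanStart parameter are hypothetical names used only to illustrate the decision.  Only the idea of a failover flag and the assignAllUserRegions() call come from the description above.

```java
public class StartupAssignmentSketch {

    /** Hypothetical outcomes of the startup decision (illustration only). */
    public enum Action {
        /** Corresponds to calling assignAllUserRegions(). */
        ASSIGN_ALL_USER_REGIONS,
        /** Bulk assignment is skipped; regions are placed by other means. */
        SKIP_BULK_ASSIGNMENT
    }

    /**
     * Decide what to do at master startup.
     *
     * @param failover   whether the failover flag was set during startup
     * @param cleanStart hypothetical flag, set when the master is started with --clean
     */
    public static Action decide(boolean failover, boolean cleanStart) {
        // Behaviour as described above: a set failover flag skips the bulk assignment.
        // Proposed change: an explicit clean start overrides the failover state.
        if (!failover || cleanStart) {
            return Action.ASSIGN_ALL_USER_REGIONS;
        }
        return Action.SKIP_BULK_ASSIGNMENT;
    }

    public static void main(String[] args) {
        // Print the decision for every combination of the two flags.
        boolean[] values = {false, true};
        for (boolean failover : values) {
            for (boolean cleanStart : values) {
                System.out.println("failover=" + failover
                        + " cleanStart=" + cleanStart
                        + " -> " + decide(failover, cleanStart));
            }
        }
    }
}
```

The only case that changes under the proposal is failover set together with a clean start; without the flag the behaviour stays as it is today.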

I have a patch for the latter solution, assuming I am understanding the logic correctly.


