Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/12/24 20:41:13 UTC

[jira] [Resolved] (HBASE-2486) Add simple "anti-entropy" for region assignment

     [ https://issues.apache.org/jira/browse/HBASE-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-2486.
-----------------------------------
      Resolution: Incomplete
        Assignee:     (was: Eugene Koontz)
    Release Note:   (was: Adds a new property, "hbase.master.sanitychecking", which determines how the master handles the case where it believes a region is hosted by a certain regionserver, but that regionserver indicates, by throwing a 'No Such Region' exception, that it does not serve that region:

lax - mark region as unassigned and continue
paranoid - shut down master)

Incomplete, dead issue; superseded by one or two master rewrites by now.

> Add simple "anti-entropy" for region assignment
> -----------------------------------------------
>
>                 Key: HBASE-2486
>                 URL: https://issues.apache.org/jira/browse/HBASE-2486
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.5
>            Reporter: Todd Lipcon
>              Labels: moved_from_0_20_5
>         Attachments: hbase2486.diff, hbase2486.diff
>
>
> We've seen a number of bugs where a region server thinks it should not be serving a region, but the master and META think it should be. I'd like to propose a very simple way of fixing this issue:
> 1) whenever a regionserver throws a NotServingRegionException, it also marks that region id in an RS-wide Set
> 2) when a regionserver sends a heartbeat, include a message for each of these regions, MSG_REPORT_NSRE or somesuch, and then clear the set
> 3) when the master receives MSG_REPORT_NSRE, it does the following checks:
> a) if the region is assigned elsewhere according to META, the NSRE was due to a stale client, ignore
> b) if the region is in transition, ignore
> c) otherwise, we have an inconsistency, and we should take some steps to resolve it (e.g., mark the region unassigned, or exit the master if we are in "paranoid mode")
> Whatever we do, we need to make sure that this is loudly logged, and causes unit tests to fail, when it's detected. This should *not* happen, but when it does, it would be good to recover without addtable.rb, etc.
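
For illustration only, steps 1-3 of the proposal could be sketched roughly as below. This is not the attached hbase2486.diff and does not use any real HBase classes: every type and method name here (AntiEntropySketch, RegionServer, Master, recordNsre, drainForHeartbeat, handleNsreReport, the Action values) is hypothetical, META is modelled as a plain in-memory map, and the "lax"/"paranoid" modes are taken from the withdrawn release note. The sketch also assumes that a region with no entry in META falls under case (c).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposal; none of these types exist in HBase.
final class AntiEntropySketch {

    /** The two modes named in the withdrawn release note. */
    enum SanityMode { LAX, PARANOID }

    /** Possible outcomes of the master's check for one reported region. */
    enum Action { IGNORE_STALE_CLIENT, IGNORE_IN_TRANSITION, MARK_UNASSIGNED, SHUT_DOWN_MASTER }

    /** Regionserver side: steps 1 and 2. */
    static final class RegionServer {
        final String name;
        private final Set<String> nsreRegions = new HashSet<>();

        RegionServer(String name) {
            this.name = name;
        }

        // Step 1: called wherever a NotServingRegionException would be thrown.
        synchronized void recordNsre(String regionId) {
            nsreRegions.add(regionId);
        }

        // Step 2: on heartbeat, hand over one entry per recorded region, then clear the set.
        synchronized List<String> drainForHeartbeat() {
            List<String> report = new ArrayList<>(nsreRegions);
            nsreRegions.clear();
            return report;
        }
    }

    /** Master side: step 3. */
    static final class Master {
        // Stand-in for META: region id -> name of the server it is assigned to.
        final Map<String, String> metaAssignments = new HashMap<>();
        final Set<String> regionsInTransition = new HashSet<>();
        private final SanityMode mode;

        Master(SanityMode mode) {
            this.mode = mode;
        }

        Action handleNsreReport(String reportingServer, String regionId) {
            String assignedTo = metaAssignments.get(regionId);
            // 3a: META places the region elsewhere, so the NSRE came from a stale client.
            if (assignedTo != null && !assignedTo.equals(reportingServer)) {
                return Action.IGNORE_STALE_CLIENT;
            }
            // 3b: the region is in transition, so a brief mismatch is expected.
            if (regionsInTransition.contains(regionId)) {
                return Action.IGNORE_IN_TRANSITION;
            }
            // 3c: genuine inconsistency; log loudly, then act according to the mode.
            System.err.println("INCONSISTENCY: " + reportingServer
                    + " reports it is not serving region " + regionId);
            if (mode == SanityMode.PARANOID) {
                return Action.SHUT_DOWN_MASTER; // the caller would exit the master
            }
            metaAssignments.remove(regionId); // lax: mark the region unassigned
            return Action.MARK_UNASSIGNED;
        }
    }
}
```

In this sketch the master returns an Action rather than exiting directly, so a unit test can assert on the outcome and fail when an inconsistency is detected, which is the behaviour the description asks for.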



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)