Posted to issues@hbase.apache.org by "Mikhail Bautin (Created) (JIRA)" <ji...@apache.org> on 2012/02/07 03:50:59 UTC

[jira] [Created] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

[89-fb] Scan unassigned region directory on master failover
-----------------------------------------------------------

                 Key: HBASE-5344
                 URL: https://issues.apache.org/jira/browse/HBASE-5344
             Project: HBase
          Issue Type: Bug
            Reporter: Mikhail Bautin
            Assignee: Mikhail Bautin


In case the master dies after a regionserver writes region state as OPENED or CLOSED in ZK but before the update is received by master and written to meta, the new master that comes up has to pick up the region state from ZK and write it to meta. Otherwise we can get multiply-assigned regions.
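
The recovery step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the 89-fb patch: plain in-memory maps stand in for the ZK unassigned directory and for the meta table, and the class, enum, and method names are hypothetical. It also assumes that a CLOSED state simply clears the region's assignment.

```java
import java.util.Map;

/** Hypothetical sketch; none of these names come from the 89-fb patch. */
final class FailoverReconcileSketch {
  /** Region states a regionserver may have recorded in ZK. */
  enum ZkState { OPENED, CLOSED }

  /** One entry read from the (simulated) unassigned directory. */
  static final class ZkEntry {
    final ZkState state;
    final String server;

    ZkEntry(ZkState state, String server) {
      this.state = state;
      this.server = server;
    }
  }

  /**
   * Applies every unassigned-directory entry to the meta map (region -> server).
   * OPENED records the hosting server; CLOSED clears the assignment (assumption).
   * Returns the number of meta rows that changed.
   */
  static int reconcile(Map<String, ZkEntry> unassigned, Map<String, String> meta) {
    int changed = 0;
    for (Map.Entry<String, ZkEntry> e : unassigned.entrySet()) {
      String region = e.getKey();
      ZkEntry entry = e.getValue();
      if (entry.state == ZkState.OPENED) {
        String previous = meta.put(region, entry.server);
        if (!entry.server.equals(previous)) {
          changed++;
        }
      } else if (meta.remove(region) != null) {
        changed++;
      }
    }
    return changed;
  }
}
```

Running such a pass before the new master starts assigning regions is what would keep a region that a regionserver already reported as OPENED from being handed out a second time.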

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Mikhail Bautin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Bautin resolved HBASE-5344.
-----------------------------------

    Resolution: Fixed

Ian: yes, I am closing this JIRA. The patch was committed to the 89-fb branch. There is no other JIRA to port this feature to trunk yet (feel free to create one if you are interested), and due to significant differences between the master code in 0.89-fb and trunk, we will not focus on porting this feature to trunk at this time. I think it would be more appropriate for someone more experienced with the trunk master code to do the port (or maybe it would be easier to reimplement the same idea on trunk).
                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209128#comment-13209128 ] 

Phabricator commented on HBASE-5344:
------------------------------------

Karthik has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  @stack -

  << So, the '/unassigned' dir will change to be named '/regions' or some such? On region open, we'll update its state in zk to be OPENED? And leave it there (and update .META. too). We don't really need .META. then? Smile. >>

  Not totally... the idea is to change /unassigned to /regions/<tablename> and it will not be deleted on the region being opened. This will only track assignment of regions though. We still need meta to figure out start/stop keys and other information like preferred regionservers for the regions. In order to truly eliminate meta, we should be willing to store more permanent data in ZK, but that's probably not needed unless we are going all the way to rip out root and meta special-casing, which would be a huge change.

  << "We would like to rely on ZK and (for now) on META instead to recover the region assignment on master startup/failure." The hard part in here is on failover, what if the .META. is on a crashed server? You'll need to process its logs and get .META. back online before you can proceed w/ failover. To get it online, you'll need to listen for events (though I suppose you could filter and only process .META. events). Or what if the server carrying .META. crashes during onlining? >>

  The idea here is that the master can read unassigned (while continuing to queue events). It will split all logs (via distributed log splitting) and process opening of the meta region out of band (not go through the standard heartbeat based assignment - we do this for root upon startup, we need something similar for meta).


REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Ian Varley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444107#comment-13444107 ] 

Ian Varley commented on HBASE-5344:
-----------------------------------

Hey Mikhail, I see that the original phabricator review was abandoned, and that there's a new one (D2085) that shows as committed on 8/5 (presumably to the 89-fb branch). Should this JIRA be closed? Is there another JIRA to cover porting the work in review D2085 to trunk? (I searched and couldn't find one, but maybe it's under another name.)
                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208196#comment-13208196 ] 

Phabricator commented on HBASE-5344:
------------------------------------

mbautin has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java:254 Yes, that is to figure out whether it is fresh cluster startup. We do that by scanning the regionservers directory in ZK and checking if it is empty. I will try to unify cluster startup and failover cases in further iterations.
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:171 Makes a lot of sense. Thanks!
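
The fresh-startup check mentioned for HMaster.java:254 (scan the regionservers directory in ZK and see whether it is empty) could look roughly like the sketch below. The ChildLister interface, the class name, and the "/hbase/rs" path used in the usage note are hypothetical placeholders. In a real master the listing would presumably come from the ZooKeeper client, for example its getChildren(path, watch) call.

```java
import java.util.List;

/** Hypothetical sketch of the "is this a fresh cluster startup?" check. */
final class ClusterStartupCheckSketch {
  /** Stand-in for listing the children of a znode (an assumption, not an HBase API). */
  interface ChildLister {
    List<String> children(String znodePath);
  }

  /** A cluster counts as freshly started when no regionserver znodes exist. */
  static boolean isFreshClusterStartup(ChildLister zk, String regionServersDir) {
    List<String> regionServers = zk.children(regionServersDir);
    return regionServers == null || regionServers.isEmpty();
  }
}
```

For example, isFreshClusterStartup(path -> java.util.Collections.emptyList(), "/hbase/rs") would report a fresh startup, while any non-empty listing would be treated as a failover.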

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266204#comment-13266204 ] 

Phabricator commented on HBASE-5344:
------------------------------------

mbautin has abandoned the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  Abandoning this revision. The useful part of this will be included in the full master failover diff (https://reviews.facebook.net/D2085).

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Updated] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HBASE-5344:
-------------------------------

    Attachment: D1605.3.patch

mbautin updated the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".
Reviewers: Kannan, Karthik, Liyin, JIRA, stack

  Attached the wrong diff to this revision. Restoring its old state rebased on recent changes.

REVISION DETAIL
  https://reviews.facebook.net/D1605

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/executor/RegionTransitionEventData.java
  src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
  src/main/java/org/apache/hadoop/hbase/master/BaseScanner.java
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java
  src/main/java/org/apache/hadoop/hbase/master/RegionManager.java
  src/main/java/org/apache/hadoop/hbase/master/RootScanner.java
  src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java
  src/main/java/org/apache/hadoop/hbase/master/handler/MasterOpenRegionHandler.java
  src/test/java/org/apache/hadoop/hbase/master/TestRegionStateOnMasterFailure.java

                

[jira] [Updated] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HBASE-5344:
-------------------------------

    Attachment: D1605.2.patch

mbautin updated the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".
Reviewers: Kannan, Karthik, Liyin, JIRA, stack

  Deleting a bunch of files that should be deleted in this patch and fixing a compilation error in TestMiniClusterLoadSequential.

REVISION DETAIL
  https://reviews.facebook.net/D1605

AFFECTED FILES
  pom.xml
  src/main/java/org/apache/hadoop/hbase/EmptyWatcher.java
  src/test/java/org/apache/hadoop/hbase/EmptyWatcher.java
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  src/main/java/org/apache/hadoop/hbase/util/AbstractHBaseTool.java
  src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
  src/test/java/org/apache/hadoop/hbase/manual/HBaseTest.java
  src/test/java/org/apache/hadoop/hbase/manual/RestartMetaTest.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/HBaseUtils.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/KillProcessesAndVerify.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/MultiThreadedAction.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/MultiThreadedReader.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/MultiThreadedWriter.java
  src/test/java/org/apache/hadoop/hbase/manual/utils/ProcessBasedLocalHBaseCluster.java
  src/test/java/org/apache/hadoop/hbase/util/LoadTestTool.java
  src/test/java/org/apache/hadoop/hbase/util/MultiThreadedAction.java
  src/test/java/org/apache/hadoop/hbase/util/MultiThreadedReader.java
  src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java
  src/test/java/org/apache/hadoop/hbase/util/ProcessBasedLocalHBaseCluster.java
  src/test/java/org/apache/hadoop/hbase/util/RestartMetaTest.java
  src/test/java/org/apache/hadoop/hbase/util/TestLoadTestKVGenerator.java
  src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadEncoded.java
  src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadParallel.java
  src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208018#comment-13208018 ] 

Phabricator commented on HBASE-5344:
------------------------------------

stack has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  What's the state of this patch, Mikhail?  Are you going to apply it to 0.89fb?  If it goes into 0.89fb, I'd then like to forward port it.  It looks like it could take care of some trunk issues we see.

  Is it possible that querying the regionservers would return state that is different to what is up in .META.? (I suppose if it does, we have bigger issues?)

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:56 Should get via Configuration?
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:68 This does not do retries (and it looks like down in the code you are not doing retrying of Callable).  In TRUNK we use an HTable instance -- i.e. a Callable w/ retries -- so we get retrying (that's a big change in trunk -- doing retries rather than one-time HConnection calls)
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:51 FYI, in trunk, hbck needs what this class does over in HBaseFSCK#processRegionServers.  It could use this class one day.  Currently it asks master for this cluster status (which wouldn't work where this is needed on master failover)
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:96 What is this?  It seems fb particular?  If no regionservers in zk, then it's a cluster startup which means?  Does it mean cluster is starting?  What if there was a regionserver up and running already but it had not yet been assigned any regions?  Wouldn't this be a clean cluster startup too?
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:107 Yeah, this stuff does not retry which may be ok on startup here.
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:235 Nice utility
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java:160 Misspelled
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:50 We don't have this class in TRUNK.  Was it added to 0.89fb?
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:88 Why delete it?  In case it has unassigned znodes?  I suppose this is legit if the isClusterStartup means no regionservers up on cluster.
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:128 ZKUtil.joinZNode does this.

  So we are going through each of the unassigned znodes and we are going to update .META.?  I see that in the loop, if we trip over .META., then we'll just return.  What's that about?  Is it that .META. is not assigned?  Are .META. and -ROOT- assigned before this method is called?
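
On the retry point raised for DirectRegionServerScanner.java:68, a generic "Callable with retries" wrapper might look like the sketch below. It only illustrates the idea. It is not the trunk HTable retry code, and the class name, method name, and parameters are hypothetical.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of retrying a Callable a bounded number of times. */
final class RetryingCallerSketch {
  /**
   * Invokes the callable up to maxAttempts times, sleeping pauseMillis between
   * failed attempts, and rethrows the last failure if every attempt fails.
   */
  static <T> T callWithRetries(Callable<T> call, int maxAttempts, long pauseMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (InterruptedException ie) {
        // Do not retry on interruption; restore the flag and give up.
        Thread.currentThread().interrupt();
        throw ie;
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(pauseMillis);
        }
      }
    }
    throw last;
  }
}
```

A one-shot RPC such as fetching the online regions from a regionserver could then be wrapped as callWithRetries(() -> fetchOnlineRegions(server), 3, 1000L), where fetchOnlineRegions is again a placeholder.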

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208180#comment-13208180 ] 

Phabricator commented on HBASE-5344:
------------------------------------

Karthik has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  Looks awesome! Couple of comments:

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java:254 Why do we rescanRSDirectory() here? Is it only to figure out if this is a fresh cluster start or not? Could we rename to checkIfFreshClusterStart()?
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:171 We should set this to false only after draining the queue, otherwise an older event could overwrite a newer one?

  Might need something like:

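  while (true) {
    // Check and dequeue under the lock so the queueing side cannot slip an
    // event in between the emptiness check and clearing the flag.
    synchronized (deferredLock) {
      if (bufferedEvents.isEmpty()) {
        // Queue fully drained: only now stop deferring new events.
        deferRegionEventProcessing = false;
        break;
      } else {
        // poll() never blocks; the queue is known to be non-empty here.
        event = bufferedEvents.poll();
      }
    }
    // Process outside the lock so the queueing side is not blocked.
    process(event);
  }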

  Need the lock on the queueing side as well.
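
A self-contained version of that idea, with the lock taken on both the queueing and the draining side, is sketched below. The field names follow the snippet above. The class itself, the String event type, and the method names are hypothetical and are not taken from ZKUnassignedWatcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of buffering events until an initial scan has finished. */
final class DeferredEventBufferSketch {
  private final Object deferredLock = new Object();
  private final Queue<String> bufferedEvents = new ArrayDeque<>();
  private boolean deferRegionEventProcessing = true;
  private final List<String> processed = new ArrayList<>();

  /** Queueing side: buffer the event while deferring, otherwise process it directly. */
  void onEvent(String event) {
    synchronized (deferredLock) {
      if (deferRegionEventProcessing) {
        bufferedEvents.add(event);
        return;
      }
    }
    process(event);
  }

  /** Draining side: the flag is cleared only once the queue is seen empty under the lock. */
  void drainAndStopDeferring() {
    while (true) {
      String event;
      synchronized (deferredLock) {
        event = bufferedEvents.poll();
        if (event == null) {
          deferRegionEventProcessing = false;
          return;
        }
      }
      // Process outside the lock so onEvent callers are not blocked.
      process(event);
    }
  }

  private void process(String event) {
    synchronized (processed) {
      processed.add(event);
    }
  }

  /** Returns a copy of the events processed so far, in processing order. */
  List<String> processedSnapshot() {
    synchronized (processed) {
      return new ArrayList<>(processed);
    }
  }
}
```

Because the flag flips only after the last buffered event has been handed to process, an event that arrives later cannot be processed ahead of an older buffered one.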


REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208103#comment-13208103 ] 

Phabricator commented on HBASE-5344:
------------------------------------

stack has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  Thanks for the explanation, Mikhail.  Let me tag along as a reviewer so I can follow what's going on and so I can help w/ the forward port.


  "We have a plan to move towards always having the full assignment in ZK (the UNASSIGNED directory will change its meaning then) to help guarantee that we never have a duplicate assignment and to have only one source of truth for assignment."

  So, the '/unassigned' dir will change to be named '/regions' or some such?  On region open, we'll update its state in zk to be OPENED? And leave it there (and update .META. too).  We don't really need .META. then?  Smile.

  "We would like to rely on ZK and (for now) on META instead to recover the region assignment on master startup/failure."  The hard part in here is on failover, what if the .META. is on a crashed server?  You'll need to process its logs and get .META. back online before you can proceed w/ failover.  To get it online, you'll need to listen for events (though I suppose you could filter and only process .META. events).  Or what if the server carrying .META. crashes during onlining?

  "Also, by the way, we are planning to unify master startup on a fresh cluster start and failover and everything in between, and use the same logic to build a coherent picture of region assignment."

  Sweet.

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208029#comment-13208029 ] 

Phabricator commented on HBASE-5344:
------------------------------------

mbautin has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  @stack: I am working on a new patch that would avoid directly talking to all RSs and will focus on the state in ZK. It will also assign ROOT and META out-of-band if they are not online.

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208089#comment-13208089 ] 

Phabricator commented on HBASE-5344:
------------------------------------

mbautin has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  The bigger context of our current/planned changes in 89-fb master is as follows. In 89-fb, region assignments happen as responses to RS -> master RPC, and RSs communicate success of region open operations back to the master through ZK. The master then writes the new assignments to META. ZK is the only piece in the picture that could be considered a trusted highly-available source of truth for the region assignment, if only it had all assignments. Currently the region assignment can be obtained from the combination of META and ZK's UNASSIGNED directory. We have a plan to move towards always having the full assignment in ZK (the UNASSIGNED directory will change its meaning then) to help guarantee that we never have a duplicate assignment and to have only one source of truth for assignment. We will also keep writing the region assignment to META for client backward-compatibility. Even though the master failover fix does not depend on those planned changes, I thought it would be useful to mention them here.

  Contacting all regionservers directly to get the region assignment is probably useful as a sanity-check, but it is not scalable, and is subject to unpredictable timeouts in the worst case. We would like to rely on ZK and (for now) on META instead to recover the region assignment on master startup/failure. Also, by the way, we are planning to unify master startup on a fresh cluster start and failover and everything in between, and use the same logic to build a coherent picture of region assignment.


REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Updated] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HBASE-5344:
-------------------------------

    Attachment: D1605.1.patch

mbautin requested code review of "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".
Reviewers: Kannan, Karthik, Liyin, JIRA, stack

  In case the master dies after a regionserver writes region state as OPENED or CLOSED in ZK but before the update is received by master and written to meta, the new master that comes up has to pick up the region state from ZK and write it to meta. Otherwise we can get multiply-assigned regions.

  The current solution tries to reassign the root region if it is unassigned but does not implement a work-around if META regions are missing. Also, it currently heavily relies on "direct scanning" of regionservers (reading regionserver list from ZK and doing an RPC on each regionserver to get the list of online regions). We were already doing that in master failover, but I am making it parallel here.
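
  As a rough illustration of the parallel "direct scanning" described above, the following Java sketch queries every regionserver concurrently under one overall deadline. RegionLister is a hypothetical stand-in for the RPC that returns a server's online regions, and the class and parameter names are assumptions for this example; the actual DirectRegionServerScanner in the patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the DirectRegionServerScanner from the patch. */
public class ParallelRegionScan {

    /** Stand-in for the RPC that asks one regionserver for its online regions. */
    interface RegionLister {
        List<String> onlineRegions(String server) throws Exception;
    }

    /**
     * Asks all servers in parallel. Servers that fail or do not answer before
     * the deadline are simply left out of the returned map.
     */
    static Map<String, List<String>> scan(List<String> servers, RegionLister lister,
                                          int threads, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, List<String>> result = new HashMap<>();
        try {
            List<Callable<List<String>>> calls = new ArrayList<>();
            for (String server : servers) {
                calls.add(() -> lister.onlineRegions(server));
            }
            // invokeAll returns futures in task order and cancels any task
            // that has not completed when the timeout expires.
            List<Future<List<String>>> futures =
                pool.invokeAll(calls, timeoutMs, TimeUnit.MILLISECONDS);
            for (int i = 0; i < servers.size(); i++) {
                try {
                    result.put(servers.get(i), futures.get(i).get());
                } catch (ExecutionException | CancellationException e) {
                    // The RPC failed or timed out: skip this server.
                }
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt and return whatever was collected so far.
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return result;
    }

    public static void main(String[] args) {
        RegionLister lister = server -> {
            if (server.equals("rs2")) {
                throw new RuntimeException("simulated dead regionserver");
            }
            return Arrays.asList(server + "-region");
        };
        System.out.println(scan(Arrays.asList("rs1", "rs2", "rs3"), lister, 4, 2000));
    }
}
```

  A single deadline for the whole batch keeps one slow regionserver from stalling failover, at the cost of treating that server's regions as unknown until another source accounts for them.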

TEST PLAN
  Unit tests, dev cluster, dark launch with killing regionservers and master

REVISION DETAIL
  https://reviews.facebook.net/D1605

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/executor/RegionTransitionEventData.java
  src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
  src/main/java/org/apache/hadoop/hbase/master/BaseScanner.java
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java
  src/main/java/org/apache/hadoop/hbase/master/RegionManager.java
  src/main/java/org/apache/hadoop/hbase/master/RootScanner.java
  src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java
  src/main/java/org/apache/hadoop/hbase/master/handler/MasterOpenRegionHandler.java
  src/test/java/org/apache/hadoop/hbase/master/TestRegionStateOnMasterFailure.java

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208036#comment-13208036 ] 

Phabricator commented on HBASE-5344:
------------------------------------

stack has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  Looking forward to it.

  Why the change in direction?  Why not ask the regionservers?  (Out of interest?)  Good stuff.

REVISION DETAIL
  https://reviews.facebook.net/D1605

                

[jira] [Commented] (HBASE-5344) [89-fb] Scan unassigned region directory on master failover

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203286#comment-13203286 ] 

Phabricator commented on HBASE-5344:
------------------------------------

stack has commented on the revision "[jira] [HBASE-5344] [89-fb] Scan unassigned region directory on master failover".

  I should work on the forward-port to trunk of this patch (it won't look much like this when done but there is some good stuff in here).

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:50 Nice.  Should we do same in trunk?
  src/main/java/org/apache/hadoop/hbase/master/DirectRegionServerScanner.java:156 Single thread access in here only?
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java:1207 This is interesting.  Are you just trying to figure out who called this method?
  src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java:62 We should do this in trunk too...
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java:254 So, you'd do this rather than wait on the regionservers to report in?  Via heartbeat?

  I need to glean from this patch stuff that will help our trunk story.

REVISION DETAIL
  https://reviews.facebook.net/D1605

                