Posted to dev@hbase.apache.org by "Nitay Joffe (JIRA)" <ji...@apache.org> on 2009/03/31 20:00:51 UTC

[jira] Created: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

When a new master comes up, regionservers should continue with their region assignments from the last master
------------------------------------------------------------------------------------------------------------

                 Key: HBASE-1302
                 URL: https://issues.apache.org/jira/browse/HBASE-1302
             Project: Hadoop HBase
          Issue Type: Improvement
    Affects Versions: 0.20.0
            Reporter: Nitay Joffe
             Fix For: 0.20.0


After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711874#action_12711874 ] 

stack commented on HBASE-1302:
------------------------------

+1 on this patch.

I spent some time testing: killing the master and bringing it back up again after a little while. All continued without a hiccup.  Commit!

Only issue I ran into was when I tried to start master on another machine.  Then things got a little odd.  Had to change the master address -- as fellas have already speculated -- but then I was getting this:

{code}
2009-05-21 23:58:28,669 [main] WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:518)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:293)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:314)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeMasterAddress(ZooKeeperWrapper.java:402)
        at org.apache.hadoop.hbase.master.HMaster.writeAddressToZooKeeper(HMaster.java:259)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:249)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1093)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1130)
{code}

Which seems wrong.  Why does it not just assume what is under /hbase?


[jira] Updated: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1302:
--------------------------------------

    Attachment: hbase-1302-v3.patch

Patch that fixes the NodeCreated/NodeDeleted confusion. Now I see:

{code}
2009-05-22 15:47:32,344 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeCreated, path: /hbase/master
2009-05-22 15:47:32,345 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 192.168.1.81:62000
2009-05-22 15:47:32,345 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 192.168.1.81:62000 that we are up
....
....
2009-05-22 15:49:12,250 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDeleted, path: /hbase/master
2009-05-22 15:49:12,252 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
2009-05-22 15:49:12,285 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeCreated, path: /hbase/master
2009-05-22 15:49:12,286 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 192.168.1.83:62000
2009-05-22 15:49:12,286 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 192.168.1.83:62000 that we are up
{code}


[jira] Updated: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1302:
--------------------------------------

    Attachment: hbase-1302-v1.patch

First cut on this issue; I don't know if it passes the unit tests. It was used on a 4-node cluster with 24 regions under no stress, and it was working.


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712765#action_12712765 ] 

stack commented on HBASE-1302:
------------------------------

Yes, I did as you speculated J-D.  I changed hbase-site.xml only.  And yes, I agree, we need to fix it but it belongs elsewhere.

I took a quick look at the patch and it looks good.  I'd say go ahead and commit.




[jira] Updated: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1302:
--------------------------------------

    Attachment: hbase-1302-v2.patch

Second version of the patch. The RS directory is now cleared when the master is cleanly shutting down.


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706575#action_12706575 ] 

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I will follow Jim's advice, which is roughly the same as what's in the Bigtable paper.



[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694236#action_12694236 ] 

Jim Kellerman commented on HBASE-1302:
--------------------------------------

What I would expect is that in Master.run(), before calling startServiceThreads, the master queries ZK for known region servers and populates serversToServerInfo.

In the case of a cluster that is already running, the master then needs to find out which server is serving the root region.

Then, before calling startServiceThreads, it should invoke the method that recovers dead region server logs (HBASE-698).
Making this issue a blocker for HBASE-698.
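
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual HMaster code: apart from serversToServerInfo and startServiceThreads, every name is hypothetical, ZK is replaced by a plain in-memory map, and treating "ZK lists at least one region server" as "the cluster is already running" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: apart from serversToServerInfo and startServiceThreads, all names are hypothetical. */
final class MasterStartupSketch {
    /** Stand-in for the region server entries the master would read from ZK (name -> address). */
    private final Map<String, String> regionServersInZk;
    /** Plays the role of serversToServerInfo: the servers the master considers known. */
    final Map<String, String> serversToServerInfo = new TreeMap<>();
    /** Records the order in which the startup steps ran. */
    final List<String> steps = new ArrayList<>();

    MasterStartupSketch(Map<String, String> regionServersInZk) {
        this.regionServersInZk = regionServersInZk;
    }

    void run() {
        // 1. Ask ZK which region servers exist and remember them.
        serversToServerInfo.putAll(regionServersInZk);
        steps.add("populateServersFromZk");

        // 2. Assumption: known servers mean the cluster is already running,
        //    so the server holding the root region has to be located.
        if (!serversToServerInfo.isEmpty()) {
            steps.add("locateRootRegion");
        }

        // 3. Recover the logs of dead region servers (HBASE-698).
        steps.add("recoverDeadServerLogs");

        // 4. Only now start the service threads.
        steps.add("startServiceThreads");
    }

    public static void main(String[] args) {
        Map<String, String> zk = new TreeMap<>();
        zk.put("rs-1", "192.168.1.82:60020"); // made-up entry for illustration
        MasterStartupSketch master = new MasterStartupSketch(zk);
        master.run();
        System.out.println(master.steps);
    }
}
```

The point of the sketch is only the ordering: everything the new master learns from ZK is in place before any service thread can start handing out regions.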


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711794#action_12711794 ] 

stack commented on HBASE-1302:
------------------------------

This patch is amazing.  I have a little cluster up, I kill the master... while it's down, I can scan.  I bring the master back up.  All regions stay where they were previously.  Again I can scan.  Let me do a bit more heavy-duty testing but this is eXXXXXXcellent.


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707502#action_12707502 ] 

Andrew Purtell commented on HBASE-1302:
---------------------------------------

Failover activity can happen if the cluster is restarted relatively quickly. I use an HBase ZK session timeout of 30000 ms (30 seconds). If I run stop-hbase.sh, wait for the master to exit, then run start-hbase.sh within a few seconds, I see

2009-05-8 19:55:36,906 INFO org.apache.hadoop.hbase.master.HMaster: This is a failover, ZK inspection begins...
2009-05-8 19:55:36,940 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /172.20.3.229:60020. Already tried 0 time(s).

Then nothing. 


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707507#action_12707507 ] 

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

Yes, this is something I've seen too. Maybe when a master shuts down cleanly it should clear out its folders in ZK?
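
A toy model of why clearing on clean shutdown would help. It assumes the starting master decides "this is a failover" whenever region server entries are still present in ZK, which is what the log line above suggests but is not confirmed here. The class and method names are hypothetical, and ZK is replaced by an in-memory set.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: an in-memory stand-in for the region server directory in ZK. */
final class FailoverDetectionSketch {
    /** Entries that region servers have registered and that have not yet gone away. */
    private final Set<String> rsDirectory = new TreeSet<>();

    void regionServerRegisters(String name) {
        rsDirectory.add(name);
    }

    /** Assumed rule: leftover region server entries make a starting master assume a failover. */
    boolean startingMasterSeesFailover() {
        return !rsDirectory.isEmpty();
    }

    /** A clean shutdown clears the directory; an unclean one leaves the entries behind. */
    void masterShutsDown(boolean clean) {
        if (clean) {
            rsDirectory.clear();
        }
    }

    public static void main(String[] args) {
        FailoverDetectionSketch zk = new FailoverDetectionSketch();
        zk.regionServerRegisters("rs-1"); // made-up name for illustration

        // Quick restart without clearing: entries linger until their sessions time out.
        zk.masterShutsDown(false);
        System.out.println("failover after unclean stop: " + zk.startingMasterSeesFailover());

        // Clean stop that clears the directory first.
        zk.masterShutsDown(true);
        System.out.println("failover after clean stop: " + zk.startingMasterSeesFailover());
    }
}
```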


[jira] Resolved: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-1302.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed in trunk.


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712199#action_12712199 ] 

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

Stack, if you used an HBase-managed ZK instance, then your zoo.cfg has this line:

server.0=${hbase.master.hostname}:2888:3888

And when you changed hbase.master.hostname in hbase-site.xml, it changed there too, so your new master tried to connect to itself. I'd say that stuff is in the scope of 1357/1445. Waiting for your confirmation before committing.
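
The effect of that zoo.cfg line can be illustrated with a small sketch. It assumes the `${...}` placeholder is filled from the HBase configuration by plain textual substitution; the class and method names are hypothetical, and the addresses are the ones that appear in the logs in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: shows how a ${name} placeholder in a zoo.cfg line follows the configuration. */
final class ZooCfgSubstitutionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} in the line with its value from conf; unknown names are left as-is. */
    static String substitute(String line, Map<String, String> conf) {
        Matcher m = VAR.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = conf.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "server.0=${hbase.master.hostname}:2888:3888";
        Map<String, String> conf = new HashMap<>();

        conf.put("hbase.master.hostname", "192.168.1.81");
        System.out.println(substitute(line, conf));

        // Changing only the master hostname also moves the ZK server entry.
        conf.put("hbase.master.hostname", "192.168.1.83");
        System.out.println(substitute(line, conf));
    }
}
```

Under that assumption, editing the master hostname silently repoints the ZK quorum entry at the new machine, even though the ZK server itself has not moved.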


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226 ] 

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I actually tried the same thing. I didn't get the "failed to create" exception, but I got this instead (it never stops):

{code}
2009-05-22 14:59:48,126 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
{code}

We don't get this forever when the master is restarted on the same node, because HRS.hbaseMaster still points at the same address. In fact the problem is in this code:

{code}
public void process(WatchedEvent event) {
  EventType type = event.getType();
  KeeperState state = event.getState();
  LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
           type + ", path: " + event.getPath());

  // Ignore events if we're shutting down.
  if (stopRequested.get()) {
    LOG.debug("Ignoring ZooKeeper event while shutting down");
    return;
  }

  if (state == KeeperState.Expired) {
    LOG.error("ZooKeeper session expired");
    restart();
  } else if (type == EventType.NodeCreated) {
    getMaster();

    // ZooKeeper watches are one time only, so we need to re-register our watch.
    watchMasterAddress();
  }
}
{code}

I see that the node is deleted, but I never see it being created, because we don't set a watch after a NodeDeleted event. We should, because otherwise we will never know when the master comes back. This should be changed: we have to set a watch when the master node is deleted, and then watch the folder to see when it's recreated.
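
The difference can be seen in a small simulation. This is a sketch under stated assumptions, not the HRegionServer code: it does not use the ZooKeeper client library, it models only the one-shot nature of watches plus create and delete events on a single node, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: simulates a one-shot watch on the master address node. */
final class MasterWatchSketch {
    enum EventType { NodeCreated, NodeDeleted }

    private String masterNode;        // stand-in for the /hbase/master data; null means no master
    private boolean watchSet = true;  // a watch fires once, then has to be set again
    private final boolean rewatchOnDelete;
    /** Master addresses the region server has learned about, in order. */
    final List<String> mastersSeen = new ArrayList<>();

    MasterWatchSketch(boolean rewatchOnDelete) {
        this.rewatchOnDelete = rewatchOnDelete;
    }

    /** Simulates a master writing (non-null) or losing (null) its address node. */
    void setMasterNode(String address) {
        EventType type = (address != null) ? EventType.NodeCreated : EventType.NodeDeleted;
        masterNode = address;
        if (watchSet) {
            watchSet = false; // the pending watch is consumed by this event
            process(type);
        }
    }

    /** Region server side: what to do with a delivered event. */
    private void process(EventType type) {
        if (type == EventType.NodeCreated) {
            mastersSeen.add(masterNode); // analogous to getMaster()
            watchSet = true;             // analogous to watchMasterAddress()
        } else if (type == EventType.NodeDeleted && rewatchOnDelete) {
            watchSet = true;             // the re-registration that was missing
        }
    }

    public static void main(String[] args) {
        for (boolean rewatch : new boolean[] {false, true}) {
            MasterWatchSketch rs = new MasterWatchSketch(rewatch);
            rs.setMasterNode("192.168.1.81:62000"); // first master registers
            rs.setMasterNode(null);                 // master goes away
            rs.setMasterNode("192.168.1.83:62000"); // new master on another machine
            System.out.println("rewatchOnDelete=" + rewatch + " -> " + rs.mastersSeen);
        }
    }
}
```

Without the re-registration the region server only ever learns the first address, which matches the endless retries against 192.168.1.81 in the log above.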


[jira] Assigned: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-1302:
-----------------------------------------

    Assignee: Jean-Daniel Cryans  (was: Nitay Joffe)


[jira] Updated: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Nitay Joffe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated HBASE-1302:
-------------------------------

    Component/s: regionserver
                 master


[jira] Assigned: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Nitay Joffe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe reassigned HBASE-1302:
----------------------------------

    Assignee: Nitay Joffe

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>             Fix For: 0.20.0
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694237#action_12694237 ] 

Jim Kellerman commented on HBASE-1302:
--------------------------------------

In the case of an entire cluster startup, region servers should not register with ZK until after they have completed 'reportForDuty'.
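
A minimal sketch of the ordering suggested above, not code from the patch. The `Master` and `ZkRegistry` interfaces and the `startRegionServer` method are hypothetical stand-ins invented for illustration; they are not the real HBase or ZooKeeper APIs. The only point shown is that ZK registration happens strictly after a successful 'reportForDuty'.

```java
import java.util.ArrayList;
import java.util.List;

public class StartupOrderSketch {
    /** Hypothetical stand-in for the master RPC; not the real HBase interface. */
    public interface Master {
        boolean reportForDuty(String serverName);
    }

    /** Hypothetical stand-in for a ZooKeeper wrapper; not the real HBase class. */
    public interface ZkRegistry {
        void register(String serverName);
    }

    /**
     * Registers the region server with ZK only after reportForDuty has succeeded.
     * Returns true if the server ended up registered, false otherwise.
     */
    public static boolean startRegionServer(String serverName, Master master, ZkRegistry zk) {
        if (!master.reportForDuty(serverName)) {
            // The master did not accept this server, so it stays out of ZK.
            return false;
        }
        zk.register(serverName);
        return true;
    }

    public static void main(String[] args) {
        // Record the order of calls to make the sequencing visible.
        List<String> events = new ArrayList<>();
        Master master = name -> {
            events.add("reportForDuty:" + name);
            return true;
        };
        ZkRegistry zk = name -> events.add("zkRegister:" + name);

        startRegionServer("rs1", master, zk);
        System.out.println(events);
    }
}
```

With this ordering, a master that watches ZK would only ever see region servers it has already heard from, assuming registration is the only way a server becomes visible there.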

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>             Fix For: 0.20.0
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707510#action_12707510 ] 

Andrew Purtell commented on HBASE-1302:
---------------------------------------

That sounds like a good idea.

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.20.0
>
>         Attachments: hbase-1302-v1.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 


[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707408#action_12707408 ] 

Andrew Purtell commented on HBASE-1302:
---------------------------------------

Testing this patch here now. (Also on a 4-node cluster.)

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.20.0
>
>         Attachments: hbase-1302-v1.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 
