
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/09/28 23:32:32 UTC

[jira] Created: (HBASE-3047) If new master crashes, restart is messy

If new master crashes, restart is messy
---------------------------------------

                 Key: HBASE-3047
                 URL: https://issues.apache.org/jira/browse/HBASE-3047
             Project: HBase
          Issue Type: Bug
            Reporter: stack
             Fix For: 0.90.0


If master crashes, the cluster-is-up flag is left stuck on.

On restart of the cluster, regionservers may come up before the master.  They'll have registered themselves in zk by the time the master assumes its role, and the master will think it's joining an up-and-running cluster when in fact this is a fresh startup.  Another problem is that there'll be a bad root region location up in zk.  Same for meta, and at the moment we're not handling bad root and meta very well.

Here's a sample of the kind of issues we're running into:

{code}
2010-09-25 23:53:13,938 FATAL org.apache.hadoop.hbase.master.HMaster:
Unhandled exception. Starting shutdown.
java.io.IOException: Call to /10.20.20.188:60020 failed on local
exception: java.io.IOException: Connection reset by peer
   at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:781)
   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:255)
   at $Proxy1.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:412)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:388)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:435)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:345)
   at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:889)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:350)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:209)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:241)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:286)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:326)
   at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:157)
   at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:140)
   at org.apache.hadoop.hbase.master.AssignmentManager.rebuildUserRegions(AssignmentManager.java:753)
   at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:174)
   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:314)
Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
   at sun.nio.ch.IOUtil.read(IOUtil.java:206)
{code}

Notice that we think it's a case of processFailover, so we think we can just scan meta to fix up our in-memory picture of the running cluster, only the scan of meta fails because meta is not assigned.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916296#action_12916296 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/#review1360
-----------------------------------------------------------

Ship it!


Commit!  Just fix the missing insertion into deadServers map on commit as discussed.

- Jonathan






[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915992#action_12915992 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/#review1349
-----------------------------------------------------------


Overall this looks like a good improvement over what we had.  I'm still a little confused about isRunningCluster (or isProperRunningCluster per comments).

Repeat from inline comments, but, is there ever a time a single region is deployed and we don't want to trigger the failover codepath?

Isn't the case we're really protecting against here that the cluster was not shut down properly, so the cluster status flag is up when it shouldn't be?

And does this handle the case where the cluster is killed quickly and then restarted, so the master ephemeral node is actually still there?  Then the RS will see the master node and the cluster-up node and start up, but potentially without a real master?


trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<http://review.cloudera.org/r/915/#comment4482>

    Why is this an "implementation"?  Doesn't the HRI represent the actual connection object?  I get that it's an implementation of HRI but normally that would be used in class names implementing?  No biggie, should just be consistent and seems a weird name to me (I think I was referring to this stuff as "connection" elsewhere in the class in method names/variable names)



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4481>

    Is this really the exception we want to throw (commons.lang)?  Or this is just short-term temporary?



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4483>

    yay thanks



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4484>

    So the case we are adding here (but just throwing an exception for now) is: master came up, did not think it was a fresh cluster (because the cluster status flag in zk is up? maybe note in comments above?), but we determine the cluster was not running because ROOT and META are not assigned.
    
    What about case where other regions are assigned?  Should this check actually be whether _any_ regions are assigned?  I think we discussed this, and I think looking for root/meta covers most cases, but maybe add a TODO?
    
    Though, even in failover case, we'll need to handle ROOT/META not being properly assigned, so if _any_ regions are assigned we would trigger failover, if no regions assigned we would assume it actually is a cluster startup and go into the branch of code which currently throws the exception.
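
The three-way startup decision discussed in this comment can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: `StartupClassifier`, `ClusterState`, and the parameter names are hypothetical, and it assumes the master can already tell whether the cluster status flag is set in zk and how many regions (catalog or user) it has confirmed as deployed.

```java
/** Hypothetical sketch of the startup decision; none of these names come from the HBase source. */
final class StartupClassifier {

  enum ClusterState {
    FRESH_STARTUP,      // cluster status flag not set: nothing was running
    FAILOVER,           // flag set and at least one region is really deployed
    STALE_FLAG_STARTUP  // flag set but nothing deployed: flag left over from a crash
  }

  /**
   * @param clusterUpFlagSet    whether the cluster status flag is present in zk
   * @param deployedRegionCount regions confirmed as deployed on some regionserver
   */
  static ClusterState classify(boolean clusterUpFlagSet, int deployedRegionCount) {
    if (deployedRegionCount < 0) {
      throw new IllegalArgumentException("deployedRegionCount must be >= 0");
    }
    if (!clusterUpFlagSet) {
      // Assumption: without the flag we always treat this as a clean start.
      return ClusterState.FRESH_STARTUP;
    }
    // Any deployed region means a live cluster that the new master is joining.
    return deployedRegionCount > 0
        ? ClusterState.FAILOVER
        : ClusterState.STALE_FLAG_STARTUP;
  }

  public static void main(String[] args) {
    // Flag stuck on after a crash, nothing deployed yet.
    System.out.println(classify(true, 0)); // prints STALE_FLAG_STARTUP
  }
}
```

Under this reading, `STALE_FLAG_STARTUP` corresponds to the branch that, per the review summary, currently just throws an exception.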



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4485>

    javadoc about what this method does to determine if it's running cluster



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4486>

    So this method would be "proper running cluster"?
    
    Isn't it the case that if a single region is deployed anywhere we are not in startup, we are failover?



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
<http://review.cloudera.org/r/915/#comment4487>

    looks good


- Jonathan






[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916007#action_12916007 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/#review1353
-----------------------------------------------------------


Here are a few comments on yours.

Actually, testing this patch on a cluster brought up some issues.  I think I should recast.  I have some ideas on how.  v2 coming.  Will incorporate your comments below.


trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<http://review.cloudera.org/r/915/#comment4495>

    I can change it (you get my intent, but it's still confusing, so I should change it).



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4496>

    Yeah, what you say.  Let me fix up comments.



trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4497>

    will do


- stack






[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915967#action_12915967 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/
-----------------------------------------------------------

Review request for hbase, stack and Jonathan Gray.


Summary
-------

This is patch from Stack, just putting up on rb.

M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
  Add test of case where HRegionInterface connection throws a
  ConnectionException. Also tests two new verify root and meta 
  locations added to CatalogTracker.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  Change order in which we start up trackers in ZK.  Also add blocking
  until master is up to make it less likely we'll start before master
  comes up, especially around the cluster start up situation.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Introduce new state on startup, the case where the cluster is
  NOT a fresh startup and it's NOT a cluster where all is fully
  assigned.  The repair the master needs to run to fix up this new
  state is not yet done; we throw a NotImplementedException for
  now.  TODO.  Added new isRunningCluster checker used in figuring
  what the cluster condition is when master is joining.  Not
  comprehensive but good enough for now.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
  Javadoc.
  Added new verifyRootRegionLocation and verifyMetaRegionLocation.
  Needed to verify what's in zk is actually the locations of catalog
  regions.
M src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
  Add fact that the verifying method, getRegionInfo, can throw
  ConnectException
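
The idea behind the verify-location methods listed above can be sketched as below. This is an assumption-laden illustration rather than the patch itself: `CatalogLocationCheck` and `RegionProbe` are hypothetical names, with `RegionProbe` standing in for a remote call such as `getRegionInfo` against the address recorded in zk. Only `java.net.ConnectException` is a real class here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;

/** Hypothetical sketch; not the CatalogTracker code from the patch. */
final class CatalogLocationCheck {

  /** Stand-in for a remote call made against the location recorded in zk. */
  interface RegionProbe {
    void probe() throws IOException;
  }

  /**
   * Returns true if the probe succeeds, false if nothing is listening at the
   * recorded address. Other I/O failures are rethrown unchecked so the caller
   * can decide what to do (a choice made for this sketch only).
   */
  static boolean verifyLocation(RegionProbe probe) {
    try {
      probe.probe();
      return true;
    } catch (ConnectException e) {
      // The recorded location is stale: no server is accepting connections there.
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    boolean ok = verifyLocation(() -> {
      throw new ConnectException("Connection refused");
    });
    System.out.println(ok); // prints false
  }
}
```

A master could run a check like this on the root and meta locations before deciding that it is joining a running cluster.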


This addresses bug HBASE-3047.
    http://issues.apache.org/jira/browse/HBASE-3047


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1002359 
  trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1002359 

Diff: http://review.cloudera.org/r/915/diff


Testing
-------


Thanks,

Jonathan





[jira] Updated: (HBASE-3047) If new master crashes, restart is messy

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3047:
-------------------------

    Attachment: 3047.txt

{code}
M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
  Add test of case where HRegionInterface connection throws a
  ConnectionException. Also tests two new verify root and meta 
  locations added to CatalogTracker.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  Change order in which we start up trackers in ZK.  Also add blocking
  until master is up to make it less likely we'll start before master
  comes up, especially around the cluster start up situation.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Introduce new state on startup, the case where the cluster is
  NOT a fresh startup and it's NOT a cluster where all is fully
  assigned.  The repair the master needs to run to fix up this new
  state is not yet done; we throw a NotImplementedException for
  now.  TODO.  Added new isRunningCluster checker used in figuring
  what the cluster condition is when master is joining.  Not
  comprehensive but good enough for now.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
  Javadoc.
  Added new verifyRootRegionLocation and verifyMetaRegionLocation.
  Needed to verify what's in zk is actually the locations of catalog
  regions.
M src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
  Add fact that the verifying method, getRegionInfo, can throw
  ConnectException
{code}


[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916291#action_12916291 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/
-----------------------------------------------------------

(Updated 2010-09-29 15:12:30.449220)


Review request for hbase, stack and Jonathan Gray.


Changes
-------

New version.  Comes out of back and forth w/ Jon.


Summary
-------

This is a patch from Stack; just putting it up on rb.

M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
  Add test of the case where the HRegionInterface connection throws a
  ConnectException.  Also tests the two new verify root and meta
  location methods added to CatalogTracker.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  Change the order in which we start up trackers in ZK.  Also add
  blocking until master is up to make it less likely we'll start before
  master comes up, especially around the cluster startup situation.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Introduce a new state on startup: the case where the cluster is
  NOT a fresh startup and it's NOT a cluster where all is fully
  assigned.  The repair the master needs to run to fix up this new
  state is not yet done; we throw a NotImplementedException for
  now.  TODO.  Added a new isRunningCluster checker used for figuring
  out what the cluster condition is when the master is joining.  Not
  comprehensive but good enough for now.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
  Javadoc.
  Added new verifyRootRegionLocation and verifyMetaRegionLocation.
  Needed to verify that what's in zk is actually the locations of the
  catalog regions.
M src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
  Document that the verifying method, getRegionInfo, can throw
  ConnectException.
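
For anyone following along, here is a rough sketch of the kind of check the verify methods above are described as doing: ask the server recorded in zk whether it really hosts the catalog region, and treat a ConnectException as a stale location.  This is not the code in the patch.  RegionServerStub and CatalogLocationCheck are made-up names for illustration, and the String-based getRegionInfo signature is an assumption, not the real HRegionInterface method.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical stand-in for the RPC interface to a regionserver (not the real HRegionInterface). */
interface RegionServerStub {
    /** Assumed signature: returns the name of the region if this server is hosting it. */
    String getRegionInfo(String regionName) throws IOException;
}

/** Hypothetical helper showing one way a catalog location read from zk could be verified. */
final class CatalogLocationCheck {
    /**
     * Returns true only if the server answers and reports the expected region.
     * A null server models "no location recorded in zk".
     */
    static boolean verifyLocation(RegionServerStub server, String regionName) {
        if (server == null) {
            return false;
        }
        try {
            return regionName.equals(server.getRegionInfo(regionName));
        } catch (ConnectException e) {
            // Nobody is listening at the recorded address, so the location is stale.
            return false;
        } catch (IOException e) {
            // Any other I/O failure also means the location could not be verified.
            return false;
        }
    }
}
```

A caller that gets false back would presumably go on to reassign the catalog region rather than scan it; that part is left out of the sketch.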


This addresses bug HBASE-3047.
    http://issues.apache.org/jira/browse/HBASE-3047


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/RemoteExceptionHandler.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Leases.java 1002359 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1002359 
  trunk/src/main/resources/hbase-default.xml 1002359 
  trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1002359 

Diff: http://review.cloudera.org/r/915/diff


Testing
-------


Thanks,

Jonathan





[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916050#action_12916050 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/
-----------------------------------------------------------

(Updated 2010-09-28 23:31:22.975377)


Review request for hbase, stack and Jonathan Gray.


Changes
-------

Here, this should be more robust.  Your comments should be addressed also.  For sure, AM#processFailover has holes -- e.g. what if a regionserver crashed while new master was coming up -- but let's address that in another issue.  Below are notes on changes made since v1 of the patch.

M src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java
  The change here was because I saw a case where we hung forever (my guess is that remaining became equal to NO_TIMEOUT).  Redid the logic here.
M src/main/java/org/apache/hadoop/hbase/regionserver/Leases.java
  Set this thread to be a daemon.  Have seen it hold up RS shutdowns.
M src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
  Renamed the initialize method as createInitialFileSystemLayout, made it private, and called it from the constructor.  It's idempotent and cheap, and others need not be concerned with these mechanics; encapsulate it.
M src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
  Removed freshClusterStartup flag.  Now, let any 'unknown' server in and register it UNLESS it's a dead server (fixed up expiration so we add to dead servers BEFORE we remove from online servers).  Have waitForRegionServers return the count of regions out on the cluster.  This will be 0 if servers are coming in with a clean regionServerStartup, but if they came in and were registered on a regionServerReport, then they'll have a filled-out HServerLoad with a count of regions.  Use the count of regions as the way to tell whether regions are out on the cluster or not.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Removed freshClusterStartup.  Added logging of the state of the cluster-up flag and the # of regionservers out on the cluster.  Use the count of regions out on the cluster to figure whether we are to assign all user regions or instead process failover.  Added splitting of WALs always, and check and reassign of root and meta, whether fresh startup or failover.
M src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
  Added notes on holes in processFailover.
M src/main/resources/hbase-default.xml
  Set checkin down from 5 to 3 seconds again.
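
To make the ServerManager and HMaster notes above a little more concrete, here is a minimal sketch of the two ideas: choosing between fresh startup and failover from the count of regions out on the cluster, and adding a server to the dead set BEFORE removing it from the online set.  It is not the patch itself.  MasterStartupSketch, Mode, chooseMode, register, expire and isOnline are all names invented for this illustration, and servers are modelled as plain strings.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; none of these names come from the HBase source. */
final class MasterStartupSketch {
    enum Mode { FRESH_STARTUP, FAILOVER }

    /**
     * If no regions are out on the cluster, treat this as a fresh startup and
     * assign everything; otherwise assume we are joining a running cluster.
     */
    static Mode chooseMode(int regionsOnCluster) {
        return regionsOnCluster == 0 ? Mode.FRESH_STARTUP : Mode.FAILOVER;
    }

    private final Set<String> onlineServers = ConcurrentHashMap.newKeySet();
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    /** Let any unknown server in and register it, unless it is a known dead server. */
    boolean register(String serverName) {
        if (deadServers.contains(serverName)) {
            return false;
        }
        onlineServers.add(serverName);
        return true;
    }

    /**
     * Add to the dead set first, then remove from the online set, so there is
     * no instant at which the expired server is in neither set.
     */
    void expire(String serverName) {
        deadServers.add(serverName);
        onlineServers.remove(serverName);
    }

    boolean isOnline(String serverName) {
        return onlineServers.contains(serverName);
    }
}
```

With this shape, chooseMode(0) picks the assign-everything path, and a server that has been through expire is turned away by register if it reports in again under the same name.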




This addresses bug HBASE-3047.
    http://issues.apache.org/jira/browse/HBASE-3047


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Leases.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1001981 
  trunk/src/main/resources/hbase-default.xml 1001981 
  trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1001981 

Diff: http://review.cloudera.org/r/915/diff


Testing
-------


Thanks,

Jonathan





[jira] Resolved: (HBASE-3047) If new master crashes, restart is messy

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3047.
--------------------------

    Hadoop Flags: [Reviewed]
        Assignee: stack
      Resolution: Fixed

Thanks for the review, Jon.  The not-putting-servername-into-deadservers, despite the big comment about how important doing so at that point is, was a good catch.  Committed earlier today.


[jira] Commented: (HBASE-3047) If new master crashes, restart is messy

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916210#action_12916210 ] 

HBase Review Board commented on HBASE-3047:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/
-----------------------------------------------------------

(Updated 2010-09-29 10:23:49.845051)


Review request for hbase, stack and Jonathan Gray.


Changes
-------

More cleanup




This addresses bug HBASE-3047.
    http://issues.apache.org/jira/browse/HBASE-3047


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/RemoteExceptionHandler.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Leases.java 1001981 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1001981 
  trunk/src/main/resources/hbase-default.xml 1001981 
  trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1001981 

Diff: http://review.cloudera.org/r/915/diff


Testing
-------


Thanks,

Jonathan





[jira] Updated: (HBASE-3047) If new master crashes, restart is messy

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3047:
-------------------------

    Attachment: 3047-final.txt

What I committed -- last diff up on review board plus Jon's suggestion.
