Posted to user@hbase.apache.org by "Brush,Ryan" <RB...@CERNER.COM> on 2011/04/01 17:48:44 UTC

NoRouteToHostException causes Master abort when the RegionServer hosting ROOT is not available

This happens under similar conditions to HBASE-3617 but is a distinct problem. When the region server hosting ROOT isn't available during restart, the NoRouteToHostException propagates all the way up the call stack and causes the master to abort. It looks like this can be addressed by handling NoRouteToHostException somewhere along that call path and considering that node/region server offline.
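
The handling suggested above could look roughly like the following sketch. It is not HBase code: `RootLocationCheck`, `ServerProbe`, and `isServerReachable` are hypothetical names made up for illustration, and the probe stands in for whatever call actually opens the connection to the server hosting ROOT. The only idea shown is catching `java.net.NoRouteToHostException` and reporting the server as offline instead of letting the exception reach the caller.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

public class RootLocationCheck {

    /** Hypothetical stand-in for the call that connects to the server hosting ROOT. */
    public interface ServerProbe {
        void connect() throws IOException;
    }

    /**
     * Returns true if the probe connects, false if the host has no route.
     * Any other IOException still propagates to the caller unchanged.
     */
    public static boolean isServerReachable(ServerProbe probe) throws IOException {
        try {
            probe.connect();
            return true;
        } catch (NoRouteToHostException e) {
            // Assumption: an unroutable host is treated as an offline server,
            // so the caller can reassign ROOT rather than abort.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the failure from the stack trace without touching the network.
        boolean reachable = isServerReachable(() -> {
            throw new NoRouteToHostException("No route to host");
        });
        System.out.println("reachable=" + reachable);
    }
}
```

Whether other connection failures should be treated the same way is a separate decision; this sketch deliberately handles only the exception seen in the trace.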

I applied the patch from HBASE-3617 and it didn't fix the problem I'm seeing, which I expected given the stack trace below. Assuming this reasoning is correct, does this merit a separate JIRA? It does seem critical in that the failure of a single node is preventing us from bringing up our cluster.

2011-04-01 10:15:19,472 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=2, stopped=false, count of regions out on cluster=0
2011-04-01 10:15:19,486 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://iphadoop01:9000/hbase/.logs/iphadoop03.northamerica.cerner.net,60020,1301665635981 belongs to an existing region server
2011-04-01 10:15:19,486 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://iphadoop01:9000/hbase/.logs/iphadoop05.northamerica.cerner.net,60020,1301665659785 belongs to an existing region server
2011-04-01 10:15:22,508 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.net.NoRouteToHostException: No route to host
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
     at $Proxy6.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
     at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:385)
     at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:211)
     at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:458)
     at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:425)
     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:383)
     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:278)
2011-04-01 10:15:22,510 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-04-01 10:15:22,510 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads

----------------------------------------------------------------------
CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

Re: NoRouteToHostException causes Master abort when the RegionServer hosting ROOT is not available

Posted by "Brush,Ryan" <RB...@CERNER.COM>.
I've verified this was indeed caused by HBASE-3660, and the fix for it
resolved the issue in our environment. Thanks!


On 4/1/11 10:57 AM, "Stack" <st...@duboce.net> wrote:

>The below looks like HBASE-3660, 'HMaster will exit when starting with
>stale data in cached locations such as -ROOT- or .META.', included in
>0.90.2 RC.
>St.Ack


Re: NoRouteToHostException causes Master abort when the RegionServer hosting ROOT is not available

Posted by Stack <st...@duboce.net>.
The below looks like HBASE-3660, 'HMaster will exit when starting with
stale data in cached locations such as -ROOT- or .META.', included in
0.90.2 RC.
St.Ack
