Posted to dev@hbase.apache.org by Karthik Ranganathan <kr...@facebook.com> on 2010/01/29 00:57:39 UTC

Cannot locate root region

Hey guys,

Ran into some issues while testing and wanted to better understand what happened. I got the following exception when I went to the web UI:

Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)


From a program that reads from an HBase table:
java.lang.reflect.UndeclaredThrowableException
        at $Proxy1.getRegionInfo(Unknown Source)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
<snip>


Followed up in the HMaster's log:

2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
org.apache.hadoop.hbase.TableExistsException: test1
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

From an HRegionServer's logs:

2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP


The code says the following:
              case MSG_CALL_SERVER_STARTUP:
                // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
                // get it when the master is panicking because for instance
                // the HDFS has been yanked out from under it.  Be wary of
                // this message.

Any ideas on what is going on? The best I can come up with is perhaps a flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?


Thanks
Karthik


Re: Cannot locate root region

Posted by Stack <st...@duboce.net>.
On Fri, Jan 29, 2010 at 11:29 AM, Joydeep Sarma <js...@gmail.com> wrote:
> i meant even if we were using hostnames for RS registration (which i
> think has a lot of advantages - not necessarily in our environment
> though) -

Agreed, we should use hostnames for the advantages listed earlier in
this thread...

> the master processing of the heartbeat (or whatever it's
> processing) shouldn't require a forward lookup. if it needs the ip
> address - it already has that via the connection object.
>

Agreed.

There is an old issue to address the current dumb lookup on each
heartbeat, which opens us up to damage if DNS starts flapping:
https://issues.apache.org/jira/browse/HBASE-1679. It's not yet fixed.
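
For illustration, a minimal sketch of that fix - hypothetical names,
not the actual ServerManager code - where the heartbeat map is keyed
off the address the RPC connection already supplies, so nothing on the
report path ever touches DNS:

import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatTracker {
  // Keyed by "ip:port" taken straight from the connection; no forward lookup.
  private final Map<String, Long> lastReport = new ConcurrentHashMap<String, Long>();

  // 'remote' stands in for the peer address the server's connection
  // object already holds for this call.
  void regionServerReport(InetSocketAddress remote) {
    String key = remote.getAddress().getHostAddress() + ":" + remote.getPort();
    Long prev = lastReport.put(key, System.currentTimeMillis());
    if (prev == null) {
      // First report from this address -- only now treat it as a new server.
      System.out.println("new server: " + key);
    }
  }

  public static void main(String[] args) {
    HeartbeatTracker t = new HeartbeatTracker();
    t.regionServerReport(new InetSocketAddress("10.129.68.203", 60020));
  }
}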

St.Ack



> On Fri, Jan 29, 2010 at 11:19 AM, Karthik Ranganathan
> <kr...@facebook.com> wrote:
>> Yup totally - either name or ip would work. Not sure if there is a pro or a con to choosing either one - but thought it better to use the ip as that always remains the same (no resolve required) and is used to open the sockets.
>>
>> @jd-cryans: Saw your JIRA update: "One example of weirdness is when the region server is told which address to use according to the master:"
>>
>> Was meaning to ask about that too :) it's all good now.
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Joydeep Sarma [mailto:jsensarma@gmail.com]
>> Sent: Friday, January 29, 2010 11:01 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: Cannot locate root region
>>
>> hmmm .. if the master doesn't need the RS ip address at this point -
>> seems like it should be able to use the hostname offered by the RS
>> directly?
>>
>> On Fri, Jan 29, 2010 at 10:44 AM, Karthik Ranganathan
>> <kr...@facebook.com> wrote:
>>> The master does another lookup, independent of the region server, using the hostname given by the region server:
>>>
>>> ServerManager.java, regionServerReport() does:
>>>    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName()); // info.getServerName() is hostname
>>>
>>> Which eventually does:
>>>        HServerAddress.getHostname()
>>>
>>> HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address using a lookup.
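
To make the resolution cost concrete, a small standalone demo of the
java.net behavior in play here (example.org is just a placeholder host):

import java.net.InetSocketAddress;

public class ResolveDemo {
  public static void main(String[] args) {
    // This constructor attempts a DNS lookup immediately -- the per-call
    // resolution described above.
    InetSocketAddress eager = new InetSocketAddress("example.org", 60020);
    System.out.println("eager resolved: " + !eager.isUnresolved()); // true if DNS answers

    // This factory skips resolution entirely; no lookup happens until
    // something actually needs the IP.
    InetSocketAddress lazy = InetSocketAddress.createUnresolved("example.org", 60020);
    System.out.println("lazy resolved: " + !lazy.isUnresolved());   // always false
  }
}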
>>>
>>> Thanks
>>> Karthik
>>>
>>>
>>> -----Original Message-----
>>> From: Joydeep Sarma [mailto:jsensarma@gmail.com]
>>> Sent: Friday, January 29, 2010 9:46 AM
>>> To: hbase-dev@hadoop.apache.org
>>> Subject: Re: Cannot locate root region
>>>
>>> @Kannan - Karthik's mail said the reverse lookup happens in the RS
>>> (not the master). the master simply tried to match the offered
>>> hostname.
>>>
>> i don't know whose reading is right - but if it's the RS - i didn't
>>> understand why that wasn't just the reverse lookup done once at
>>> bootstrap time (which wouldn't be affected by ongoing DNS badness).
>>>
>>>
>>> On Fri, Jan 29, 2010 at 9:39 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> I just created https://issues.apache.org/jira/browse/HBASE-2174
>>>>
>>>> We handle addresses in different ways depending on which part of the
>>>> code you're in. We should correct that everywhere by implementing a
>>>> solution that also solves what you guys are seeing.
>>>>
>>>> J-D
>>>>
>>>> On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan
>>>> <Ka...@facebook.com> wrote:
>>>>> @Joy: The info stored in .META. for the various regions, as well as in the ephemeral nodes for region servers in zookeeper, is already IP address based. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.
>>>>>
>>>>> Regarding: <<< doesn't the reverse ip lookup happen just once at RS startup time?>>>, what seems to be happening is this:
>>>>>
>>>>> A regionServer periodically sends a regionServerReport (RPC call) to the master. An HServerInfo object is passed as an argument, and it identifies the sending region server in IP address format.
>>>>>
>>>>> The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport, it converts the IP-address-based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP-address-based string. This caused ServerManager.java to think that it was hearing from a new server, which led to:
>>>>>
>>>>>  HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>    if (storedInfo == null) {
>>>>>      if (LOG.isDebugEnabled()) {
>>>>>        LOG.debug("Received report from unknown server -- telling it " +   <<============
>>>>>          "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());  <<============
>>>>>      }
>>>>>
>>>>> and bad things down the road.
>>>>>
>>>>> The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.
>>>>>
>>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>>>
>>>>> This affected three of our test clusters at the same time!
>>>>>
>>>>> Perhaps all we need to do is to change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
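
A rough sketch of that suggestion, with hypothetical names: key the
registry by the IP-based fields the report already carries, so the
master's heartbeat path never consults DNS:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ServerRecord { /* load, last report time, etc. */ }

class IpKeyedServerMap {
  private final Map<String, ServerRecord> servers =
      new ConcurrentHashMap<String, ServerRecord>();

  // e.g. key("10.129.68.203", 60020, 1263605543210L)
  static String key(String ip, int port, long startCode) {
    return ip + "," + port + "," + startCode;
  }

  // Pure string lookup -- no resolution, so flaky DNS cannot change the key.
  ServerRecord lookup(String ip, int port, long startCode) {
    return servers.get(key(ip, port, startCode));
  }

  void register(String ip, int port, long startCode, ServerRecord rec) {
    servers.put(key(ip, port, startCode), rec);
  }
}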
>>>>>
>>>>> regards,
>>>>> Kannan
>>>>> ________________________________________
>>>>> From: Joydeep Sarma [jsensarma@gmail.com]
>>>>> Sent: Friday, January 29, 2010 1:20 AM
>>>>> To: hbase-dev@hadoop.apache.org
>>>>> Subject: Re: Cannot locate root region
>>>>>
>>>>> hadoop also uses the hostnames. if a host is multi-homed - its
>>>>> hostname is a better identifier (which still allows it to use
>>>>> different nics/ips for actual traffic). it can help in the case the
>>>>> cluster is migrated for example (all the ips change). one could have
>>>>> the same hostname resolve to different ips depending on who's doing
>>>>> the lookup (this happens in AWS where the same elastic hostname
>>>>> resolves to private or public ip depending on where the peer is. so
>>>>> clients can talk from outside AWS via public ips and master etc. can
>>>>> talk over private ips).
>>>>>
>>>>> so lots of reasons i guess. doesn't the reverse ip lookup happen just
>>>>> once at RS startup time? (wondering how this reconciles with the DNS being
>>>>> flaky after the cluster was up and running).
>>>>>
>>>>> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan
>>>>> <kr...@facebook.com> wrote:
>>>>>>
>>>>>> We did some more digging into this and here is the theory.
>>>>>>
>>>>>> 1. The regionservers use their local ip to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
>>>>>>
>>>>>> HRegionServer.java
>>>>>> reinitialize() -
>>>>>> this.serverInfo = new HServerInfo(new HServerAddress(
>>>>>>      new InetSocketAddress(address.getBindAddress(),
>>>>>>      this.server.getListenerAddress().getPort())), System.currentTimeMillis(),
>>>>>>      this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>>>>>>
>>>>>> In run() -
>>>>>> HMsg msgs[] = hbaseMaster.regionServerReport(
>>>>>>              serverInfo, outboundArray, getMostLoadedRegions());
>>>>>>
>>>>>>
>>>>>> 2. I have observed in the past that there can be DNS flakiness which causes the IP addresses of the machines to be returned as their hostnames. Guessing this is what happened.
>>>>>>
>>>>>>
>>>>>> 3. The HMaster looks in the map for the above IP address (masquerading as the server name). It does not find the entry in its map, so it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
>>>>>>
>>>>>>
>>>>>> 4. The region server that receives it is in fact already running (under its real hostname) and enters the "HMaster panic" mode and bad stuff happens.
>>>>>>
>>>>>> ServerManager.java in regionServerReport() -
>>>>>>    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>>    if (storedInfo == null) {
>>>>>>      // snip...
>>>>>>      return new HMsg[] {CALL_SERVER_STARTUP};
>>>>>>    }
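
A self-contained toy run of steps 1-4 (logic hypothetical, names taken
from the logs above): the registry is keyed by whatever string the
lookup returned at registration time, so a later report whose lookup
yields a different string for the same machine looks like an unknown
server:

import java.util.HashMap;
import java.util.Map;

public class FlakyDnsDemo {
  // Stand-in for info.getServerName(): normally the hostname, but under
  // DNS flakiness the lookup falls back to the raw IP string.
  static String serverName(String ip, boolean dnsHealthy) {
    return dnsHealthy ? "hbasetest004.ash1.facebook.com" : ip;
  }

  public static void main(String[] args) {
    Map<String, String> servers = new HashMap<String, String>();
    String ip = "10.129.68.203";

    servers.put(serverName(ip, true), "registered");  // step 1: keyed by hostname
    String reportKey = serverName(ip, false);         // step 2: DNS flakes, key is now the IP
    if (servers.get(reportKey) == null) {             // step 3: map miss
      // step 4: an already-running RS is told to restart
      System.out.println("unknown server -- telling it to CALL_SERVER_STARTUP: " + ip);
    }
  }
}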
>>>>>>
>>>>>>
>>>>>> Any reason why we use the hostname instead of the ip address in the map that stores the regionserver info?
>>>>>>
>>>>>> Thanks
>>>>>> Karthik
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Karthik Ranganathan [mailto:kranganathan@facebook.com]
>>>>>> Sent: Thursday, January 28, 2010 3:58 PM
>>>>>> To: hbase-dev@hadoop.apache.org
>>>>>> Subject: Cannot locate root region
>>>>>>
>>>>>> <snip>

Re: Cannot locate root region

Posted by Joydeep Sarma <js...@gmail.com>.
i meant even if we were using hostnames for RS registration (which i
think has a lot of advantages - not necessarily in our environment
though) - the master processing of the heartbeat (or whatever it's
processing) shouldn't require a forward lookup. if it needs the ip
address - it already has that via the connection object.

On Fri, Jan 29, 2010 at 11:19 AM, Karthik Ranganathan
<kr...@facebook.com> wrote:
> <snip>

Re: Cannot locate root region

Posted by Jean-Daniel Cryans <jd...@apache.org>.
It would be nice to log all these kinds of issues and all your findings
in the Jira to start the discussion up there.

J-D

On Fri, Jan 29, 2010 at 11:19 AM, Karthik Ranganathan
<kr...@facebook.com> wrote:
> <snip>

RE: Cannot locate root region

Posted by Karthik Ranganathan <kr...@facebook.com>.
Yup totally - either name or ip would work. Not sure if there is a pro or a con to choosing either one - but thought it better to use the ip as that always remains the same (no resolve required) and is used to open the sockets.
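
A small standalone illustration of that property with plain java.net
(the address is one of the IPs from the logs, purely for illustration):

import java.net.InetAddress;
import java.net.InetSocketAddress;

public class NoResolveDemo {
  public static void main(String[] args) throws Exception {
    // Building the address from raw bytes performs no DNS lookup at all,
    // and the result can be used directly to open a socket.
    InetAddress ip = InetAddress.getByAddress(new byte[] {10, (byte) 129, 68, (byte) 204});
    InetSocketAddress addr = new InetSocketAddress(ip, 60020);
    System.out.println(addr + " unresolved=" + addr.isUnresolved()); // unresolved=false
  }
}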

@jd-cryans: Saw your JIRA update: "One example of weirdness is when the region server is told which address to use according to the master:"

Was meaning to ask about that too :) it's all good now.

Thanks
Karthik


-----Original Message-----
From: Joydeep Sarma [mailto:jsensarma@gmail.com] 
Sent: Friday, January 29, 2010 11:01 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: Cannot locate root region

<snip>

Re: Cannot locate root region

Posted by Joydeep Sarma <js...@gmail.com>.
hmmm .. if the master doesn't need the RS ip address at this point -
seems like it should be able to use the hostname offered by the RS
directly?
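
a minimal sketch of that idea (hypothetical code, not from HBase - the
class and names here are made up): the master would key its map on
exactly the string the RS offered, and never re-resolve it, so DNS
flakiness on the master side couldn't change the key.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReportHandler {
  // Keyed by the exact string the region server offered, e.g.
  // "hostname,port,startcode". No lookup is ever done on it.
  private final Map<String, Long> lastReportMillis = new ConcurrentHashMap<>();

  void onReport(String offeredName) {
    Long prev = lastReportMillis.put(offeredName, System.currentTimeMillis());
    if (prev == null) {
      System.out.println("First report from " + offeredName);
    }
  }

  public static void main(String[] args) {
    ReportHandler h = new ReportHandler();
    h.onReport("hbasetest004.example.com,60020,1264706494600");
    h.onReport("hbasetest004.example.com,60020,1264706494600"); // same key, no lookup
  }
}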

On Fri, Jan 29, 2010 at 10:44 AM, Karthik Ranganathan
<kr...@facebook.com> wrote:
> The master does another lookup independent of the region server using the hostname given by the region server:
>
> ServerManager.java, regionServerReport() does:
>    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName()); // info.getServerName() is hostname
>
> Which eventually does:
>        HServerAddress.getHostname()
>
> HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address using a lookup.
>
> Thanks
> Karthik
>
>
> <snip>

RE: Cannot locate root region

Posted by Karthik Ranganathan <kr...@facebook.com>.
The master does another lookup independent of the region server using the hostname given by the region server:

ServerManager.java, regionServerReport() does:
    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName()); // info.getServerName() is hostname

Which eventually does:
	HServerAddress.getHostname()

HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address using a lookup.
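
As a small standalone illustration (plain JDK code, not HBase - the
hostname below is just a placeholder): constructing an InetSocketAddress
from a hostname does the DNS lookup right in the constructor, while
createUnresolved() defers it.

import java.net.InetSocketAddress;

public class ResolveDemo {
  public static void main(String[] args) {
    // The constructor tries to resolve the hostname immediately.
    InetSocketAddress eager = new InetSocketAddress("hbasetest004.example.com", 60020);
    System.out.println("eager unresolved? " + eager.isUnresolved());

    // createUnresolved() skips the lookup entirely.
    InetSocketAddress lazy =
        InetSocketAddress.createUnresolved("hbasetest004.example.com", 60020);
    System.out.println("lazy unresolved? " + lazy.isUnresolved()); // always true
  }
}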

Thanks
Karthik


-----Original Message-----
From: Joydeep Sarma [mailto:jsensarma@gmail.com] 
Sent: Friday, January 29, 2010 9:46 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: Cannot locate root region

@Kannan - Karthik's mail said the reverse lookup happens in the RS
(not the master). the master simply tried to match the offered
hostname.

i dont know whose reading is right - but if it's the RS - i didn't
understand why that wasn't just the reverse lookup done once at
bootstrap time (which wouldn't be affected by ongoing DNS badness).


<snip>

Re: Cannot locate root region

Posted by Joydeep Sarma <js...@gmail.com>.
@Kannan - Karthik's mail said the reverse lookup happens in the RS
(not the master). the master simply tried to match the offered
hostname.

I don't know whose reading is right - but if it's the RS - I didn't
understand why that wasn't just a reverse lookup done once at
bootstrap time (which wouldn't be affected by ongoing DNS badness).


On Fri, Jan 29, 2010 at 9:39 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> I just created https://issues.apache.org/jira/browse/HBASE-2174
>
> We handle addresses in different ways depending on which part of the
> code you're in. We should correct that everywhere by implementing a
> solution that also solves what you guys are seeing.
>
> J-D
>
> <snip>

Re: Cannot locate root region

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I just created https://issues.apache.org/jira/browse/HBASE-2174

We handle addresses in different ways depending on which part of the
code you're in. We should correct that everywhere by implementing a
solution that also solves what you guys are seeing.

J-D

On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan
<Ka...@facebook.com> wrote:
> @Joy: The info stored in .META. for various regions as well as in the ephemeral nodes for region servers in zookeeper are both already IP address based. So doesn't look like multi-homing and/or the other flexibilities you mention were a design goal as far as I can tell.
>
> Regarding: <<< doesn't the reverse ip lookup just once at RS startup time?>>>, what seems to be happening is this:
>
> A regionServer periodically sends a regionServerReport (RPC call) to the master. A HServerInfo argument is passed as an argument and it identifies the sending region server's identity in IP address format.
>
> The master, in ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time a master receives a regionServerReport it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness, it returned an IP address based string. And so, this caused ServerManager.java to think that it was hearing from a new server. And this lead to:
>
>  HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>    if (storedInfo == null) {
>      if (LOG.isDebugEnabled()) {
>        LOG.debug("Received report from unknown server -- telling it " +   <<============
>          "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());  <<============
>      }
>
> and bad things down the road.
>
> The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.
>
> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>
> This affected three of our test clusters at the same time!
>
> Perhaps all we need to do is to change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
>
> regards,
> Kannan
> <snip>

RE: Cannot locate root region

Posted by Kannan Muthukkaruppan <Ka...@facebook.com>.
@Joy: The info stored in .META. for the various regions, as well as in the ephemeral nodes for region servers in zookeeper, is already IP address based. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.

Regarding: <<< doesn't the reverse ip lookup just once at RS startup time?>>>, what seems to be happening is this:

A regionServer periodically sends a regionServerReport (RPC call) to the master. An HServerInfo argument is passed along with it, identifying the sending region server in IP address format.

The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport, it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP address based string. This caused ServerManager.java to think that it was hearing from a new server, and this led to:

    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
    if (storedInfo == null) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Received report from unknown server -- telling it " +   <<============
          "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());  <<============
      }

and bad things down the road.

The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.

2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210

This affected three of our test clusters at the same time!

Perhaps all we need to do is change the ServerManager's internal maps to be IP based? That way we avoid the master having to look up the hostname on every heartbeat.
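
A rough sketch of what that could look like (hypothetical, not an
actual patch - the class and names are invented): build the key from
the ip:port and startcode carried in the report itself, with no DNS
involvement at all.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IpKeyedServerMap {
  private final Map<String, Long> servers = new ConcurrentHashMap<>();

  // ip, port and startCode come straight off the regionServerReport RPC.
  static String key(String ip, int port, long startCode) {
    return ip + ":" + port + "," + startCode;
  }

  boolean isKnown(String ip, int port, long startCode) {
    return servers.containsKey(key(ip, port, startCode));
  }

  void register(String ip, int port, long startCode) {
    servers.put(key(ip, port, startCode), System.currentTimeMillis());
  }

  public static void main(String[] args) {
    IpKeyedServerMap m = new IpKeyedServerMap();
    m.register("10.129.68.203", 60020, 1263605543210L);
    // A later report from the same server hits the same key, flaky DNS or not.
    System.out.println(m.isKnown("10.129.68.203", 60020, 1263605543210L)); // true
  }
}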

regards,
Kannan
________________________________________
From: Joydeep Sarma [jsensarma@gmail.com]
Sent: Friday, January 29, 2010 1:20 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: Cannot locate root region

hadoop also uses the hostnames. if a host is multi-homed - it's
hostname is a better identifier (which still allows it to use
different nics/ips for actual traffic). it can help in the case the
cluster is migrated for example (all the ips change). one could have
the same hostname resolve to different ips depending on who's doing
the lookup (this happens in AWS where the same elastic hostname
resolves to private or public ip depending on where the peer is. so
clients can talk from outside AWS via public ips and master etc. can
talk over private ips).

so lots of reasons i guess. doesn't the reverse ip lookup just once at
RS startup time? (wondering how this reconciles with the  DNS being
flaky after the cluster was up and running).

<snip>

Re: Cannot locate root region

Posted by Joydeep Sarma <js...@gmail.com>.
hadoop also uses the hostnames. if a host is multi-homed - its
hostname is a better identifier (which still allows it to use
different nics/ips for actual traffic). it can help in the case the
cluster is migrated for example (all the ips change). one could have
the same hostname resolve to different ips depending on who's doing
the lookup (this happens in AWS where the same elastic hostname
resolves to private or public ip depending on where the peer is. so
clients can talk from outside AWS via public ips and master etc. can
talk over private ips).

so lots of reasons i guess. doesn't the reverse ip lookup just once at
RS startup time? (wondering how this reconciles with the  DNS being
flaky after the cluster was up and running).
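
A minimal sketch of the pattern I mean - hypothetical code using plain
InetAddress rather than HBase's actual resolution path, with made-up
class and field names:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class StartupNameSketch {
  // Resolved once at construction - analogous to a regionserver
  // computing its machineName once at startup and caching it.
  private final String machineName;

  public StartupNameSketch() throws UnknownHostException {
    InetAddress local = InetAddress.getLocalHost();
    // getCanonicalHostName() is best-effort: if the reverse (PTR)
    // lookup fails or times out, it returns the textual IP address
    // instead of throwing. One DNS hiccup here and the process
    // reports its IP as its "hostname" for its whole lifetime.
    this.machineName = local.getCanonicalHostName();
  }

  public String getMachineName() {
    return machineName; // cached, never re-resolved
  }
}

If the name really is resolved only once like this, flakiness at report
time shouldn't matter - only flakiness during that first lookup would.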

On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan
<kr...@facebook.com> wrote:
>
> We did some more digging into this and here is the theory.
>
> 1. The regionservers use their local IP to look up their hostnames and pass that to the HMaster. The HMaster then looks up the server info using this hostname as the key in a HashMap.
>
> HRegionServer.java
> reinitialize() -
> this.serverInfo = new HServerInfo(new HServerAddress(
>      new InetSocketAddress(address.getBindAddress(),
>      this.server.getListenerAddress().getPort())), System.currentTimeMillis(),
>      this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>
> In run() -
> HMsg msgs[] = hbaseMaster.regionServerReport(
>              serverInfo, outboundArray, getMostLoadedRegions());
>
>
> 2. I have observed in the past that DNS flakiness can cause the IP addresses of the machines to be returned as their hostnames. My guess is that this is what happened here.
>
>
> 3. The HMaster looks up the above IP address (masquerading as the server name) in its map and does not find an entry. So it assumes this is a new region server and issues a CALL_SERVER_STARTUP.
>
>
> 4. The region server that receives it is in fact already running (under its real hostname); it enters the "HMaster panic" mode and bad stuff happens.
>
> ServerManager.java in regionServerReport() -
>    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>    if (storedInfo == null) {
>      // snip...
>      return new HMsg[] {CALL_SERVER_STARTUP};
>    }
>
>
> Any reason why we use the hostname instead of the IP address as the key in the map that stores the regionserver info?
>
> Thanks
> Karthik
>
>
> -----Original Message-----
> From: Karthik Ranganathan [mailto:kranganathan@facebook.com]
> Sent: Thursday, January 28, 2010 3:58 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Cannot locate root region
>
> Hey guys,
>
> Ran into some issues while testing and wanted to understand better what happened. Got the following exception when I went to the web UI:
>
> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
> Exceptions:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
>
> From a program that reads from an HBase table:
> java.lang.reflect.UndeclaredThrowableException
>        at $Proxy1.getRegionInfo(Unknown Source)
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
> <snip>
>
>
> Followed up in the HMaster's log:
>
> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
> org.apache.hadoop.hbase.TableExistsException: test1
>        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> From a regionserver's logs:
>
> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>
>
> The code says the following:
>              case MSG_CALL_SERVER_STARTUP:
>                // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
>                // get it when the master is panicking because for instance
>                // the HDFS has been yanked out from under it.  Be wary of
>                // this message.
>
> Any ideas on what is going on? The best I can come up with is flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?
>
>
> Thanks
> Karthik
>
>

RE: Cannot locate root region

Posted by Karthik Ranganathan <kr...@facebook.com>.
We did some more digging into this and here is the theory. 

1. The regionservers use their local IP to look up their hostnames and pass that to the HMaster. The HMaster then looks up the server info using this hostname as the key in a HashMap.

HRegionServer.java
reinitialize() -
this.serverInfo = new HServerInfo(new HServerAddress(
      new InetSocketAddress(address.getBindAddress(),
      this.server.getListenerAddress().getPort())), System.currentTimeMillis(),
      this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);

In run() -
HMsg msgs[] = hbaseMaster.regionServerReport(
              serverInfo, outboundArray, getMostLoadedRegions());


2. I have observed in the past that DNS flakiness can cause the IP addresses of the machines to be returned as their hostnames. My guess is that this is what happened here (a cheap way to surface this case is sketched after the code below).


3. The HMaster looks up the above IP address (masquerading as the server name) in its map and does not find an entry. So it assumes this is a new region server and issues a CALL_SERVER_STARTUP.


4. The region server that receives it is in fact already running (under its real hostname); it enters the "HMaster panic" mode and bad stuff happens.

ServerManager.java in regionServerReport() -
    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
    if (storedInfo == null) {
      // snip...
      return new HMsg[] {CALL_SERVER_STARTUP};
    }
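
If the theory is right, a cheap way to at least make the failure
visible would be something like the following - purely hypothetical
code, not anything in HBase today - flagging the case where the name we
are about to report is really just an IP literal:

import java.util.regex.Pattern;

public class HostnameSanity {
  // Matches a dotted-quad IPv4 literal such as 10.129.68.203.
  private static final Pattern IPV4 =
      Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

  // Returns the name unchanged, but warns when reverse DNS has handed
  // back an IP literal instead of a real hostname.
  static String checkedHostname(String name) {
    if (IPV4.matcher(name).matches()) {
      System.err.println("WARN: reverse DNS returned an IP literal ("
          + name + ") instead of a hostname; the master would treat"
          + " this as an unknown server");
    }
    return name;
  }
}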


Any reason why we use the hostname instead of the IP address as the key in the map that stores the regionserver info?
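
Roughly what I have in mind - a sketch only, reusing the field and
method names from the snippets above, untested:

// Hypothetical variant of the lookup in regionServerReport(): key the
// map by the bind address, which does not depend on reverse DNS.
String key = info.getServerAddress().getBindAddress() + ":"
    + info.getServerAddress().getPort();
HServerInfo storedInfo = serversToServerInfo.get(key);
if (storedInfo == null) {
  // genuinely unknown server
  return new HMsg[] {CALL_SERVER_STARTUP};
}

With that, a flaky reverse lookup could not make an already-registered
server look new to the master.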

Thanks
Karthik


-----Original Message-----
From: Karthik Ranganathan [mailto:kranganathan@facebook.com] 
Sent: Thursday, January 28, 2010 3:58 PM
To: hbase-dev@hadoop.apache.org
Subject: Cannot locate root region

Hey guys,

Ran into some issues while testing and wanted to understand better what happened. Got the following exception when I went to the web UI:

Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)


From a program that reads from an HBase table:
java.lang.reflect.UndeclaredThrowableException
        at $Proxy1.getRegionInfo(Unknown Source)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
<snip>


Followed up in the HMaster's log:

2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
org.apache.hadoop.hbase.TableExistsException: test1
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

From a regionserver's logs:

2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP


The code says the following:
              case MSG_CALL_SERVER_STARTUP:
                // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
                // get it when the master is panicking because for instance
                // the HDFS has been yanked out from under it.  Be wary of
                // this message.

Any ideas on what is going on? The best I can come up with is flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?


Thanks
Karthik