Posted to dev@hbase.apache.org by James Kennedy <ja...@troove.net> on 2011/01/21 08:28:27 UTC

HMaster won't die waiting for RegionServer that is already dead

I've come across a strange bug that I'm having trouble debugging.
Basically, I have a seed application that is executed via Maven and runs a single-JVM ApplicationStarter, which starts up HDFS, RegionServer, and HMaster threads. It does some seeding, then shuts those down in reverse order.

So this isn't a typical way of running HBase, to be sure. However, it had always worked until I upgraded to HBase 0.90.0.
I didn't notice it when I was originally testing 0.90.0 because it only seems to happen on our EC2.small build server node when I run this particular seeder.

Running the same thing locally on my Mac works.

Attached is the error output, starting from when HRegionServer.stop() is called to when HMaster.shutdown() is called and starts looping forever in letRegionServersShutdown().

It looks like RegionServerTracker gets as far as "RegionServer ephemeral node deleted, processing expiration", but then, because it can't get the HServerInfo, it doesn't follow through with actually expiring it.

Does anyone have any ideas as to why this might be happening?



Thanks,

James Kennedy
Project Manager
Troove Inc.

Re: HMaster won't die waiting for RegionServer that is already dead

Posted by Stack <st...@duboce.net>.

Write it up, James. Others will probably trip on it too.
Good stuff,
St.Ack

On Fri, Jan 21, 2011 at 4:44 PM, James Kennedy <ja...@troove.net> wrote:
> Aha that stupid dot!
>
> My /etc/hosts file looks pretty standard:
>
> 127.0.0.1 localhost
>
> ::1     ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
> However look what I found in the data-seed-specific hbase-site.xml
>
> <property>
>        <name>hbase.master.dns.interface</name>
>        <value>lo</value>
> </property>
> <property>
>       <name>hbase.regionserver.dns.interface</name>
>       <value>lo</value>
> </property>
>
> Not sure why we had that in there originally but taking it out fixes the problem. Both sides now resolve hregioninfo to "localhost" instead of "localhost.".  I have no idea how specifying the lo interface adds a period to the localhost name but that sounds like a bug to me. Shall I report it or is this a known issue?
>
> Thanks for your help,
>
> James Kennedy
> Project Manager
> Troove Inc.

Re: HMaster won't die waiting for RegionServer that is already dead

Posted by James Kennedy <ja...@troove.net>.

Aha that stupid dot!

My /etc/hosts file looks pretty standard:

127.0.0.1 localhost

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

However, look what I found in the data-seed-specific hbase-site.xml:

<property>
        <name>hbase.master.dns.interface</name>
        <value>lo</value>
</property>
<property>
        <name>hbase.regionserver.dns.interface</name>
        <value>lo</value>
</property>

I'm not sure why we had that in there originally, but taking it out fixes the problem. Both sides now resolve hregioninfo to "localhost" instead of "localhost.". I have no idea how specifying the lo interface adds a period to the localhost name, but that sounds like a bug to me. Shall I report it, or is this a known issue?
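
One possible explanation, which is an assumption and not something confirmed in this thread: a reverse DNS lookup on the named interface may return the host in fully qualified form, which ends in a root dot ("localhost."). A minimal sketch of a defensive normalization follows. The class and method names are hypothetical and are not part of HBase; the "host,port,startcode" format is taken from the log lines quoted in this thread.

```java
public class HostNameNormalizer {

    /** Strips a single trailing root dot, so "localhost." becomes "localhost". */
    static String stripTrailingDot(String host) {
        if (host != null && host.length() > 1 && host.endsWith(".")) {
            return host.substring(0, host.length() - 1);
        }
        return host;
    }

    /** Builds a "host,port,startcode" string (hypothetical helper, not an HBase API). */
    static String serverName(String host, int port, long startCode) {
        return stripTrailingDot(host) + "," + port + "," + startCode;
    }

    public static void main(String[] args) {
        // The two spellings of the host that appear in the logs.
        String withDot = serverName("localhost.", 60020, 1295592845214L);
        String withoutDot = serverName("localhost", 60020, 1295592845214L);

        // After normalization both sides agree on the same name.
        System.out.println(withDot.equals(withoutDot)); // prints "true"
    }
}
```

Normalizing at the point where the name is built would keep both sides consistent regardless of which lookup path produced the host name.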

Thanks for your help,

James Kennedy
Project Manager
Troove Inc.

On 2011-01-21, at 1:34 PM, Jean-Daniel Cryans wrote:

> There's some sort of mismatch:
> 
> RegionServer ephemeral node deleted, processing expiration
> [localhost.,60020,1295592845214]
> 
> and
> 
> Waiting on regionserver(s) to go down localhost,60020,1295592845214
> 
> 
> Do you see the dot after "localhost" in the first line? I wonder how
> it got different in the znode and in ServerManager.onlineServers... In
> any case, I'm pretty sure you can get it working by playing with your
> /etc/hosts
> 
> J-D

Re: HMaster won't die waiting for RegionServer that is already dead

Posted by Jean-Daniel Cryans <jd...@apache.org>.

There's some sort of mismatch:

RegionServer ephemeral node deleted, processing expiration
[localhost.,60020,1295592845214]

and

Waiting on regionserver(s) to go down localhost,60020,1295592845214


Do you see the dot after "localhost" in the first line? I wonder how
it ended up different in the znode and in ServerManager.onlineServers... In
any case, I'm pretty sure you can get it working by playing with your
/etc/hosts.
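
To illustrate why a single extra dot would be enough to stall the shutdown, here is a minimal sketch. It assumes the master keeps its online servers in a map keyed by the exact server-name string; the class, the method, and the map's value type are hypothetical stand-ins, not the actual ServerManager code.

```java
import java.util.HashMap;
import java.util.Map;

public class ServerNameMismatch {

    /** Returns true only if the expired name matched a registered server and removed it. */
    static boolean expire(Map<String, String> onlineServers, String expiredName) {
        return onlineServers.remove(expiredName) != null;
    }

    public static void main(String[] args) {
        // Name under which the server is tracked as online (no trailing dot).
        Map<String, String> onlineServers = new HashMap<>();
        onlineServers.put("localhost,60020,1295592845214", "server info placeholder");

        // Name reported when the ephemeral node is deleted (trailing dot on the host).
        boolean expired = expire(onlineServers, "localhost.,60020,1295592845214");

        // The lookup misses, so the server is never removed and a loop that
        // waits for the map to empty would never finish.
        System.out.println("expired=" + expired + ", still online=" + onlineServers.size());
        // prints "expired=false, still online=1"
    }
}
```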

J-D

On Thu, Jan 20, 2011 at 11:28 PM, James Kennedy
<ja...@troove.net> wrote:
> I've come across a strange bug that I'm having trouble debugging.
> Basically I have a seed application that is executed via maven and runs a
> single JVM ApplicationStarter that starts up hdfs, regionserver, hmaster
> threads. It does some seeding then shuts those down in reverse order.
> So this isn't a typical way of running hbase to be sure. However it has
> always worked until I upgraded to HBase 0.90.0.
> I didn't notice it when I was originally testing 0.90.0 because it only
> seems to be happening on our EC2.small build server node when I run this
> particular seeder.
> Running the same thing locally on my mac works.
> Attached is the error output starting from when the HRegionServer.stop() is
> called to when HMaster.shutdown() is called and it starts looping forever in
> letRegionServersShutdown().
> It looks like RegionServerTracker is getting to "RegionServer ephemeral node
> deleted, processing expiration" but then because it can't get the
> HServerInfo it doesn't follow-through with actually expiring it.
> Does anyone have any ideas as to why this might be happening?
>
>
> Thanks,
> James Kennedy
> Project Manager
> Troove Inc.
>
>