Posted to user@hbase.apache.org by Yi Liang <wh...@gmail.com> on 2011/02/18 09:36:15 UTC

Not running balancer because processing dead regionserver(s)

Hi all,

We have a hbase cluster with 10 region servers running HBase 0.90.0 + CDH3.
We're now importing big data into HBase.

During the process, 2 servers crashed, but after restarting them, they're no
longer assigned any regions, while regions on other servers keep
splitting as more data is inserted.

From the master log, we can see periodic messages like:

2011-02-18 16:09:35,067 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
running balancer because processing dead regionserver(s):
[zcl.local,60020,1297996817352, qics.local,60020,1297919358488,
Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
zcl.local,60020,1297919367472]

zcl.local and qics.local are the machines we have restarted; the other 2
machines have kept running without restarting and are actually still serving
regions.
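
For readers trying to make sense of that list: each entry appears to be a
host,port,startcode triple, and zcl.local shows up twice with different
trailing numbers. The Java sketch below is a hypothetical helper (the class
DeadServerNames and its methods are made up for illustration and are not part
of HBase) that splits the logged list and groups the start codes by host. It
assumes the trailing number is the region server's start time in epoch
milliseconds.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper, not part of HBase: splits the bracketed list from the
// "Not running balancer" log line into host / port / startcode triples.
final class DeadServerNames {

    static final class Entry {
        final String host;
        final int port;
        final long startCode;

        Entry(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }
    }

    // Parses "[host,port,startcode, host,port,startcode, ...]".
    static List<Entry> parse(String bracketedList) {
        String body = bracketedList.trim();
        if (body.startsWith("[")) body = body.substring(1);
        if (body.endsWith("]")) body = body.substring(0, body.length() - 1);
        String[] tokens = body.split(",");
        if (tokens.length % 3 != 0) {
            throw new IllegalArgumentException("expected host,port,startcode triples");
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += 3) {
            entries.add(new Entry(tokens[i].trim(),
                    Integer.parseInt(tokens[i + 1].trim()),
                    Long.parseLong(tokens[i + 2].trim())));
        }
        return entries;
    }

    // Groups start codes by host, keeping the order in which hosts were logged.
    static Map<String, List<Long>> startCodesByHost(List<Entry> entries) {
        Map<String, List<Long>> byHost = new LinkedHashMap<>();
        for (Entry e : entries) {
            byHost.computeIfAbsent(e.host, h -> new ArrayList<>()).add(e.startCode);
        }
        return byHost;
    }

    public static void main(String[] args) {
        String logged = "[zcl.local,60020,1297996817352, qics.local,60020,1297919358488, "
                + "Docete.local,60020,1297919410096, liym.local,60020,1297919445796, "
                + "zcl.local,60020,1297919367472]";
        startCodesByHost(parse(logged)).forEach((host, codes) -> {
            for (long code : codes) {
                // Assumption: the startcode is epoch milliseconds.
                System.out.println(host + " started " + Instant.ofEpochMilli(code));
            }
        });
    }
}
```

Read this way, the list holds five entries for four hostnames, which would be
consistent with the "5 dead" figure in the shell status.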

From the shell status:
10 servers, 5 dead, 10.1000 average Load

Why are there dead servers, and how can we clear them so that the balancer
can start?

Thanks,
Yi

Re: Not running balancer because processing dead regionserver(s)

Posted by Yi Liang <wh...@gmail.com>.

Thank you, Stack!

On Wed, Feb 23, 2011 at 6:25 AM, Stack <st...@duboce.net> wrote:

> On Mon, Feb 21, 2011 at 10:04 PM, Yi Liang <wh...@gmail.com> wrote:
> > Yes, the server zcl crashed at that time.
> >
> > But after I restarted it later, it's still in the dead server list.
> >
>
> We failed processing its death:
>
> 2011-02-18 10:08:14,873 ERROR org.apache.hadoop.hbase.HServerAddress:
> Could not resolve the DNS name of zcl.local:60020
> 2011-02-18 10:08:14,874 ERROR
> org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
> processing event M_SERVER_SHUTDOWN
> java.lang.IllegalArgumentException: Could not resolve the DNS name of
> zcl.local:60020
>        at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
>        at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
>        at org.apache.hadoop.hbase.catalog.MetaReader.metaRowToRegionPairWithInfo(MetaReader.java:407)
>        at org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:594)
>        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:124)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
> It looks like the above exception caused us to jump out of the
> processing of the server shutdown.  Above is related to the no route
> to host.
>
> I filed HBASE-3556.  It'll be 'fixed' by HBASE-1501 but we should
> never just give up processing.  Need to look into that.
>
> While a server is in the dead servers list, we'll not run the
> balancer.  The dead servers list is an in-memory list.  You'd need to
> kill the master and bring it back up again to rid the dead server
> state.
>
> St.Ack

Re: Not running balancer because processing dead regionserver(s)

Posted by Stack <st...@duboce.net>.

On Mon, Feb 21, 2011 at 10:04 PM, Yi Liang <wh...@gmail.com> wrote:
> Yes, the server zcl crashed at that time.
>
> But after I restarted it later, it's still in the dead server list.
>

We failed processing its death:

2011-02-18 10:08:14,873 ERROR org.apache.hadoop.hbase.HServerAddress:
Could not resolve the DNS name of zcl.local:60020
2011-02-18 10:08:14,874 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.IllegalArgumentException: Could not resolve the DNS name of
zcl.local:60020
        at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
        at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
        at org.apache.hadoop.hbase.catalog.MetaReader.metaRowToRegionPairWithInfo(MetaReader.java:407)
        at org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:594)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:124)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

It looks like the above exception caused us to jump out of the
processing of the server shutdown.  Above is related to the no route
to host.
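
To check from the master host whether a name such as zcl.local resolves at
all, something like the following standalone Java snippet could be used. It is
only a sketch: ResolveCheck is a made-up class, and it relies on the JDK's
InetSocketAddress rather than on whatever HServerAddress does internally,
which may differ.

```java
import java.net.InetSocketAddress;

// Standalone check, not HBase code: reports whether host:port resolves
// from the machine where this is run.
final class ResolveCheck {

    static boolean resolves(String host, int port) {
        // The constructor attempts a lookup; isUnresolved() is true if it failed.
        return !new InetSocketAddress(host, port).isUnresolved();
    }

    public static void main(String[] args) {
        // Defaults are the host and port from the error message in this thread.
        String host = args.length > 0 ? args[0] : "zcl.local";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 60020;
        System.out.println(host + ":" + port
                + (resolves(host, port) ? " resolves" : " does NOT resolve"));
    }
}
```

Running it on the master with the region server's hostname as the first
argument shows whether name resolution is failing on that machine.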

I filed HBASE-3556.  It'll be 'fixed' by HBASE-1501 but we should
never just give up processing.  Need to look into that.

While a server is in the dead servers list, we'll not run the
balancer.  The dead servers list is an in-memory list.  You'd need to
kill the master and bring it back up again to rid the dead server
state.
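
The behaviour described above can be pictured with a toy model. The Java
sketch below is not HBase's code: the class DeadServerSketch and all of its
methods are hypothetical, and it only mimics the shape of the problem, where
an entry is cleared from an in-memory set when shutdown processing completes
and the balancer is skipped while the set is non-empty.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model with hypothetical names; it does not reproduce HBase internals.
final class DeadServerSketch {

    // In-memory only: a fresh instance (a restarted master) starts empty.
    private final Set<String> deadServers = new LinkedHashSet<>();

    void expire(String serverName) {
        deadServers.add(serverName);
    }

    // The entry is removed only if processing runs to completion.
    void processShutdown(String serverName, boolean dnsResolves) {
        if (!dnsResolves) {
            throw new IllegalArgumentException(
                    "Could not resolve the DNS name of " + serverName);
        }
        deadServers.remove(serverName);
    }

    // Stand-in for an executor that logs the throwable and moves on.
    void runHandler(String serverName, boolean dnsResolves) {
        try {
            processShutdown(serverName, dnsResolves);
        } catch (Throwable t) {
            System.out.println("Caught throwable while processing shutdown: "
                    + t.getMessage());
        }
    }

    boolean balancerMayRun() {
        return deadServers.isEmpty();
    }

    public static void main(String[] args) {
        DeadServerSketch master = new DeadServerSketch();
        master.expire("zcl.local,60020,1297919367472");
        master.runHandler("zcl.local,60020,1297919367472", false);
        System.out.println("balancer may run: " + master.balancerMayRun());

        // Restarting the master corresponds to starting from a new, empty set.
        DeadServerSketch restarted = new DeadServerSketch();
        System.out.println("after restart: " + restarted.balancerMayRun());
    }
}
```

In this model nothing retries the failed handler, so the stale entry stays in
the set until the object itself is discarded.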

St.Ack


> 2011-02-18 10:39:26,895 INFO org.apache.hadoop.hbase.master.ServerManager:
> Registering server=zcl.local,60020,1297996817352, regionCount=0,
> userLoad=false
> 2011-02-18 10:39:35,062 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
> running balancer because processing dead regionserver(s):
> [Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
> zcl.local,60020,1297919367472]

Re: Not running balancer because processing dead regionserver(s)

Posted by Yi Liang <wh...@gmail.com>.

Yes, the server zcl crashed at that time.

But after I restarted it later, it's still in the dead server list.

2011-02-18 10:39:26,895 INFO org.apache.hadoop.hbase.master.ServerManager:
Registering server=zcl.local,60020,1297996817352, regionCount=0,
userLoad=false
2011-02-18 10:39:35,062 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
running balancer because processing dead regionserver(s):
[Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
zcl.local,60020,1297919367472]

On Tue, Feb 22, 2011 at 1:48 AM, Ted Yu <yu...@gmail.com> wrote:

> Looks like there was a connectivity issue:
>
> java.net.NoRouteToHostException: No route to host

Re: Not running balancer because processing dead regionserver(s)

Posted by Ted Yu <yu...@gmail.com>.

Looks like there was a connectivity issue:

java.net.NoRouteToHostException: No route to host

On Sun, Feb 20, 2011 at 10:09 PM, Yi Liang <wh...@gmail.com> wrote:

> The related log is at: http://pastebin.com/0a1CjDUD
>
> It's ok now after restarting hbase, but still curious why it happened.
>
> Thanks,
> Yi

Re: Not running balancer because processing dead regionserver(s)

Posted by Yi Liang <wh...@gmail.com>.

The related log is at: http://pastebin.com/0a1CjDUD

It's ok now after restarting hbase, but I'm still curious why it happened.

Thanks,
Yi
On Sat, Feb 19, 2011 at 3:58 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> The master should finish processing those dead servers at some point
> and it seems it's not happening? Unfortunately without the log nobody
> can tell why. If you can post the complete log in pastebin or put it
> on a web server then we could take a look.
>
> J-D

Re: Not running balancer because processing dead regionserver(s)

Posted by Jean-Daniel Cryans <jd...@apache.org>.

The master should finish processing those dead servers at some point
and it seems it's not happening? Unfortunately without the log nobody
can tell why. If you can post the complete log in pastebin or put it
on a web server then we could take a look.

J-D

On Fri, Feb 18, 2011 at 12:39 AM, Yi Liang <wh...@gmail.com> wrote:
> Hi all,
>
> We have a hbase cluster with 10 region servers running HBase 0.90.0 + CDH3.
> We're now importing big data into HBase.
>
> During the process, 2 servers crashed, but after restarting them, they're no
> longer assigned any regions, while regions on other servers keep
> splitting as more data is inserted.
>
> From the master log, we can see periodic messages like:
>
> 2011-02-18 16:09:35,067 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
> running balancer because processing dead regionserver(s):
> [zcl.local,60020,1297996817352, qics.local,60020,1297919358488,
> Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
> zcl.local,60020,1297919367472]
>
> zcl.local and qics.local are the machines we have restarted; the other 2
> machines have kept running without restarting and are actually still serving
> regions.
>
> From the shell status:
> 10 servers, 5 dead, 10.1000 average Load
>
> Why are there dead servers, and how can we clear them so that the balancer
> can start?
>
> Thanks,
> Yi
>
