Posted to user@hbase.apache.org by Michael Dagaev <mi...@gmail.com> on 2009/03/26 12:20:50 UTC

NotServingRegionException as indication of cluster overloading

Hi, all

    It looks like NotServingRegionException starts to occur on region servers
when the cluster gets overloaded with too many regions. Is that correct?

Thank you for your cooperation,
M.

Re: NotServingRegionException as indication of cluster overloading

Posted by Andrew Purtell <ap...@apache.org>.
If, next time, just restarting the region server clears the
problem, then you will know the DFSClient block-location-cache
issue is the cause.

If you can, run with xceivers at 4096; that was the last setting
I used successfully to achieve cluster (re)starts and sustain
high scanner load with very large tables.

Best regards,

   - Andy

> From: Michael Dagaev
> Subject: Re: NotServingRegionException as indication of cluster overloading
> To: hbase-user@hadoop.apache.org, apurtell@apache.org
> Date: Thursday, March 26, 2009, 11:17 AM
>
> Andrew, thank you for the detailed answer.
> 
> That is what I am doing now. I increased the number of
> xceivers to 2048 and when I see a lot NSRE I add more servers.
> 
> Now the problem was  that NSRE exceptions occur after
> adding more servers. Just after some voodoo (restarting region
> servers and restarting cluster) it seems to be working now.
> 
> M.

Re: NotServingRegionException as indication of cluster overloading

Posted by Michael Dagaev <mi...@gmail.com>.
Andrew, thank you for the detailed answer.

That is what I am doing now. I increased the number of xceivers to 2048,
and when I see a lot of NSREs I add more servers.

The problem was that the NSREs kept occurring even after adding more
servers. Only after some voodoo (restarting the region servers and then
restarting the cluster) does it seem to be working now.

M.

On Thu, Mar 26, 2009 at 7:36 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> chooseDataNode cannot find a live block, this is the root cause
> of the NSREs.
>
> I found that as the number of regions I deployed onto my fixed
> size cluster increased, I had to either
>    1) steadily increase the number of xceivers and handlers
>       in the DFS data node config, with the xceiver setting
>       having dominant effect;
> or
>    2) add more servers to act as DataNodes and RegionServers
>       to spread the load -- so fixed size cannot really be
>       fixed :-) but is load dependent, of course
>
> In general you have to do #2 with this type of technology. That
> said, there are some known inefficiencies that contribute to
> the problem. For more information on that, please see
> HADOOP-3856.
>
> There is an additional wrinkle here in that I too found that
> sometimes the DFSClient of a region server would get into a
> state where it could not see the block locations that every
> other client could. What I found worked was to shut down the
> regionserver and restart it. Because of this issue I filed
> HBASE-1084. This approaches the problem of bad state in the
> DFSClient through a wrapper approach. A better solution, but
> one that would involve DFS surgery, is either to fix the
> underlying cause (least likely to be a quick solution) or
> expose a DFSClient API for flushing the block location cache
> (which possibly could be achieved and accepted more quickly).
> Even so, this type of aberrant behavior seemed to happen only
> under load.
>
> Additionally, I have been considering deploying a Lustre (or
> similar) based distributed FS underneath a HBase deployment
> instead of DFS. My conjecture is something like this can
> sustain higher loads for the same number of nodes than an
> equivalent HBase + HDFS deployment. Someday I hope to be able
> to try this experiment because the results either way would
> be most informative. But to be credible, any alternative to
> HDFS would have to handle appends. See HADOOP-1700 and
> HADOOP-4379 for more information there.
>
> Best regards,
>
>   - Andy

Re: NotServingRegionException as indication of cluster overloading

Posted by Andrew Purtell <ap...@apache.org>.
chooseDataNode cannot find a live block; this is the root cause
of the NSREs.

I found that as the number of regions I deployed onto my fixed
size cluster increased, I had to either 
    1) steadily increase the number of xceivers and handlers
       in the DFS data node config, with the xceiver setting
       having dominant effect;
or
    2) add more servers to act as DataNodes and RegionServers
       to spread the load -- so fixed size cannot really be
       fixed :-) but is load dependent, of course

In general you have to do #2 with this type of technology. That
said, there are some known inefficiencies that contribute to
the problem. For more information on that, please see
HADOOP-3856.

There is an additional wrinkle here in that I too found that
sometimes the DFSClient of a region server would get into a 
state where it could not see the block locations that every
other client could. What I found worked was to shut down the
regionserver and restart it. Because of this issue I filed
HBASE-1084. This approaches the problem of bad state in the
DFSClient through a wrapper approach. A better solution, but
one that would involve DFS surgery, is either to fix the
underlying cause (least likely to be a quick solution) or
expose a DFSClient API for flushing the block location cache
(which possibly could be achieved and accepted more quickly).
Even so, this type of aberrant behavior seemed to happen only
under load. 
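The workaround described above -- bouncing only the affected regionserver -- does not require taking down the whole cluster. Hypothetically, on the affected host, using the daemon scripts shipped with the HBase distribution of that era:

```shell
# On the affected regionserver host; paths assume the standard
# HBase layout. The master reassigns its regions while it is
# down, and the fresh DFSClient starts with an empty block
# location cache when it rejoins.
./bin/hbase-daemon.sh stop regionserver
./bin/hbase-daemon.sh start regionserver
```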

Additionally, I have been considering deploying a Lustre (or
similar) based distributed FS underneath a HBase deployment
instead of DFS. My conjecture is something like this can
sustain higher loads for the same number of nodes than an
equivalent HBase + HDFS deployment. Someday I hope to be able
to try this experiment because the results either way would
be most informative. But to be credible, any alternative to
HDFS would have to handle appends. See HADOOP-1700 and 
HADOOP-4379 for more information there. 

Best regards,

   - Andy

> From: Michael Dagaev <mi...@gmail.com>
> Subject: Re: NotServingRegionException as indication of cluster overloading
> To: hbase-user@hadoop.apache.org
> Date: Thursday, March 26, 2009, 6:35 AM
> Stack
> 
>        Currenlty, one region server of 7 throws a lot of
> NotServingRegionException. No map reduce jobs and no CPU
> starvation on
> this host. There also a lot of IO exceptions like that:
> 
> java.io.IOException: Could not obtain block:
> blk_-3762232304446475286_1462869
> file=/hbase/ENTITIES/267503732/oldlogfile.log
>         at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
>         at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
>         at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
>         at
> java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at
> java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1453)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>         at
> org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:303)
>         at
> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:233)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1728)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:469)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:911)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:883)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:823)
>         at java.lang.Thread.run(Thread.java:619)
> 
> Thank you for your cooperation,
> M.

Re: NotServingRegionException as indication of cluster overloading

Posted by Michael Dagaev <mi...@gmail.com>.
See below.

On Thu, Mar 26, 2009 at 4:18 PM, stack <st...@duboce.net> wrote:
> So, does that region fail to deploy?  Does it ever come on line?

Do you mean a region or a region server?

> Can you download that file successfully using hadoop command-line:
>
> ./bin/hadoop fs -get /hbase/ENTITIES/267503732/oldlogfile.log .

No. The command prints out "get: null"
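A "get: null" here usually means the NameNode has no live replicas for at least one block of the file. One way to confirm which blocks are missing and where the surviving replicas live, assuming the fsck options of that Hadoop vintage:

```shell
# Ask the NameNode to report the file's blocks, their
# replication state, and the DataNodes holding them.
./bin/hadoop fsck /hbase/ENTITIES/267503732/oldlogfile.log \
    -files -blocks -locations
```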

> Do you have the xceivers' bumped up on your cluster and the dfsclient
> timeout set to zero?

Yes. xceivers = 2048 and dfsclient timeout = 0

Thank you for your cooperation,
M.

Re: NotServingRegionException as indication of cluster overloading

Posted by stack <st...@duboce.net>.
So, does that region fail to deploy?  Does it ever come on line?

Can you download that file successfully using hadoop command-line:

./bin/hadoop fs -get /hbase/ENTITIES/267503732/oldlogfile.log .

Do you have the xceivers' bumped up on your cluster and the dfsclient
timeout set to zero?

St.Ack


On Thu, Mar 26, 2009 at 2:35 PM, Michael Dagaev <mi...@gmail.com>wrote:

> Stack
>
>       Currenlty, one region server of 7 throws a lot of
> NotServingRegionException. No map reduce jobs and no CPU starvation on
> this host. There also a lot of IO exceptions like that:
>
> java.io.IOException: Could not obtain block:
> blk_-3762232304446475286_1462869
> file=/hbase/ENTITIES/267503732/oldlogfile.log
>        at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
>        at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
>        at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
>        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>        at java.io.DataInputStream.readFully(DataInputStream.java:152)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1453)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>        at
> org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:303)
>        at
> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:233)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1728)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:469)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:911)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:883)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:823)
>        at java.lang.Thread.run(Thread.java:619)
>
> Thank you for your cooperation,
> M.

Re: NotServingRegionException as indication of cluster overloading

Posted by Michael Dagaev <mi...@gmail.com>.
Stack

       Currently, one region server of the 7 throws a lot of
NotServingRegionException. There are no map reduce jobs and no CPU
starvation on this host. There are also a lot of IO exceptions like this one:

java.io.IOException: Could not obtain block:
blk_-3762232304446475286_1462869
file=/hbase/ENTITIES/267503732/oldlogfile.log
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1453)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
        at org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:303)
        at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:233)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1728)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:469)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:911)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:883)
        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:823)
        at java.lang.Thread.run(Thread.java:619)

Thank you for your cooperation,
M.

On Thu, Mar 26, 2009 at 2:51 PM, stack <st...@duboce.net> wrote:
> Michael:
>
> What is happening on your cluster?  Are you doing any mapreduce jobs or is
> there any other kind of heavy access afoot?  Look in your logs to see if you
> can figure whats going on.  What else is going on on these machines?   Are
> the other processes starving the hbase regionservers?
>
> St.Ack

Re: NotServingRegionException as indication of cluster overloading

Posted by stack <st...@duboce.net>.
Michael:

What is happening on your cluster?  Are you doing any mapreduce jobs or is
there any other kind of heavy access afoot?  Look in your logs to see if you
can figure whats going on.  What else is going on on these machines?   Are
the other processes starving the hbase regionservers?

St.Ack

On Thu, Mar 26, 2009 at 1:22 PM, Michael Dagaev <mi...@gmail.com>wrote:

> Thanks, Schubert. I wonder why it is starting to occur very frequently
> and make also clients fail.

Re: NotServingRegionException as indication of cluster overloading

Posted by Michael Dagaev <mi...@gmail.com>.
Thanks, Schubert. I wonder why it has started to occur so frequently
and is also making clients fail.

On Thu, Mar 26, 2009 at 2:05 PM, schubert zhang <zs...@gmail.com> wrote:
> This NotServingRegionException would happen when assignment.When
> region-splitting, master should assign the new regions to servers, the
> processing need time. In this duration, the regions are not accessable.

Re: NotServingRegionException as indication of cluster overloading

Posted by schubert zhang <zs...@gmail.com>.
This NotServingRegionException can happen during assignment. When a
region splits, the master has to assign the new daughter regions to
servers, and that processing takes time. During this window, the
regions are not accessible.
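The brief unavailability window during reassignment is why clients normally retry NSREs with backoff rather than failing immediately (the HBase client does this internally). A generic sketch of that pattern -- the class and exception names here are illustrative, not HBase's actual API:

```java
import java.util.concurrent.Callable;

// Retry-with-backoff sketch: when an operation fails because its
// region is temporarily unassigned, wait for the master to finish
// reassignment, then try again with exponentially growing delays.
public class RetryingCaller {
    /** Stand-in for an exception like NotServingRegionException. */
    public static class RegionOfflineException extends Exception {}

    public static <T> T callWithRetries(Callable<T> op, int maxRetries,
                                        long initialBackoffMs) throws Exception {
        long backoff = initialBackoffMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (RegionOfflineException e) {
                if (attempt >= maxRetries) throw e;  // give up eventually
                Thread.sleep(backoff);  // let the master finish assignment
                backoff *= 2;           // back off harder each attempt
            }
        }
    }
}
```

If the NSREs persist well past the normal split/assignment window, retries alone will not help and the cluster itself needs attention, as discussed elsewhere in this thread.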


On Thu, Mar 26, 2009 at 7:20 PM, Michael Dagaev <mi...@gmail.com>wrote:

> Hi, all
>
>    It looks like NotServingRegionException starts to occur on region servers
> when the cluster gets overloaded with too many regions. Is that correct?
>
> Thank you for your cooperation,
> M.
>