Posted to user@hbase.apache.org by Dru Jensen <dr...@gmail.com> on 2008/10/23 20:11:53 UTC

NotServingRegionException - Map/Reduce process fails

Hi hbase-users,

During a fairly large MR process, on the Reduce cycle as it's writing
its results to a table, I see
org.apache.hadoop.hbase.NotServingRegionException in the region server
log several times, and then I see a split reporting it was successful.

Eventually, the Reduce process fails with  
org.apache.hadoop.hbase.client.RetriesExhaustedException after 10  
failed attempts.

What can I do to fix it?

Thanks,
Dru




Re: NotServingRegionException - Map/Reduce process fails

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Dru.

To make sure it's not 921, check whether the region that threw the NSRE is
currently reachable (provided that you did not reboot). With 921, the
region should be assigned in META but missing from the region server.
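
A rough way to eyeball that from code -- an untested sketch against the
0.19-era client API ('info:regioninfo' and 'info:server' are the usual
.META. columns; treat the calls as approximate):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaCheck {
  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    // Walk .META. and print the server each region is assigned to.
    Scanner s = meta.getScanner(new byte[][] {
        Bytes.toBytes("info:regioninfo"), Bytes.toBytes("info:server") });
    try {
      RowResult r;
      while ((r = s.next()) != null) {
        Cell server = r.get(Bytes.toBytes("info:server"));
        // A region listed here but not open on its server is the 921 case.
        System.out.println(Bytes.toString(r.getRow()) + " -> "
            + (server == null ? "unassigned"
                              : Bytes.toString(server.getValue())));
      }
    } finally {
      s.close();
    }
  }
}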

J-D

On Thu, Oct 23, 2008 at 3:07 PM, Dru Jensen <dr...@gmail.com> wrote:

> Stack and J-D, Thanks for your responses.
>
> It looks like the RetriesExhaustedException occurred during:
>
> 2008-10-23 11:08:55,180 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> compaction completed on region ... 1224785065371 in 4mins, 25sec
>
> It doesn't look like I am having the HBASE-921 issue (yet).
>
> What settings can I change so that compactions don't take so long?
>
> I found this setting:
>
> <property>
>    <name>hbase.hstore.compactionThreshold</name>
>    <value>3</value>
>    <description>
>    If more than this number of HStoreFiles in any one HStore
>    (one HStoreFile is written per flush of memcache) then a compaction
>    is run to rewrite all HStoreFiles as one.  Larger numbers
>    put off compaction but when it runs, it takes longer to complete.
>    During a compaction, updates cannot be flushed to disk.  Long
>    compactions require memory sufficient to carry the logging of
>    all updates across the duration of the compaction.
>
>    If too large, clients time out during compaction.
>    </description>
> </property>
>
> Should I lower this or is there a better way?
>
> Thanks,
> Dru
>
> On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
>
>> Dru.
>>
>> See also if it's a case of
>> HBASE-921 <https://issues.apache.org/jira/browse/HBASE-921>, because it
>> would make sense if you are not using hbase 0.18.1 and are under a
>> heavy load.
>>
>> J-D
>>
>> On Thu, Oct 23, 2008 at 2:30 PM, stack <st...@duboce.net> wrote:
>>
>>> Find the MR task that failed.  Click through the UI to look at its
>>> logs.  It may have interesting info.  It's probably complaining about a
>>> region not being available (NSRE).  Figure out which region it is.  Use
>>> the region historian or grep in the master logs -- 'grep -v metaScanner
>>> REGIONNAME' so you avoid the metaScanner noise -- to see if you can
>>> figure the region's history around the failure.  Look too at loading
>>> around failure time.  Were you swapping, etc.?  (Ganglia or some such
>>> helps here.)
>>>
>>> You might also test that the table is still wholesome -- that the MR
>>> job didn't damage the table.  A quick check that all regions are
>>> onlined and accessible is to scan for a column whose column family does
>>> exist but whose qualifier you know is not present: e.g. if you have
>>> columnfamily 'page' and you know there is no column 'page:xyz', scan
>>> with that (enable DEBUG in log4j so you can see regions being loaded as
>>> the scan progresses): "scan 'TABLENAME', ['page:xyz']".
>>>
>>> You might need to up the timeouts/retries.
>>> St.Ack
>>>
>>>
>>>
>>> Dru Jensen wrote:
>>>
>>>> Hi hbase-users,
>>>>
>>>> During a fairly large MR process, on the Reduce cycle as it's writing
>>>> its results to a table, I see
>>>> org.apache.hadoop.hbase.NotServingRegionException
>>>> in the region server log several times, and then I see a split
>>>> reporting it was successful.
>>>>
>>>> Eventually, the Reduce process fails with
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 failed
>>>> attempts.
>>>>
>>>> What can I do to fix it?
>>>>
>>>> Thanks,
>>>> Dru
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>

Re: NotServingRegionException - Map/Reduce process fails

Posted by Dru Jensen <dr...@gmail.com>.
St.Ack and J-D,

Thanks for your help.  Upgrading to the latest 0.19.0 and changing the
region size back to 256MB, along with the premature-EOF settings from
Jean-Adrien, fixed the issues I was seeing.

Dru

On Oct 23, 2008, at 4:04 PM, stack wrote:

> Dru Jensen wrote:
>> Stack,
>>
>> Sorry for the confusion, I am not using the old implementation of
>> TableReduce.  The new 0.19.0 changed this to an interface.  The
>> reduce process is performing calculations; it's not just writing
>> to the table, so it requires the sort.
> Or try running with even more reducers so loading is spread more  
> evenly?
>
>> I will change the region size back and see if that helps.  If I  
>> find that I need a larger region, should I change the flush by the  
>> same multiple?
> Yes.
> St.Ack
>>
>> thanks,
>> Dru
>>
>> On Oct 23, 2008, at 2:18 PM, stack wrote:
>>
>>> Any reason you need to use TableReduce?  If you delay the insert
>>> into hbase till reduce-time, it means 1) the MR framework has
>>> spent a bunch of resources shuffling and sorting your data, a sort
>>> that is going to happen on hbase insert anyway, and 2) your
>>> inserts are going into hbase in order, so you pound one region
>>> rather than inserting across all.  You might try inserting into
>>> hbase at the tail of your map task and output nothing (or
>>> something small to keep up the job counters).
>>>
>>> Are your rows > 256MB?  At the moment at least, there needs to be
>>> a bit of balance maintained between flushing, compacting and
>>> splitting.  The defaults do that.  I'm not sure what happens when
>>> you double the max filesize but not, correspondingly, the
>>> flushsize.  You might try restoring the default (hbase will not
>>> try to split a row if it's > the configured max filesize).
>>>
>>> St.Ack
>>
>


Re: NotServingRegionException - Map/Reduce process fails

Posted by stack <st...@duboce.net>.
Dru Jensen wrote:
> Stack,
>
> Sorry for the confusion, I am not using the old implementation of
> TableReduce.  The new 0.19.0 changed this to an interface.  The reduce
> process is performing calculations; it's not just writing to the
> table, so it requires the sort.
Or try running with even more reducers so loading is spread more evenly?
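
Something like this, roughly -- old mapred API; the class name and the
count are just placeholders:

import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
  public static JobConf configure() {
    JobConf job = new JobConf(JobSetup.class);
    // More reducers spread the ordered writes across more concurrent
    // clients, so no single region takes the whole write load at once.
    job.setNumReduceTasks(16);  // hypothetical count; size to your cluster
    return job;
  }
}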

> I will change the region size back and see if that helps.  If I find 
> that I need a larger region, should I change the flush by the same 
> multiple?
Yes.
St.Ack
>
> thanks,
> Dru
>
> On Oct 23, 2008, at 2:18 PM, stack wrote:
>
>> Any reason you need to use TableReduce?  If you delay the insert into
>> hbase till reduce-time, it means 1) the MR framework has spent a
>> bunch of resources shuffling and sorting your data, a sort that is
>> going to happen on hbase insert anyway, and 2) your inserts are
>> going into hbase in order, so you pound one region rather than
>> inserting across all.  You might try inserting into hbase at the tail
>> of your map task and output nothing (or something small to keep up
>> the job counters).
>>
>> Are your rows > 256MB?  At the moment at least, there needs to be a
>> bit of balance maintained between flushing, compacting and
>> splitting.  The defaults do that.  I'm not sure what happens when you
>> double the max filesize but not, correspondingly, the flushsize.  You
>> might try restoring the default (hbase will not try to split a
>> row if it's > the configured max filesize).
>>
>> St.Ack
>


Re: NotServingRegionException - Map/Reduce process fails

Posted by Dru Jensen <dr...@gmail.com>.
Stack,

Sorry for the confusion, I am not using the old implementation of
TableReduce.  The new 0.19.0 changed this to an interface.  The reduce
process is performing calculations; it's not just writing to the
table, so it requires the sort.

I will change the region size back and see if that helps.  If I find  
that I need a larger region, should I change the flush by the same  
multiple?

thanks,
Dru

On Oct 23, 2008, at 2:18 PM, stack wrote:

> Any reason you need to use TableReduce?  If you delay the insert
> into hbase till reduce-time, it means 1) the MR framework has
> spent a bunch of resources shuffling and sorting your data, a sort
> that is going to happen on hbase insert anyway, and 2) your
> inserts are going into hbase in order, so you pound one region rather
> than inserting across all.  You might try inserting into hbase at the
> tail of your map task and output nothing (or something small to keep
> up the job counters).
>
> Are your rows > 256MB?  At the moment at least, there needs to be a
> bit of balance maintained between flushing, compacting and
> splitting.  The defaults do that.  I'm not sure what happens when
> you double the max filesize but not, correspondingly, the flushsize.
> You might try restoring the default (hbase will not try to split
> a row if it's > the configured max filesize).
>
> St.Ack


Re: NotServingRegionException - Map/Reduce process fails

Posted by stack <st...@duboce.net>.
Any reason you need to use TableReduce?  If you delay the insert into
hbase till reduce-time, it means 1) the MR framework has spent a bunch
of resources shuffling and sorting your data, a sort that is going to
happen on hbase insert anyway, and 2) your inserts are going into hbase
in order, so you pound one region rather than inserting across all.
You might try inserting into hbase at the tail of your map task and
output nothing (or something small to keep up the job counters).
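
Roughly like this -- an untested sketch against the 0.19-era client and
old mapred APIs; the table name and column are made up:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InsertingMap extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      // 'mytable' is a made-up table name.
      table = new HTable(new HBaseConfiguration(), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    // Insert straight from the map; rows arrive in input order rather
    // than sorted order, so the writes spread across regions.
    BatchUpdate bu = new BatchUpdate(Bytes.toBytes(value.toString()));
    bu.put("page:content", Bytes.toBytes("..."));  // made-up column
    table.commit(bu);
    // Emit nothing; bump a counter via reporter if you want progress.
  }
}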

Are your rows > 256MB?  At the moment at least, there needs to be a bit
of balance maintained between flushing, compacting and splitting.  The
defaults do that.  I'm not sure what happens when you double the max
filesize but not, correspondingly, the flushsize.  You might try
restoring the default (hbase will not try to split a row if it's > the
configured max filesize).
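
If you do go bigger, scale both together -- a sketch; the key names are
the 0.19-era ones from hbase-default.xml, so double-check them against
your version:

import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionSizing {
  public static HBaseConfiguration sized() {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Keep the flush/compact/split balance: scale both by the same 2x.
    conf.setLong("hbase.hregion.max.filesize", 512L * 1024 * 1024);
    conf.setLong("hbase.hregion.memcache.flush.size", 128L * 1024 * 1024);
    return conf;
  }
}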

St.Ack

Re: NotServingRegionException - Map/Reduce process fails

Posted by Dru Jensen <dr...@gmail.com>.
I do not see any swapping.  I have a 3-node cluster with 8GB of memory
and 4 CPUs each, and 2TB of HDFS.  Node 1 is acting as master.

I am reducing 32M map results into about 2M rows, across several column
families with tens of columns each.  I am writing them to a table
using the TableReduce class.

Grepping for compactions in the regionserver log, they are progressively
getting longer, from seconds to minutes and then to the 4mins, 25sec
before the Reduce cycle failed and started over.

   <property>
     <name>hbase.regionserver.lease.period</name>
     <value>120000</value>
   </property>
   <property>
     <name>hbase.hregion.max.filesize</name>
     <value>536870912</value>
   </property>
   <property>
     <name>dfs.datanode.socket.write.timeout</name>
     <value>0</value>
   </property>
   <property>
     <name>dfs.datanode.max.xcievers</name>
     <value>1023</value>
   </property>

I have increased the region file size to 512MB to support the large  
rows.

I was hitting the same premature-EOF socket error as Jean-Adrien, so I
added the timeout and xcievers settings recently.  I haven't seen the
premature-EOF issue anymore since the setting changes, but now I am
hitting this one.

I checked META and verified that all the online regions exist in the
table's region list, but to be sure I will upgrade to 0.18.1 to avoid
the HBASE-921 issue.

Should I change the flush size? compaction threshold?

thanks,
Dru


On Oct 23, 2008, at 12:20 PM, stack wrote:

> Dru: If compactions are taking 4 minutes, then your instance is being
> overrun; it's unable to keep up with your rate of upload.  What's your
> upload rate like?  How are you doing it?  Or is it that your servers
> are buckled carrying the load?  Are they swapping?  Usually
> compaction runs fast.  It'll take long if it's compacting many more
> files than the threshold.  Grep your logs and see if compactions are
> taking steadily longer.  Do you have a lot of blocking happening in
> your logs (where the regionserver puts up a temporary block on updates
> because it isn't able to flush fast enough)?  You're on recent
> hbase?  Have you altered flush or maximum region file sizes?
>
> St.Ack
>
> Dru Jensen wrote:
>> Stack and J-D, Thanks for your responses.
>>
>> It looks like the RetriesExhaustedException occurred during:
>>
>> 2008-10-23 11:08:55,180 INFO  
>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed  
>> on region ... 1224785065371 in 4mins, 25sec
>>
>> It doesn't look like I am having the HBASE-921 issue (yet).
>>
>> What settings can I change so that compactions don't take so long?
>>
>> I found this setting:
>>
>> <property>
>>    <name>hbase.hstore.compactionThreshold</name>
>>    <value>3</value>
>>    <description>
>>    If more than this number of HStoreFiles in any one HStore
>>    (one HStoreFile is written per flush of memcache) then a  
>> compaction
>>    is run to rewrite all HStoreFiles as one.  Larger numbers
>>    put off compaction but when it runs, it takes longer to complete.
>>    During a compaction, updates cannot be flushed to disk.  Long
>>    compactions require memory sufficient to carry the logging of
>>    all updates across the duration of the compaction.
>>
>>    If too large, clients time out during compaction.
>>    </description>
>> </property>
>>
>> Should I lower this or is there a better way?
>>
>> Thanks,
>> Dru
>>
>> On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
>>
>>> Dru.
>>>
>>> See also if it's a case of
>>> HBASE-921 <https://issues.apache.org/jira/browse/HBASE-921>, because it
>>> would make sense if you are not using hbase 0.18.1 and are under a
>>> heavy load.
>>>
>>> J-D
>>>
>>> On Thu, Oct 23, 2008 at 2:30 PM, stack <st...@duboce.net> wrote:
>>>
>>>> Find the MR task that failed.  Click through the UI to look at its
>>>> logs.  It may have interesting info.  It's probably complaining
>>>> about a region not being available (NSRE).  Figure out which region
>>>> it is.  Use the region historian or grep in the master logs --
>>>> 'grep -v metaScanner REGIONNAME' so you avoid the metaScanner noise
>>>> -- to see if you can figure the region's history around the
>>>> failure.  Look too at loading around failure time.  Were you
>>>> swapping, etc.?  (Ganglia or some such helps here.)
>>>>
>>>> You might also test that the table is still wholesome -- that the
>>>> MR job didn't damage the table.  A quick check that all regions are
>>>> onlined and accessible is to scan for a column whose column family
>>>> does exist but whose qualifier you know is not present: e.g. if you
>>>> have columnfamily 'page' and you know there is no column 'page:xyz',
>>>> scan with that (enable DEBUG in log4j so you can see regions being
>>>> loaded as the scan progresses): "scan 'TABLENAME', ['page:xyz']".
>>>>
>>>> You might need to up the timeouts/retries.
>>>> St.Ack
>>>>
>>>>
>>>>
>>>> Dru Jensen wrote:
>>>>
>>>>> Hi hbase-users,
>>>>>
>>>>> During a fairly large MR process, on the Reduce cycle as it's
>>>>> writing its results to a table, I see
>>>>> org.apache.hadoop.hbase.NotServingRegionException
>>>>> in the region server log several times, and then I see a split
>>>>> reporting it was successful.
>>>>>
>>>>> Eventually, the Reduce process fails with
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException after  
>>>>> 10 failed
>>>>> attempts.
>>>>>
>>>>> What can I do to fix it?
>>>>>
>>>>> Thanks,
>>>>> Dru
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: NotServingRegionException - Map/Reduce process fails

Posted by stack <st...@duboce.net>.
Dru: If compactions are taking 4 minutes, then your instance is being
overrun; it's unable to keep up with your rate of upload.  What's your
upload rate like?  How are you doing it?  Or is it that your servers are
buckled carrying the load?  Are they swapping?  Usually compaction runs
fast.  It'll take long if it's compacting many more files than the
threshold.  Grep your logs and see if compactions are taking steadily
longer.  Do you have a lot of blocking happening in your logs (where the
regionserver puts up a temporary block on updates because it isn't able
to flush fast enough)?  You're on recent hbase?  Have you altered flush
or maximum region file sizes?

St.Ack

Dru Jensen wrote:
> Stack and J-D, Thanks for your responses.
>
> It looks like the RetriesExhaustedException occurred during:
>
> 2008-10-23 11:08:55,180 INFO 
> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on 
> region ... 1224785065371 in 4mins, 25sec
>
> It doesn't look like I am having the HBASE-921 issue (yet).
>
> What settings can I change so that compactions don't take so long?
>
> I found this setting:
>
> <property>
>     <name>hbase.hstore.compactionThreshold</name>
>     <value>3</value>
>     <description>
>     If more than this number of HStoreFiles in any one HStore
>     (one HStoreFile is written per flush of memcache) then a compaction
>     is run to rewrite all HStoreFiles as one.  Larger numbers
>     put off compaction but when it runs, it takes longer to complete.
>     During a compaction, updates cannot be flushed to disk.  Long
>     compactions require memory sufficient to carry the logging of
>     all updates across the duration of the compaction.
>
>     If too large, clients time out during compaction.
>     </description>
> </property>
>
> Should I lower this or is there a better way?
>
> Thanks,
> Dru
>
> On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
>
>> Dru.
>>
>> See also if it's a case of
>> HBASE-921 <https://issues.apache.org/jira/browse/HBASE-921>, because it
>> would make sense if you are not using hbase 0.18.1 and are under a
>> heavy load.
>>
>> J-D
>>
>> On Thu, Oct 23, 2008 at 2:30 PM, stack <st...@duboce.net> wrote:
>>
>>> Find the MR task that failed.  Click through the UI to look at its
>>> logs.  It may have interesting info.  It's probably complaining
>>> about a region not being available (NSRE).  Figure out which region
>>> it is.  Use the region historian or grep in the master logs --
>>> 'grep -v metaScanner REGIONNAME' so you avoid the metaScanner noise
>>> -- to see if you can figure the region's history around the failure.
>>> Look too at loading around failure time.  Were you swapping, etc.?
>>> (Ganglia or some such helps here.)
>>>
>>> You might also test that the table is still wholesome -- that the MR
>>> job didn't damage the table.  A quick check that all regions are
>>> onlined and accessible is to scan for a column whose column family
>>> does exist but whose qualifier you know is not present: e.g. if you
>>> have columnfamily 'page' and you know there is no column 'page:xyz',
>>> scan with that (enable DEBUG in log4j so you can see regions being
>>> loaded as the scan progresses): "scan 'TABLENAME', ['page:xyz']".
>>>
>>> You might need to up the timeouts/retries.
>>> St.Ack
>>>
>>>
>>>
>>> Dru Jensen wrote:
>>>
>>>> Hi hbase-users,
>>>>
>>>> During a fairly large MR process, on the Reduce cycle as it's
>>>> writing its results to a table, I see
>>>> org.apache.hadoop.hbase.NotServingRegionException
>>>> in the region server log several times, and then I see a split
>>>> reporting it was successful.
>>>>
>>>> Eventually, the Reduce process fails with
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 
>>>> failed
>>>> attempts.
>>>>
>>>> What can I do to fix it?
>>>>
>>>> Thanks,
>>>> Dru
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: NotServingRegionException - Map/Reduce process fails

Posted by Dru Jensen <dr...@gmail.com>.
Stack and J-D, Thanks for your responses.

It looks like the RetriesExhaustedException occurred during:

2008-10-23 11:08:55,180 INFO  
org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on  
region ... 1224785065371 in 4mins, 25sec

It doesn't look like I am having the HBASE-921 issue (yet).

What settings can I change so that compactions don't take so long?

I found this setting:

<property>
     <name>hbase.hstore.compactionThreshold</name>
     <value>3</value>
     <description>
     If more than this number of HStoreFiles in any one HStore
     (one HStoreFile is written per flush of memcache) then a compaction
     is run to rewrite all HStoreFiles as one.  Larger numbers
     put off compaction but when it runs, it takes longer to complete.
     During a compaction, updates cannot be flushed to disk.  Long
     compactions require memory sufficient to carry the logging of
     all updates across the duration of the compaction.

     If too large, clients time out during compaction.
     </description>
</property>

Should I lower this or is there a better way?

Thanks,
Dru

On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:

> Dru.
>
> See also if it's a case of
> HBASE-921 <https://issues.apache.org/jira/browse/HBASE-921>, because it
> would make sense if you are not using hbase 0.18.1 and are under a
> heavy load.
>
> J-D
>
> On Thu, Oct 23, 2008 at 2:30 PM, stack <st...@duboce.net> wrote:
>
>> Find the MR task that failed.  Click through the UI to look at its
>> logs.  It may have interesting info.  It's probably complaining about
>> a region not being available (NSRE).  Figure out which region it is.
>> Use the region historian or grep in the master logs -- 'grep -v
>> metaScanner REGIONNAME' so you avoid the metaScanner noise -- to see
>> if you can figure the region's history around the failure.  Look too
>> at loading around failure time.  Were you swapping, etc.?  (Ganglia
>> or some such helps here.)
>>
>> You might also test that the table is still wholesome -- that the MR
>> job didn't damage the table.  A quick check that all regions are
>> onlined and accessible is to scan for a column whose column family
>> does exist but whose qualifier you know is not present: e.g. if you
>> have columnfamily 'page' and you know there is no column 'page:xyz',
>> scan with that (enable DEBUG in log4j so you can see regions being
>> loaded as the scan progresses): "scan 'TABLENAME', ['page:xyz']".
>>
>> You might need to up the timeouts/retries.
>> St.Ack
>>
>>
>>
>> Dru Jensen wrote:
>>
>>> Hi hbase-users,
>>>
>>> During a fairly large MR process, on the Reduce cycle as it's
>>> writing its results to a table, I see
>>> org.apache.hadoop.hbase.NotServingRegionException
>>> in the region server log several times, and then I see a split
>>> reporting it was successful.
>>>
>>> Eventually, the Reduce process fails with
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10  
>>> failed
>>> attempts.
>>>
>>> What can I do to fix it?
>>>
>>> Thanks,
>>> Dru
>>>
>>>
>>>
>>>
>>>
>>


Re: NotServingRegionException - Map/Reduce process fails

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Dru.

See also if it's a case of
HBASE-921 <https://issues.apache.org/jira/browse/HBASE-921>, because it
would make sense if you are not using hbase 0.18.1 and are under a
heavy load.

J-D

On Thu, Oct 23, 2008 at 2:30 PM, stack <st...@duboce.net> wrote:

> Find the MR task that failed.  Click through the UI to look at its
> logs.  It may have interesting info.  It's probably complaining about a
> region not being available (NSRE).  Figure out which region it is.  Use
> the region historian or grep in the master logs -- 'grep -v metaScanner
> REGIONNAME' so you avoid the metaScanner noise -- to see if you can
> figure the region's history around the failure.  Look too at loading
> around failure time.  Were you swapping, etc.?  (Ganglia or some such
> helps here.)
>
> You might also test that the table is still wholesome -- that the MR
> job didn't damage the table.  A quick check that all regions are
> onlined and accessible is to scan for a column whose column family does
> exist but whose qualifier you know is not present: e.g. if you have
> columnfamily 'page' and you know there is no column 'page:xyz', scan
> with that (enable DEBUG in log4j so you can see regions being loaded as
> the scan progresses): "scan 'TABLENAME', ['page:xyz']".
>
> You might need to up the timeouts/retries.
> St.Ack
>
>
>
> Dru Jensen wrote:
>
>> Hi hbase-users,
>>
>> During a fairly large MR process, on the Reduce cycle as it's writing
>> its results to a table, I see
>> org.apache.hadoop.hbase.NotServingRegionException
>> in the region server log several times, and then I see a split
>> reporting it was successful.
>>
>> Eventually, the Reduce process fails with
>> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 failed
>> attempts.
>>
>> What can I do to fix it?
>>
>> Thanks,
>> Dru
>>
>>
>>
>>
>>
>

Re: NotServingRegionException - Map/Reduce process fails

Posted by stack <st...@duboce.net>.
Find the MR task that failed.  Click through the UI to look at its
logs.  It may have interesting info.  It's probably complaining about a
region not being available (NSRE).  Figure out which region it is.  Use
the region historian or grep in the master logs -- 'grep -v metaScanner
REGIONNAME' so you avoid the metaScanner noise -- to see if you can
figure the region's history around the failure.  Look too at loading
around failure time.  Were you swapping, etc.?  (Ganglia or some such
helps here.)

You might also test that the table is still wholesome -- that the MR
job didn't damage the table.  A quick check that all regions are
onlined and accessible is to scan for a column whose column family does
exist but whose qualifier you know is not present: e.g. if you have
columnfamily 'page' and you know there is no column 'page:xyz', scan
with that (enable DEBUG in log4j so you can see regions being loaded as
the scan progresses): "scan 'TABLENAME', ['page:xyz']".

You might need to up the timeouts/retries.
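
For instance -- illustrative values; 'hbase.client.retries.number'
(default 10, which matches the 10 attempts you saw) and
'hbase.client.pause' are the client-side knobs:

import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientRetries {
  public static HBaseConfiguration patientClient() {
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.setInt("hbase.client.retries.number", 20);  // default is 10
    conf.setLong("hbase.client.pause", 5000);        // ms between retries
    return conf;
  }
}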
St.Ack


Dru Jensen wrote:
> Hi hbase-users,
>
> During a fairly large MR process, on the Reduce cycle as it's writing
> its results to a table, I see
> org.apache.hadoop.hbase.NotServingRegionException in the region server
> log several times, and then I see a split reporting it was successful.
>
> Eventually, the Reduce process fails with 
> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 
> failed attempts.
>
> What can I do to fix it?
>
> Thanks,
> Dru
>
>
>
>