Posted to user@hbase.apache.org by Bryan Beaudreault <bb...@hubspot.com> on 2012/04/11 19:24:34 UTC

Cascading failure leads to loss of all region servers

We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
hosting about 17k regions.  Each region server has 10GB of heap, and under
normal operating conditions I have never seen our used heap go above 5-8GB.
Yesterday we were running a job to populate a new table, and this resulted
in a cascading OOM failure which ended with all region servers being down.

The failure on each node went something like this (region A is the same
region across all servers, getting passed along as each dies):


   1. RS inherits region A.
   2. RS tries to flush region A, but the region has "too many store
   files".  The RS delays the flush and instead runs a compaction.
   3. A 1-minute pause in the logs (possibly a GC; the logs had been coming
   in steadily every 1-2 seconds), resulting in a lost connection to ZK.
   4. RS reconnects to ZK and blocks updates on region A because the
   memstore is too big (129.8m is > 128m blocking size).
   5. Another 30-second pause (another GC?).
   6. The connection to the server from the master is lost.
   7. 1-2 minutes later, the RS aborts the compaction and throws
   OutOfMemoryError: Java heap space.  The exception comes from the
   compaction (pasted below).

That pattern repeats on all of the region servers, every 5-8 minutes, until
all are down.  Should there be some safeguard against a compaction causing a
region server to go OOM?  The region appears to be only around 425 MB in
size.
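
For reference, the thresholds in steps 2 and 4 come from a handful of
hbase-site.xml settings; below is a minimal sketch, assuming the
0.90/cdh3u2-era defaults (64 MB flush size, 2x block multiplier, 7 blocking
store files), shown only to make the log messages concrete, not as tuning
advice:

    <!-- Sketch only: the defaults behind the messages above.
         "too many store files" -> hbase.hstore.blockingStoreFiles (7)
         "128m blocking size"   -> memstore flush size * block multiplier (64m * 2) -->
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>7</value>
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>67108864</value>  <!-- 64 MB -->
    </property>
    <property>
      <name>hbase.hregion.memstore.block.multiplier</name>
      <value>2</value>  <!-- 2 * 64 MB = the 128m blocking size in step 4 -->
    </property>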


---


(This exception comes from the first region server to go down.  The others
were very similar, with the same stack trace.)
12/04/10 08:04:57 INFO regionserver.HRegion: aborted compaction on region analytics-search-a2,\x00\x00^\xF4\x00\x0A0,1334056032860.ab55c22574a9cddec8a3e73fd99be57d. after 14mins, 34sec
12/04/10 08:04:57 FATAL regionserver.HRegionServer: ABORTING region server serverName=XXXXXXXX,60020,1326728856867, load=(requests=65, regions=1082, usedHeap=10226, maxHeap=10231): Uncaught exception in service thread regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:43)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.<init>(BlockDecompressorStream.java:45)
        at com.hadoop.compression.lzo.LzoCodec.createInputStream(LzoCodec.java:173)
        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:206)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1087)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1280)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:943)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)

Re: Cascading failure leads to loss of all region servers

Posted by Andrew Purtell <ap...@yahoo.com>.
One idea we took from the 0.89-FB branch is setting the internal scanner read batching for compaction (compactionKVMax) to 1, since there isn't otherwise a server-side benefit for compaction, and we run with heaps that sit at up to 90% utilization for stretches, as observed via JMX. I wonder if that would have had an impact here. Just a random thought; pardon me if the default is already 1 (IIRC it's 10) or something silly like that.
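
For reference, a minimal hbase-site.xml sketch of that idea, assuming the
property behind compactionKVMax is hbase.hstore.compaction.kv.max (default 10
around this version); the value here is illustrative rather than a
recommendation:

    <!-- Sketch only: cap compaction scanner read batching at 1 KV per next()
         call, as suggested above.  Property name assumed to be
         hbase.hstore.compaction.kv.max (default 10). -->
    <property>
      <name>hbase.hstore.compaction.kv.max</name>
      <value>1</value>
    </property>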

Best regards,

    - Andy


On Apr 11, 2012, at 6:17 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:

> Hi Stack,
> 
> Thanks for the reply.  Unfortunately, our first instinct was to restart the
> region servers and when they came up it appears the compaction was able to
> succeed (perhaps because on a fresh restart the heap was low enough to
> succeed).  I listed the files under that region and there is now only 1
> file.
> 
> We are going to be running this job again in the near future.  We are going
> to try to rate limit the writes a bit (though only 10 reducers were running
> at once to begin with), and I will keep in mind your suggestions if it
> happens despite that.
> 
> - Bryan
> 
> On Wed, Apr 11, 2012 at 4:35 PM, Stack <st...@duboce.net> wrote:
> 
>> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
>> <bb...@hubspot.com> wrote:
>>> We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
>>> hosting about 17k regions.
>> 
>> That's too many, but that's another story.
>> 
>>> That pattern repeats on all of the region servers, every 5-8 minutes
>> until
>>> all are down. Should there be some safeguards on a compaction causing a
>>> region server to go OOM?  The region appears to only be around 425mb in
>>> size.
>>> 
>> 
>> My guess is that Region A has a massive or corrupt record in it.
>> 
>> You could disable the region for now while you are figuring out what's
>> wrong w/ it.
>> 
>> If you list files under this region, what do you see?  Are there many?
>> 
>> Can you see what files are selected for compaction?  This will narrow
>> the set to look at.  You could poke at them w/ the hfile tool.  See
>> '8.7.5.2.2. HFile Tool' in the reference guide.
>> 
>> St.Ack
>> 

Re: Cascading failure leads to loss of all region servers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Hi Stack,

Thanks for the reply.  Unfortunately, our first instinct was to restart the
region servers and when they came up it appears the compaction was able to
succeed (perhaps because on a fresh restart the heap was low enough to
succeed).  I listed the files under that region and there is now only 1
file.

We are going to be running this job again in the near future.  We are going
to try to rate limit the writes a bit (though only 10 reducers were running
at once to begin with), and I will keep in mind your suggestions if it
happens despite that.

- Bryan

On Wed, Apr 11, 2012 at 4:35 PM, Stack <st...@duboce.net> wrote:

> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
> <bb...@hubspot.com> wrote:
> > We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> > hosting about 17k regions.
>
> That's too many, but that's another story.
>
> > That pattern repeats on all of the region servers, every 5-8 minutes
> until
> > all are down. Should there be some safeguards on a compaction causing a
> > region server to go OOM?  The region appears to only be around 425mb in
> > size.
> >
>
> My guess is that Region A has a massive or corrupt record in it.
>
> You could disable the region for now while you are figuring out what's
> wrong w/ it.
>
> If you list files under this region, what do you see?  Are there many?
>
> Can you see what files are selected for compaction?  This will narrow
> the set to look at.  You could poke at them w/ the hfile tool.  See
> '8.7.5.2.2. HFile Tool' in the reference guide.
>
> St.Ack
>

Re: Cascading failure leads to loss of all region servers

Posted by Stack <st...@duboce.net>.
On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
<bb...@hubspot.com> wrote:
> We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> hosting about 17k regions.

That's too many, but that's another story.

> That pattern repeats on all of the region servers, every 5-8 minutes until
> all are down. Should there be some safeguards on a compaction causing a
> region server to go OOM?  The region appears to only be around 425mb in
> size.
>

My guess is that Region A has a massive or corrupt record in it.

You could disable the region for now while you are figuring out what's wrong w/ it.

If you list files under this region, what do you see?  Are there many?

Can you see what files are selected for compaction?  This will narrow
the set to look at.  You could poke at them w/ the hfile tool.  See
'8.7.5.2.2. HFile Tool' in the reference guide.
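
For reference, the commands involved might look something like the sketch
below; the column family and store file names are placeholders, and the HFile
tool flags are from memory, so run the class with no arguments to check the
usage for your version:

    # List the store files under the region in HDFS
    # (<family> is a placeholder for the column family name).
    hadoop fs -ls /hbase/analytics-search-a2/ab55c22574a9cddec8a3e73fd99be57d/<family>

    # Dump metadata (-m) and verbose key info (-v) for one selected store file.
    hbase org.apache.hadoop.hbase.io.hfile.HFile -m -v -f \
        /hbase/analytics-search-a2/ab55c22574a9cddec8a3e73fd99be57d/<family>/<storefile>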

St.Ack