Posted to user@hbase.apache.org by Tianwei <ti...@gmail.com> on 2012/02/16 07:44:19 UTC

region server died after inserting big data

Hi, all,
    I have two region servers set up, and each machine has around 32G of
memory. I started each region server with a 12G JVM heap limit.  Recently
I have been running a map-reduce job that writes a big chunk of data into an
HBase table. The job runs for around 10 hours, and the final HBase table will
be big.
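    For reference, the heap limit is set along these lines in hbase-env.sh
(the line below is paraphrased, so treat it as illustrative rather than a
verbatim copy of our config):

  # conf/hbase-env.sh -- HBASE_HEAPSIZE is in megabytes
  export HBASE_HEAPSIZE=12288   # roughly the 12G limit mentioned above
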
    Now I have found that the region server's memory usage keeps increasing
while the map-reduce job is running. When the memory reaches the 12G limit,
the region server dies. From the log, it seems that the compaction/split fails
due to the memory problem:
  2012-02-15 21:39:58,013 ERROR
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
failed for region test_table,ali
ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.

  I know very little about the internal implementation of HBase, so could you
give me some suggestions on the following questions?
   1. Why does the memory usage of the region server keep increasing? Is it
simply because I am writing a lot of data into the HBase table?  Which parts
of HBase use more memory as the table grows? Are there any configuration
options I can use to alleviate this problem? (I sketch the settings I am
guessing are relevant right after question 2, in case that helps frame this.)

   2. Why does the region server die? Is it because GC is not quick enough
to free memory for HBase? I assume that writing data and compacting/splitting
all need to allocate new memory, and if GC is not quick enough, these
operations will simply hit an exception and cause the region server to die. Is
that right?
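   To make question 1 concrete, these are the hbase-site.xml settings that I
am guessing bound the two big heap consumers (the memstores and the block
cache). The values below are only examples (I believe they are close to the
defaults), not what we actually run:

  <!-- hbase-site.xml (illustrative values, not our configuration) -->
  <property>
    <!-- fraction of the heap all memstores together may use before
         updates are blocked and flushes are forced -->
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.4</value>
  </property>
  <property>
    <!-- forced flushing continues until total memstore usage drops below this -->
    <name>hbase.regionserver.global.memstore.lowerLimit</name>
    <value>0.35</value>
  </property>
  <property>
    <!-- fraction of the heap given to the HFile block cache -->
    <name>hfile.block.cache.size</name>
    <value>0.2</value>
  </property>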


We are using hbase-0.90.1, and this problem really bothers us a lot. I hope
you can give us some suggestions.

Thank you very much

Tianwei

Re: region server died after inserting big data

Posted by Tianwei <ti...@gmail.com>.
Hi, Stack,
    Thank you very much.  We haven't done any GC tuning and are just using the
default config.  We may try to upgrade our HBase version later.
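If we do add GC tuning, I assume it would look something like the following
in hbase-env.sh (these flags are only what I have seen suggested for region
servers, not something we currently run):

  # conf/hbase-env.sh -- illustrative GC settings; we are still on the defaults
  # CMS to keep pauses shorter, plus GC logging so long pauses become visible
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc-hbase.log"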

I extracted more of the log from before and after the failure point:

2012-02-15 21:38:11,399 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
test_table,alexander hawk,1329369807272.26c8f0b19f1b475e526b12d639dc0241.
due to global heap pressure
2012-02-15 21:38:12,616 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/.tmp/2688884176779434164
to
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/attributes/6670946670903024661
2012-02-15 21:38:16,592 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/attributes/6670946670903024661,
entries=2644, sequenceid=1305864181, memsize=5.6m, filesize=3.4m
2012-02-15 21:38:16,594 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed test_table,ali ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.
2012-02-15 21:38:30,379 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/.tmp/1543489781580457584
to
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/attributes/7028169816865652663
2012-02-15 21:38:44,073 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/attributes/7028169816865652663,
entries=20658, sequenceid=1305864185, memsize=44.1m, filesize=23.5m
2012-02-15 21:38:44,074 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up
with memory above low water.
2012-02-15 21:38:44,075 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap
pressure: Region test_table,alex
yost,1329368982672.f55282926b9e8652f7bb0b98616a8216. has too many store
files, but is 88.2m vs best flushable region's 40.4m. Choosing the bigger.
2012-02-15 21:38:44,075 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
test_table,alex yost,1329368982672.f55282926b9e8652f7bb0b98616a8216. due to
global heap pressure
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/1099303928625969539.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/6670946670903024661.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/8453068235826691778.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:56,164 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback
of failed split of test_table,ali
ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.; Took too long to
split the files and create the references, aborting split
2012-02-15 21:39:58,011 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=tianweis,62222,1329359487662, load=(requests=0, regions=1350,
usedHeap=8560, maxHeap=11979): Replay of HLog required. Forcing server
shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: test_table,alex
yost,1329368982672.f55282926b9e8652f7bb0b98616a8216.
  at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:990)
  at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:895)
  at
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:847)
  at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:386)
  at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:194)
  at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:214)
Caused by: java.io.IOException: Call to
tianweis/137.110.161.244:8020 failed on local exception:
java.nio.channels.ClosedByInterruptException
  at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
  at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
  at $Proxy6.complete(Unknown Source)
  at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:616)
  at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
  at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
  at $Proxy6.complete(Unknown Source)
  at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3457)
  at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3381)
  at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
  at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
  at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:655)
  at
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:877)
  at
org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:488)
  at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:448)
  at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:81)
  at
org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1509)
  at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:968)
  ... 5 more
Caused by: java.nio.channels.ClosedByInterruptException
  at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:201)
  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:359)
  at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
  at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
  at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
  at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
  at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
2012-02-15 21:39:58,014 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=1350, stores=2147, storefiles=3595,
storefileIndexSize=2286, memstoreSize=4792, compactionQueueSize=119,
flushQueueSize=66, usedHeap=8689, maxHeap=11979, blockCacheSize=944895704,
blockCacheFree=1567335208, blockCacheCount=7739, blockCacheHitCount=17170,
blockCacheMissCount=5362757, blockCacheEvictedCount=0,
blockCacheHitRatio=0, blockCacheHitCachingRatio=68
2012-02-15 21:39:58,015 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Replay of HLog
required. Forcing server shutdown
2012-02-15 21:39:58,034 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Excluding unflushable
region test_table,alex yost,1329368982672.f55282926b9e8652f7bb0b98616a8216.
- trying to find a different region to flush.
2012-02-15 21:39:58,034 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 62222
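
If I am reading the metrics dump right (assuming those sizes are in MB), the
memstores alone were at memstoreSize=4792 out of maxHeap=11979, i.e. roughly
40% of the heap, which would explain why every flush is logged as being "due
to global heap pressure"; the block cache adds about another 0.9 GB
(blockCacheSize=944895704 bytes), and usedHeap was already 8689.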


Thanks.

Tianwei
On Thu, Feb 16, 2012 at 9:02 AM, Stack <st...@duboce.net> wrote:

> On Wed, Feb 15, 2012 at 10:44 PM, Tianwei <ti...@gmail.com> wrote:
> > limit, it will die. From the log, it seems that the compact/split will
> fail
> > due to the memory problem:
> >  2012-02-15 21:39:58,013 ERROR
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
> > failed for region test_table,ali
> > ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.
> >
> >  I know very little about the internal implementation of hbase, could you
> > guys give me some suggestions for the following questions:
> >   1. Why the memory usage of the region server keep increasing? Is it
> > simply because I am writing big data into the hbase table?  Which parts
> of
> > hbase will use more memory for increasing table size? Are there any
> > configuration options for me to alleviate this problem?
> >
> >   2. Why the region server die? is it because the GC is not quick enough
> > to free memory for hbase? I assume that writing data, compacting/spliting
> > all need to allocate new memory. And if the GC is not quick enough, these
> > function will simply get exception and cause the region server to die. Is
> > that right?
> >
>
> Need more logs from regionserver and you should enable GC logging.
>
> Heap increases as you put more data in (more regions, more memstores).
>  Its natural!
>
> My guess is you are losing your session with zookeeper because of a
> big GC pause.  Have you done any GC tuning?  Using default configs?
>
> >
> > We used hbase-0.90.1 version and this problem really bothers us a lot.
> Hope
> > you can give us some suggestions.
> >
>
> Update your hbase.  That'll probably help too.
>
> St.Ack
>

Re: region server died after inserting big data

Posted by Stack <st...@duboce.net>.
On Wed, Feb 15, 2012 at 10:44 PM, Tianwei <ti...@gmail.com> wrote:
> limit, it will die. From the log, it seems that the compact/split will fail
> due to the memory problem:
>  2012-02-15 21:39:58,013 ERROR
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
> failed for region test_table,ali
> ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.
>
>  I know very little about the internal implementation of hbase, could you
> guys give me some suggestions for the following questions:
>   1. Why the memory usage of the region server keep increasing? Is it
> simply because I am writing big data into the hbase table?  Which parts of
> hbase will use more memory for increasing table size? Are there any
> configuration options for me to alleviate this problem?
>
>   2. Why the region server die? is it because the GC is not quick enough
> to free memory for hbase? I assume that writing data, compacting/spliting
> all need to allocate new memory. And if the GC is not quick enough, these
> function will simply get exception and cause the region server to die. Is
> that right?
>

Need more logs from regionserver and you should enable GC logging.

Heap increases as you put more data in (more regions, more memstores).
 It's natural!

My guess is you are losing your session with zookeeper because of a
big GC pause.  Have you done any GC tuning?  Using default configs?

>
> We used hbase-0.90.1 version and this problem really bothers us a lot. Hope
> you can give us some suggestions.
>

Update your hbase.  That'll probably help too.

St.Ack