Posted to user@hbase.apache.org by Qian Ye <ye...@gmail.com> on 2012/04/05 05:09:12 UTC

Region Server down when using the Export tool to back up tables

Hi all:

I'm using cdh3u3 (based on hbase-0.90.4 and hadoop-0.20.2), and my cluster
contains about 15 servers. There is about 10 TB of data in HDFS, and roughly
half of it is in HBase. Our customized MapReduce jobs, which do not need to
scan a whole HBase table, run fine. However, when I try to back up HBase
tables with the Export tool provided by HBase, one region server goes down
and the backup MapReduce job fails. The region server log looks like this:


2012-04-04 10:11:53,817 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running
rollback/cleanup of failed split of
dailylaunchindex,2012-03-10_4e045076431fe31e74000032_d645cc647e72c5f1cc1ff3c460dcd515,1333303778356.2262c07cfc672237e61aa6113e785f55.;
Failed dp13.abcd.com
,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
java.io.IOException: Failed dp13.abcd.com
,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
at
org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:297)
at
org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:156)
at
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:87)
Caused by: java.net.SocketTimeoutException: Call to
dp7.abcd.com/10.18.10.60:60020 failed on socket timeout exception:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=
dp7.abcd.com/10.18.10.60:60020]
at
org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
at $Proxy9.put(Unknown Source)
at
org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:122)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1392)
at
org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:375)
at
org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:342)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while
waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=
dp7.abcd.com/10.18.10.60:60020]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
*2012-04-04 10:11:53,821 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=dp13.abcd.com,60020,1333436117207, load=(requests=18470,
regions=244, usedHeap=6108, maxHeap=7973): Abort; we got an error after
point-of-no-return*
2012-04-04 10:11:53,821 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
requests=6015, regions=244, stores=244, storefiles=557,
storefileIndexSize=414, memstoreSize=1792, compactionQueueSize=4,
flushQueueSize=0, usedHeap=6156, maxHeap=7973, blockCacheSize=1335446112,
blockCacheFree=336613152, blockCacheCount=20071,
blockCacheHitCount=65577505, blockCacheMissCount=30264896,
blockCacheEvictedCount=23463221, blockCacheHitRatio=68,
blockCacheHitCachingRatio=73
2012-04-04 10:11:53,824 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got
an error after point-of-no-return
2012-04-04 10:11:53,824 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver60020.compactor exiting
2012-04-04 10:11:53,967 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2012-04-04 10:11:54,062 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver60020.cacheFlusher exiting
2012-04-04 10:11:54,837 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner -7174278054087519478
2012-04-04 10:11:54,951 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner 5883825799758583233
2012-04-04 10:11:55,224 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner 5800828333591092756
2012-04-04 10:11:55,261 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner 5153473163996089139
2012-04-04 10:11:55,332 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner 3494993576774767091
2012-04-04 10:11:55,684 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner -1265087592996306143
2012-04-04 10:11:55,849 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
and client tried to access missing scanner -7174278054087519478
...
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 27 on 60020: exiting
2012-04-04 10:11:55,930 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to
stop the worker thread
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 25 on 60020: exiting
2012-04-04 10:11:55,933 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
*2012-04-04 10:11:55,933 WARN
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
inteurrpted while waiting for task, exiting*
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:205)
at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
at java.lang.Thread.run(Thread.java:662)
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 26 on 60020: exiting
2012-04-04 10:11:55,933 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
dp13.abcd.com,60020,1333436117207 exiting


My questions are:

1. Can I tune some parameters so that the Export MapReduce job works? (A
sketch of how I launch the export is below, for reference.)
2. Is there any other way to back up my HBase tables in this situation? I
don't have another cluster, and I cannot stop serving traffic when I need
to back up the tables.
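
For reference, this is roughly how I launch the export; the table name,
output path, and the scanner-caching override are placeholders/examples
rather than my exact command (I believe Export picks up -D overrides
through GenericOptionsParser in this version):

# Sketch of the Export invocation on CDH3u3 / HBase 0.90.x.
# "mytable" and the HDFS output directory are placeholders for my real values.
# The -D override is only an example of a per-job client-side setting.
hbase org.apache.hadoop.hbase.mapreduce.Export \
    -D hbase.client.scanner.caching=100 \
    mytable /backup/mytable/2012-04-04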


Thanks for any advice on this issue.

-- 
With Regards!

Ye, Qian

Re: Region Server down when using the Export tool to back up tables

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The log says that the region server tried to talk to the region server
"dp7.abcd.com" and the call timed out after 60 seconds, and that happened
during a split, which is pretty bad. As the log says:

org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got
an error after point-of-no-return

So what happened to that machine?

I understand the logs can look opaque, but they usually give you some
clue, so please investigate them on that machine, and please don't post
them back here without analyzing them.
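
For example, something along these lines on dp7 is a reasonable starting
point (the log locations below are the CDH3 package defaults and only an
assumption; adjust them to your install):

# Region server log on dp7.abcd.com around the time of the failed call (10:11):
grep "2012-04-04 10:1" /var/log/hbase/*-regionserver-*.log | less
# Long GC pauses or a struggling disk/DataNode are common causes of 60-second
# stalls, so the DataNode log and the GC log (if enabled) are worth a look too:
less /var/log/hadoop-0.20/*-datanode-*.log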

J-D

On Wed, Apr 4, 2012 at 8:09 PM, Qian Ye <ye...@gmail.com> wrote: