Posted to user@hbase.apache.org by Jay Talreja <jt...@dataraker.com> on 2012/07/30 18:17:55 UTC
Region Server failure due to remote data node errors
A couple of our region servers (in a 16-node cluster) crashed due to
underlying Data Node errors. I am trying to understand how errors on
remote data nodes impact other region server processes.

To briefly describe what happened:
1) Cluster was in operation. All 16 nodes were up, reads and writes were
happening extensively.
2) Nodes 7 and 8 were shut down for maintenance. (There was no graceful
shutdown; the DN and RS services were still running and the power was
simply pulled.)
3) Nodes 2 and 5 flushed, and the DFS client started reporting errors.
From the log it appears that DFS blocks were being replicated to the
nodes that had been shut down (7 and 8); since replication could not
complete successfully, the DFS client raised errors on nodes 2 and 5,
and eventually the region servers themselves died.
The question I am trying to get an answer to is: is a region server
immune to errors on remote data nodes that are part of its replication
pipeline, or not?
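To make sure I am asking the right question, here is how I currently
picture the write pipeline, as a toy sketch in plain Java (this is just
my mental model, not Hadoop code, and the node assignments are made up
for illustration): the region server's DFS client talks only to the
first datanode, each datanode forwards packets downstream, and the ack
has to travel back through every node, so a node that is down anywhere
in the chain surfaces as an error on the client side. Please correct me
if this model is wrong.

import java.util.Arrays;
import java.util.List;

// Toy model of an HDFS write pipeline (NOT Hadoop code). The client only
// talks to the first datanode; each datanode forwards the packet to the
// next one, and a dead node anywhere downstream shows up as an error at
// the client (the region server).
public class PipelineSketch {

    static class DataNode {
        final String addr;
        boolean alive = true;
        DataNode(String addr) { this.addr = addr; }

        // Forward a packet along the pipeline; fail if the next hop is down.
        void write(String packet, List<DataNode> downstream) {
            if (downstream.isEmpty()) {
                return; // last node in the pipeline, ack flows back from here
            }
            DataNode next = downstream.get(0);
            if (!next.alive) {
                // Mirrors the "Bad connect ack with firstBadLink as ..." line
                // in the region server log below.
                throw new RuntimeException(
                        "Bad connect ack with firstBadLink as " + next.addr);
            }
            next.write(packet, downstream.subList(1, downstream.size()));
        }
    }

    public static void main(String[] args) {
        DataNode local = new DataNode("10.128.204.225:50010"); // node 5's own DN
        DataNode mid   = new DataNode("10.128.204.221:50010");
        DataNode dead  = new DataNode("10.128.204.228:50010"); // powered off
        dead.alive = false;

        List<DataNode> pipeline = Arrays.asList(local, mid, dead);
        try {
            pipeline.get(0).write("flush packet",
                    pipeline.subList(1, pipeline.size()));
        } catch (RuntimeException e) {
            // The client (region server) sees the failure even though the
            // dead datanode is remote.
            System.out.println("Client-side error: " + e.getMessage());
            System.out.println("Excluding datanode " + dead.addr);
        }
    }
}

If that model is right, I would have expected the DFS client to exclude
the bad datanode and rebuild the pipeline (which the log below seems to
show it attempting) rather than the whole region server dying.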
Part of the Region Server log (Node 5):
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010
I can pastebin the entire log, but this is where things started going
wrong for Node 5; eventually the RS shutdown hook ran and the region
server was shut down.
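In case it is relevant, I am also going to double-check the HDFS timeout
settings these nodes are running with. The small sketch below is just how
I plan to dump them from the client's point of view; the property names
(dfs.socket.timeout, dfs.datanode.socket.write.timeout, dfs.replication)
are what I believe apply to our Hadoop version, so please correct me if
they are wrong or if other settings matter more here.

import org.apache.hadoop.conf.Configuration;

// Sketch: dump the HDFS client/datanode timeouts this node actually sees.
// The property names are an assumption on my part; they may differ by
// Hadoop version.
public class DumpHdfsTimeouts {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // picked up from the classpath

        String[] keys = {
            "dfs.socket.timeout",                // client <-> datanode read timeout
            "dfs.datanode.socket.write.timeout", // write-side timeout (related to the 15000 ms in the stack trace?)
            "dfs.replication"                    // pipeline length
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key, "<default>"));
        }
    }
}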
Any help in troubleshooting this is greatly appreciated.
Thanks,
Jay