Posted to user@hbase.apache.org by Jay Talreja <jt...@dataraker.com> on 2012/07/30 18:17:55 UTC
Region Server failure due to remote data node errors
A couple of our region servers (in a 16-node cluster) crashed due to
underlying Data Node errors. I am trying to understand how errors on
remote data nodes impact other region server processes.

To briefly describe what happened:
1) Cluster was in operation. All 16 nodes were up, reads and writes were
happening extensively.
2) Nodes 7 and 8 were shut down for maintenance. (There was no graceful
shutdown; the DN and RS services were still running and the power was
simply pulled.)
3) Nodes 2 and 5 flushed, and the DFS client started reporting errors.
From the log it appears that DFS blocks were being replicated to the
nodes that had been shut down (7 and 8); since replication could not
complete successfully, the DFS client raised errors on nodes 2 and 5,
and eventually the region servers themselves died.
The question I am trying to get an answer to is: is a region server
immune to errors on remote data nodes that are part of its replication
pipeline, or not?
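To make sure I am asking the right question, here is how I currently
picture the write pipeline, as a toy sketch in plain Java (this is just
my mental model, not Hadoop code, and the node assignments are made up
for illustration): the region server's DFS client talks only to the
first datanode, each datanode forwards packets downstream, and the ack
has to travel back through every node, so a node that is down anywhere
in the chain surfaces as an error on the client side. Please correct me
if this model is wrong.

import java.util.Arrays;
import java.util.List;

// Toy model of an HDFS write pipeline (NOT Hadoop code). The client only
// talks to the first datanode; each datanode forwards the packet to the
// next one, and a dead node anywhere downstream shows up as an error at
// the client (the region server).
public class PipelineSketch {

    static class DataNode {
        final String addr;
        boolean alive = true;
        DataNode(String addr) { this.addr = addr; }

        // Forward a packet along the pipeline; fail if the next hop is down.
        void write(String packet, List<DataNode> downstream) {
            if (downstream.isEmpty()) {
                return; // last node in the pipeline, ack flows back from here
            }
            DataNode next = downstream.get(0);
            if (!next.alive) {
                // Mirrors the "Bad connect ack with firstBadLink as ..." line
                // in the region server log below.
                throw new RuntimeException(
                        "Bad connect ack with firstBadLink as " + next.addr);
            }
            next.write(packet, downstream.subList(1, downstream.size()));
        }
    }

    public static void main(String[] args) {
        DataNode local = new DataNode("10.128.204.225:50010"); // node 5's own DN
        DataNode mid   = new DataNode("10.128.204.221:50010");
        DataNode dead  = new DataNode("10.128.204.228:50010"); // powered off
        dead.alive = false;

        List<DataNode> pipeline = Arrays.asList(local, mid, dead);
        try {
            pipeline.get(0).write("flush packet",
                    pipeline.subList(1, pipeline.size()));
        } catch (RuntimeException e) {
            // The client (region server) sees the failure even though the
            // dead datanode is remote.
            System.out.println("Client-side error: " + e.getMessage());
            System.out.println("Excluding datanode " + dead.addr);
        }
    }
}

If that model is right, I would have expected the DFS client to exclude
the bad datanode and rebuild the pipeline (which the log below seems to
show it attempting) rather than the whole region server dying.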
Part of the Region Server log (Node 5):
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010
I can pastebin the entire log, but this is where things started going
wrong for Node 5; eventually the RS shutdown hook ran and the region
server was shut down.
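In case it is relevant, I am also going to double-check the HDFS timeout
settings these nodes are running with. The small sketch below is just how
I plan to dump them from the client's point of view; the property names
(dfs.socket.timeout, dfs.datanode.socket.write.timeout, dfs.replication)
are what I believe apply to our Hadoop version, so please correct me if
they are wrong or if other settings matter more here.

import org.apache.hadoop.conf.Configuration;

// Sketch: dump the HDFS client/datanode timeouts this node actually sees.
// The property names are an assumption on my part; they may differ by
// Hadoop version.
public class DumpHdfsTimeouts {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // picked up from the classpath

        String[] keys = {
            "dfs.socket.timeout",                // client <-> datanode read timeout
            "dfs.datanode.socket.write.timeout", // write-side timeout (related to the 15000 ms in the stack trace?)
            "dfs.replication"                    // pipeline length
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key, "<default>"));
        }
    }
}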
Any help in troubleshooting this is greatly appreciated.
Thanks,
Jay