Posted to hdfs-user@hadoop.apache.org by Maoke <fi...@gmail.com> on 2012/05/25 04:59:44 UTC

datanode hang up after socket connection reset

hi,

we ran into a problem where a datanode hung after a socket "connection
reset" exception occurred on it. from then on, the hbase running over the
hdfs could not access the user data tables (-ROOT- and .META. were not
affected) until we manually stopped the bad datanode. our environment is an
8-datanode cluster with regionservers running on top of it. the namenode and
the hbase master run on 2 machines separate from these 8 nodes. hadoop
version is 0.20.2 and hbase is 0.20.6.

the logs related to the troubled data block are as follows (we collected
them from multiple datanodes and the hbase regionserver), sorted by time
(though there is a slight clock difference between nodes). i have three
questions:
1. why does the connection reset socket exception, although it is caught,
still hang the datanode wk008?
2. why, when only one datanode failed, did all user table regions become
inaccessible through hbase?
3. is there a known bugfix for this issue?
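
for anyone who wants to rebuild the same per-block timeline from their own
logs, here is a minimal python sketch of how the lines below could be
collected and ordered. it assumes grep-style input where each entry has been
joined onto a single line in the form "<log file>:<timestamp> <LEVEL>
<message>"; the names LINE_RE, parse_line and block_timeline are made up for
this illustration and are not part of hadoop or hbase. the ordering is only
as good as the clock sync between nodes, so entries a few hundred
milliseconds apart on different hosts may be swapped.

```python
import re
from datetime import datetime

# Matches grep-style output: "<log file>:<yyyy-mm-dd HH:MM:SS,mmm> <LEVEL> <rest>".
# The file name itself contains no colon, so the first ":" followed by a
# timestamp separates the file name from the log entry.
LINE_RE = re.compile(
    r"^(?P<file>\S+?):(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<rest>.*)$"
)


def parse_line(line):
    """Return (timestamp, file, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("file"), m.group("level"), m.group("rest")


def block_timeline(lines, block_id):
    """Keep only entries that mention block_id and sort them by timestamp."""
    events = []
    for line in lines:
        if block_id not in line:
            continue
        parsed = parse_line(line)
        if parsed is not None:
            events.append(parsed)
    # Sorting uses each node's local clock; small skew between hosts is not corrected.
    events.sort(key=lambda e: e[0])
    return events
```

passing the block id without the generation stamp (for example
"blk_9016752944216030896") keeps both the _4468475 and _4468477 entries in
one timeline, which is useful here because the block is re-stamped during
recovery.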

hadoop-hadoop-datanode-str-wk008.p-prd.log.2012-05-20:2012-05-20
17:13:49,854 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468475 src:
/192.168.128.114:41922 dest: /192.168.128.114:50010

2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.net.SocketTimeoutException: 15000 millis timeout while
waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.128.114:41922
remote=/192.168.128.114:50010]
   at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
   at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
   at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2314)

hbase-hadoop-regionserver-str-wk008.p-prd.log.2012-05-20:2012-05-20
17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
block blk_9016752944216030896_4468475 bad datanode[0] 192.168.128.114:50010

hbase-hadoop-regionserver-str-wk008.p-prd.log.2012-05-20:2012-05-20
17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
block blk_9016752944216030896_4468475 in pipeline 192.168.128.114:50010,
192.168.128.104:50010, 192.168.128.105:50010: bad datanode
192.168.128.114:50010

hadoop-hadoop-datanode-str-wk008.p-prd.log.2012-05-20:2012-05-20
17:15:19,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder blk_9016752944216030896_4468475 2 Exception
java.net.SocketException: Connection reset
   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
   at
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:132)
   at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:875)
   at java.lang.Thread.run(Thread.java:619)

/* after this message wk008 stopped generating any log messages until it
was rebooted */

hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
17:17:20,843 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468475 src:
/192.168.128.114:55438 dest: /192.168.128.104:50010

hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
17:17:20,858 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468475 src:
/192.168.128.104:43219 dest: /192.168.128.105:50010

hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
17:18:08,307 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Client
calls recoverBlock(block=blk_9016752944216030896_4468475, targets=[
192.168.128.104:50010, 192.168.128.105:50010])

hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
17:18:08,322 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_9016752944216030896_4468475(length=1040384),
newblock=blk_9016752944216030896_4468477(length=1040384), datanode=
192.168.128.104:50010

/* after wk008 was shut down and restarted */

hadoop-hadoop-datanode-str-wk007.p-prd.log.2012-05-20:2012-05-20
18:36:07,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468477 src:
/192.168.128.104:35297 dest: /192.168.128.109:50010
hadoop-hadoop-datanode-str-wk007.p-prd.log.2012-05-20:2012-05-20
18:36:07,700 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_9016752944216030896_4468477 src: /192.168.128.104:35297 dest: /
192.168.128.109:50010 of size 2112732
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Exception in receiveBlock for block blk_9016752944216030896_4468475
java.net.SocketException: Connection reset
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 1 for block blk_9016752944216030896_4468475 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder blk_9016752944216030896_4468475 1 : Thread is interrupted.
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder blk_9016752944216030896_4468475 1 Exception
java.net.SocketException: Socket closed
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_9016752944216030896_4468475 received exception
java.io.IOException: Interrupted receiveBlock
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Exception in receiveBlock for block blk_9016752944216030896_4468475
java.io.EOFException: while trying to read 65557 bytes
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 0 for block blk_9016752944216030896_4468475 Interrupted.
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 0 for block blk_9016752944216030896_4468475 terminating
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_9016752944216030896_4468475 received exception
java.io.EOFException: while trying to read 65557 bytes
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,312 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_9016752944216030896_4468475(length=1040384),
newblock=blk_9016752944216030896_4468477(length=1040384), datanode=
192.168.128.105:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,329 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468477 src:
/192.168.128.114:54157 dest: /192.168.128.104:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,329 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen
already-open Block for append blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,331 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_9016752944216030896_4468477 src:
/192.168.128.104:46723 dest: /192.168.128.105:50010
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,331 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen
already-open Block for append blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,426 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing
block file offset of block blk_9016752944216030896_4468477 from 0 to
1040384 meta file offset to 8135
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing
block file offset of block blk_9016752944216030896_4468477 from 0 to
1040384 meta file offset to 8135
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,503 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
192.168.128.104:46723, dest: /192.168.128.105:50010, bytes: 2112732, op:
HDFS_WRITE, cliID: DFSClient_1758679091, srvID:
DS-750403221-192.168.128.105-50010-1301616586785, blockid:
blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,503 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
192.168.128.114:54157, dest: /192.168.128.104:50010, bytes: 2112732, op:
HDFS_WRITE, cliID: DFSClient_1758679091, srvID:
DS-1756587443-192.168.128.104-50010-1301616567281, blockid:
blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20
18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 0 for block blk_9016752944216030896_4468477 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 1 for block blk_9016752944216030896_4468477 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:38:14,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.128.104:50010,
storageID=DS-1756587443-192.168.128.104-50010-1301616567281,
infoPort=50075, ipcPort=50020) Starting thread to transfer block
blk_9016752944216030896_4468477 to 192.168.128.109:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20
18:38:14,150 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.128.104:50010,
storageID=DS-1756587443-192.168.128.104-50010-1301616567281,
infoPort=50075, ipcPort=50020):Transmitted block
blk_9016752944216030896_4468477 to /192.168.128.109:50010