Posted to common-user@hadoop.apache.org by Bhupesh Bansal <bb...@linkedin.com> on 2009/04/02 22:32:44 UTC

Lost TaskTracker Errors

Hey Folks, 

For the last 2-3 days I have been seeing many of these errors popping up in our
Hadoop cluster.

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

JobTracker logs don't have any more info, and TaskTracker logs are clean.

The failures occurred with these symptoms:
1. Datanodes start timing out
2. HDFS gets extremely slow (hdfs -ls takes about 2 mins vs. 1s in normal mode)

The datanode logs on failing tasktracker nodes are filled up with
2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 480000 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
        at java.lang.Thread.run(Thread.java:619)
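
For what it's worth, the two numbers in those messages look like config defaults: the ~604 second kill presumably corresponds to mapred.task.timeout (600000 ms by default) and the "480000 millis" write timeout to dfs.datanode.socket.write.timeout (480000 ms by default). We don't override either of these as far as I know, so the effective settings should be roughly:

  <!-- sketch of the (believed) defaults in play, not values we set explicitly -->
  <property>
    <name>mapred.task.timeout</name>
    <value>600000</value>  <!-- ms a task may go without reporting status before it is killed -->
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>480000</value>  <!-- ms before the datanode's block-transfer write times out -->
  </property>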


We are running a 10-node cluster (hadoop-0.18.1) on dual quad-core boxes (8 GB
RAM) with these properties (repeated as a hadoop-site.xml sketch after the list):
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 102400000
6. dfs.datanode.max.xcievers = 512
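
For reference, these map onto hadoop-site.xml entries roughly as follows (same values as the list above, with the child opts spelled out as the full JVM flag):

  <!-- excerpt from hadoop-site.xml, values as listed above -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx600M</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>102400000</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>512</value>
  </property>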

The map jobs write a ton of data for each record. Will increasing
"dfs.datanode.handler.count" help in this case? What other configuration
changes can I try?
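
In other words, if bumping the datanode settings is the right direction, is something like the change below roughly what I should be trying? (The new values 40 and 2048 are just guesses on my part, not taken from any doc.)

  <!-- hypothetical hadoop-site.xml change; values are guesses -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>40</value>    <!-- currently 10 -->
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>  <!-- currently 512 -->
  </property>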


Best
Bhupesh