Posted to common-user@hadoop.apache.org by Bhupesh Bansal <bb...@linkedin.com> on 2009/04/02 22:32:44 UTC

Lost TaskTracker Errors

Hey Folks, 

For the last 2-3 days I have been seeing many of these errors popping up in our
Hadoop cluster.

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

JobTracker logs don't have any more info, and TaskTracker logs are clean.

The failures occurred with these symptoms:
1. Datanodes start timing out
2. HDFS gets extremely slow (hdfs -ls takes about 2 mins vs. 1s in normal mode)

The datanode logs on failing tasktracker nodes are filled up with
2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 480000 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
        at java.lang.Thread.run(Thread.java:619)
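
For what it's worth, the two numbers in those messages look like config defaults: the ~604 second kill presumably corresponds to mapred.task.timeout (600000 ms by default) and the "480000 millis" write timeout to dfs.datanode.socket.write.timeout (480000 ms by default). We don't override either of these as far as I know, so the effective settings should be roughly:

  <!-- sketch of the (believed) defaults in play, not values we set explicitly -->
  <property>
    <name>mapred.task.timeout</name>
    <value>600000</value>  <!-- ms a task may go without reporting status before it is killed -->
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>480000</value>  <!-- ms before the datanode's block-transfer write times out -->
  </property>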


We are running a 10-node cluster (hadoop-0.18.1) on dual quad-core boxes (8 GB
RAM) with these properties (repeated as a hadoop-site.xml sketch after the list):
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 102400000
6. dfs.datanode.max.xcievers = 512
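
For reference, these map onto hadoop-site.xml entries roughly as follows (same values as the list above, with the child opts spelled out as the full JVM flag):

  <!-- excerpt from hadoop-site.xml, values as listed above -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx600M</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>102400000</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>512</value>
  </property>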

The map jobs write a ton of data for each record. Will increasing
"dfs.datanode.handler.count" help in this case? What other configuration
changes can I try?
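
In other words, if bumping the datanode settings is the right direction, is something like the change below roughly what I should be trying? (The new values 40 and 2048 are just guesses on my part, not taken from any doc.)

  <!-- hypothetical hadoop-site.xml change; values are guesses -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>40</value>    <!-- currently 10 -->
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>  <!-- currently 512 -->
  </property>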


Best
Bhupesh