Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/09/10 21:59:44 UTC

[jira] Issue Comment Edited: (HADOOP-4132) high rate of task failures because of bad or full datanodes

    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629933#action_12629933 ] 

ckunz edited comment on HADOOP-4132 at 9/10/08 12:58 PM:
------------------------------------------------------------------

Although I do not completely disagree (I can still imagine that some smart algorithm could leverage knowledge about available space to ensure a better distribution without overloading any particular datanode), I wonder whether there is a bug in how blocks are assigned.
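
To illustrate the kind of algorithm meant above, here is a rough sketch of space-aware target selection: pick replica targets weighted by remaining space and skip nodes that are nearly full or already hold far more blocks than the cluster average. The class, field, and threshold names are made up; this is not an existing Hadoop placement policy.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Purely illustrative sketch of space-aware replica placement.
// Nothing here is a Hadoop API; all names and thresholds are hypothetical.
public class SpaceAwarePlacementSketch {
    static class Node {
        String name;
        long remainingBytes;
        long blockCount;
        Node(String name, long remainingBytes, long blockCount) {
            this.name = name; this.remainingBytes = remainingBytes; this.blockCount = blockCount;
        }
    }

    static Node chooseTarget(List<Node> nodes, long blockSize, double maxLoadFactor, Random rnd) {
        double avgBlocks = nodes.stream().mapToLong(n -> n.blockCount).average().orElse(0);
        List<Node> candidates = new ArrayList<>();
        long totalSpace = 0;
        for (Node n : nodes) {
            // exclude nodes that cannot hold the block or already carry far more than average
            if (n.remainingBytes < blockSize) continue;
            if (avgBlocks > 0 && n.blockCount > maxLoadFactor * avgBlocks) continue;
            candidates.add(n);
            totalSpace += n.remainingBytes;
        }
        if (candidates.isEmpty()) return null;
        // weighted pick: more free space gives a proportionally higher chance
        long pick = (long) (rnd.nextDouble() * totalSpace);
        for (Node n : candidates) {
            pick -= n.remainingBytes;
            if (pick < 0) return n;
        }
        return candidates.get(candidates.size() - 1);
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("dn1", 500L << 30, 1000));   // plenty of space, lightly loaded
        nodes.add(new Node("dn2",  10L << 30, 9000));   // nearly full and overloaded
        nodes.add(new Node("dn3", 300L << 30, 1200));
        Node target = chooseTarget(nodes, 128L << 20, 2.0, new Random());
        System.out.println("chosen target: " + (target == null ? "none" : target.name));
    }
}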

I checked the block distribution before and after a failed job.
Of 1623 nodes, the top 5% received at least 27% of the new blocks.
Before the job started, the 5% of nodes with the highest block counts held 7% of all blocks;
at the end of the failed job the top 5% held 10% of all blocks.
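
A minimal sketch of how percentages like these can be computed, assuming per-datanode block counts have already been collected from separate reports (the numbers below are placeholders, not this cluster's actual values):

import java.util.Arrays;

// Sketch: what share of all blocks (and of newly written blocks) sits on the
// most heavily loaded 5% of nodes. Collecting the per-datanode counts, e.g.
// from fsck or datanode reports, is assumed to happen elsewhere.
public class BlockSkewSketch {

    // Share of the total held by the most heavily loaded `fraction` of nodes.
    static double topShare(long[] counts, double fraction) {
        long[] sorted = counts.clone();
        Arrays.sort(sorted);                                   // ascending
        int top = Math.max(1, (int) (sorted.length * fraction));
        long total = 0, topSum = 0;
        for (long c : sorted) total += c;
        for (int i = sorted.length - top; i < sorted.length; i++) topSum += sorted[i];
        return total == 0 ? 0.0 : (double) topSum / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts standing in for real reports from a 1623-node cluster.
        long[] before = new long[1623];
        long[] after = new long[1623];
        Arrays.fill(before, 1000);
        Arrays.fill(after, 1100);
        for (int i = 0; i < 80; i++) after[i] = 5000;          // a few nodes absorb far more

        long[] newBlocks = new long[before.length];
        for (int i = 0; i < before.length; i++) newBlocks[i] = after[i] - before[i];

        System.out.printf("top 5%% share before job:    %.1f%%%n", 100 * topShare(before, 0.05));
        System.out.printf("top 5%% share after job:     %.1f%%%n", 100 * topShare(after, 0.05));
        System.out.printf("top 5%% share of new blocks: %.1f%%%n", 100 * topShare(newBlocks, 0.05));
    }
}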

Does this make sense? Is this something that was introduced going from 0.16 to 0.17.1 and then fixed in 0.17.2?

I will deploy 0.17.2 and rerun the same job.

> high rate of task failures because of bad or full datanodes
> -----------------------------------------------------------
>
>                 Key: HADOOP-4132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4132
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> With 0.17 we notice a high rate of task failures because the same bad datanodes are repeatedly reported as firstBadLink. We never saw this in 0.16.
> After running fewer than 20,000 map tasks, more than 2,500 of them reported one particular datanode as firstBadLink, with a typical exception of the form:
> 08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
> 08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
> 08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
> 08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
> 08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> 08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
> Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...
> With several such bad datanodes, the probability of jobs failing goes up considerably.
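
The log above shows the client-side pattern of allocating a block, failing to set up the write pipeline, abandoning the block, and retrying. A minimal sketch of that pattern (not the actual DFSClient code; all names are placeholders) illustrates why repeatedly being handed the same bad or full datanode eventually aborts the write:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Sketch of the retry pattern visible in the log: allocate targets, try to build
// the write pipeline, abandon the block on failure, retry a bounded number of
// times, then give up. If the allocator keeps returning the same bad or full
// datanode, every attempt fails the same way.
public class PipelineRetrySketch {

    interface PipelineConnector {
        void connect(List<String> targets) throws IOException;
    }

    static void writeBlock(Supplier<List<String>> allocateTargets,
                           PipelineConnector connector,
                           int maxRetries) throws IOException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            List<String> targets = allocateTargets.get();
            try {
                connector.connect(targets);   // may time out or get a bad connect ack
                return;                       // pipeline established, stream data
            } catch (IOException e) {
                // The log above shows the real client abandoning the block at this point.
                System.out.println("Abandoning block, first bad link: " + e.getMessage());
            }
        }
        throw new IOException("Unable to create new block.");
    }

    public static void main(String[] args) {
        // Hypothetical scenario: the allocator always includes the same bad node,
        // so every pipeline attempt fails and the write ultimately aborts.
        Supplier<List<String>> allocator =
                () -> Arrays.asList("good-node:50010", "bad-node:50010", "good-node2:50010");
        PipelineConnector connector = targets -> {
            if (targets.contains("bad-node:50010"))
                throw new IOException("bad-node:50010");
        };
        try {
            writeBlock(allocator, connector, 3);
        } catch (IOException e) {
            System.out.println("Could not get block locations. Aborting... (" + e.getMessage() + ")");
        }
    }
}

With a bounded number of retries, a single persistently bad first link in the pipeline is enough to fail the write, and with it the task.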

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.