Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/09/09 18:07:44 UTC

[jira] Created: (HADOOP-4132) high rate of task failures because of bad datanodes

high rate of task failures because of bad datanodes
---------------------------------------------------

                 Key: HADOOP-4132
                 URL: https://issues.apache.org/jira/browse/HADOOP-4132
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.1
            Reporter: Christian Kunz


With 0.17 we notice a high rate of task failures because the same bad datanodes are repeatedly reported as firstBadLink. We never saw this in 0.16.

After running fewer than 20,000 map tasks, more than 2,500 of them reported one particular datanode as firstBadLink, with exceptions typically of the form:

08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...

With several such bad datanodes, the probability of jobs failing goes up a lot.


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630230#action_12630230 ] 

Christian Kunz commented on HADOOP-4132:
----------------------------------------

In addition to running under 0.17.2, I started running the balancer in the background for a while to ensure better block distribution. So far it looks much better than before, but I would like to keep this bug open (no longer as a blocker) because I would like to monitor the cluster for possible degradation over time.


[jira] Issue Comment Edited: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629935#action_12629935 ] 

rangadi edited comment on HADOOP-4132 at 9/10/08 1:16 PM:
---------------------------------------------------------------

Smarter is better than less smart :-) (though it is possible to be too smart).

Though you may not be affected, another thing that influences load is how many blocks a node is currently writing. Slower nodes (either because of slower hardware or slower client writes) overall get fewer blocks assigned than faster nodes: at any time the NameNode tries to keep the number of blocks currently being written on a node below 2*avg.

I think the fact that slower clients can influence the load on a datanode should be fixed.
 

      was (Author: rangadi):
    smarter is better than less smart :-)

Though you may not be affected, another thing that influences load is how fast a datanode can handle writes: slower nodes (either because of slower hardware or slower clients) overall get fewer blocks assigned than faster nodes, because at any time the NameNode tries to keep the number of blocks currently being written on a node below 2*avg.
 
  
[jira] Issue Comment Edited: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629933#action_12629933 ] 

ckunz edited comment on HADOOP-4132 at 9/10/08 12:58 PM:
------------------------------------------------------------------

Although I do not completely disagree (I still can imagine that some smart algorithm could leverage knowledge about space to ensure better distribution without overloading any particular datanode), I wonder whether there is a bug in assigning blocks.

I checked the block distribution before and after a failed job.
Of 1,623 nodes, 5% of the nodes received at least 27% of the new blocks.
Before the job started, the 5% of nodes with the highest number of blocks held 7% of all blocks;
at the end of the failed job the top 5% held 10% of all blocks.

Does this make sense? Is this something that was introduced going from 0.16 to 0.17.1 but then fixed in 0.17.2?

I will deploy 0.17.2 and rerun the same job.

      was (Author: ckunz):
    Although I do not completely disagree (I still can imagine that some smart algorithm could leverage knowledge about space to ensure better distribution without overloading any particular datanode), I wonder whether there is a bug in assigning blocks.

I checked the block distribution before and after a failed job.
Of 1,623 nodes, 5% of the nodes received at least 27% of the new blocks.
Before the job started, the 5% of nodes with the highest number of blocks held 7% of all blocks;
at the end of the failed job the top 5% held 10% of all blocks.

Does this make sense? Is this something that was introduced going from 0.16 to 0.17.1 but then fixed in 0.17.2?
  

[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630260#action_12630260 ] 

dhruba borthakur commented on HADOOP-4132:
------------------------------------------

One can create a file even if there is no additional free disk space in the cluster. If an application then tries to allocate a block for this file, the block allocation will fail, but the zero-length file will continue to exist. If the application is happy with a zero-length file, it can proceed. Does this make sense?
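
For illustration, a minimal sketch of the behaviour described above using the standard FileSystem client API; the path, the data written, and the error handling are assumptions, not taken from this issue:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZeroLengthFileExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() only makes the namespace entry; it can succeed even when the
        // cluster has no free space left, because no block is allocated yet.
        FSDataOutputStream out = fs.create(new Path("/tmp/maybe-empty"));
        try {
            // Blocks are allocated once data is actually written/flushed, so a
            // full cluster typically surfaces here (or on close) as an IOException.
            out.write("some output".getBytes());
            out.close();
        } catch (IOException e) {
            // The zero-length file still exists; the application can decide
            // whether an empty output is acceptable or should be deleted.
            System.err.println("block allocation failed: " + e.getMessage());
        }
    }
}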


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630316#action_12630316 ] 

Christian Kunz commented on HADOOP-4132:
----------------------------------------

Raghu, thank you for pointing this out. As a matter of fact, we already did that, after lessons learned on another cluster with 0.17.2.


[jira] Updated: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-4132:
-----------------------------------

    Summary: high rate of task failures because of bad or full datanodes  (was: high rate of task failures because of bad of full datanodes)


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630317#action_12630317 ] 

Raghu Angadi commented on HADOOP-4132:
--------------------------------------

I hope the default is increased too. I don't see much of a reason for 256.


[jira] Updated: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-4132:
-----------------------------------

    Priority: Major  (was: Blocker)


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630282#action_12630282 ] 

Raghu Angadi commented on HADOOP-4132:
--------------------------------------

> In addition to running under 0.17.2 [...]

Christian, you might need to increase "dfs.datanode.max.xcievers" to something like 1000+. The default is 256. If I remember correctly, you have many simultaneous readers and writers on each datanode.
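
For illustration, a minimal sketch of how that limit is picked up from the site configuration (conf/hadoop-site.xml on the datanodes in this release); the surrounding class is an assumption, not the actual DataNode code:

import org.apache.hadoop.conf.Configuration;

public class XceiverLimitCheck {
    public static void main(String[] args) {
        // Configuration reads the site file from the classpath; an operator
        // raising the limit would set dfs.datanode.max.xcievers there, e.g. to 1024.
        Configuration conf = new Configuration();
        int maxXceivers = conf.getInt("dfs.datanode.max.xcievers", 256);  // 256 is the shipped default
        System.out.println("effective dfs.datanode.max.xcievers = " + maxXceivers);
    }
}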


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629893#action_12629893 ] 

dhruba borthakur commented on HADOOP-4132:
------------------------------------------

I agree with Raghu. As long as the datanode has enough space, it can get load. However, the random selection of datanodes (within a rack) usually balances the disk-space usage on a cluster. Also, while deleting excess replicas, HDFS tries to delete replicas on machines that have the least amount of free space.
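
For illustration, a minimal sketch of that deletion preference, assuming the remaining free space of each node holding an excess replica is already known; the names are assumptions, not the actual replication code:

import java.util.HashMap;
import java.util.Map;

public class ExcessReplicaChoice {
    // Given the remaining free space (bytes) on each node holding an excess
    // replica, pick the node to delete from: the one with the least free space.
    static String pickNodeToDeleteFrom(Map<String, Long> remainingSpaceByNode) {
        String choice = null;
        long least = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : remainingSpaceByNode.entrySet()) {
            if (e.getValue() < least) {
                least = e.getValue();
                choice = e.getKey();
            }
        }
        return choice;
    }

    public static void main(String[] args) {
        Map<String, Long> nodes = new HashMap<String, Long>();
        nodes.put("dn1", 500L * 1024 * 1024);  // 500 MB free
        nodes.put("dn2", 20L * 1024 * 1024);   // 20 MB free -> chosen
        nodes.put("dn3", 80L * 1024 * 1024);
        System.out.println(pickNodeToDeleteFrom(nodes));  // prints dn2
    }
}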


[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630254#action_12630254 ] 

Robert Chansler commented on HADOOP-4132:
-----------------------------------------

The NameNode should be able to continue to provide administrative functions and file access even if file creations must be delayed or deferred. If the policy is to refuse creations, the client should receive an unambiguous message.

So what should the policy be? It is probably infeasible to allocate the very last block. Is the cluster full when it takes too long to find a block? When too many DataNodes report high utilization? If replication attempts fail too often?



[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629935#action_12629935 ] 

Raghu Angadi commented on HADOOP-4132:
--------------------------------------

Smarter is better than less smart :-)

Though you may not be affected, another thing that influences load is how fast a datanode can handle writes: slower nodes (either because of slower hardware or slower clients) overall get fewer blocks assigned than faster nodes, because at any time the NameNode tries to keep the number of blocks currently being written on a node below 2*avg.
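
For illustration, a minimal sketch of the 2*avg cut-off described above, assuming a simple count of in-flight writes per node; the names are assumptions, not the actual NameNode target-selection code:

public class LoadCutoff {
    // Returns true if a candidate datanode should be skipped as a write target
    // because it is already handling more than twice the cluster-average load.
    static boolean overloaded(int nodeActiveWrites, long clusterActiveWrites, int liveNodes) {
        if (liveNodes == 0) {
            return false;
        }
        double avg = (double) clusterActiveWrites / liveNodes;
        return nodeActiveWrites > 2.0 * avg;  // the "2*avg" threshold mentioned above
    }

    public static void main(String[] args) {
        // A node with 30 in-flight writes in a 1000-node cluster averaging 10
        // writes per node would be passed over for new blocks.
        System.out.println(overloaded(30, 10000, 1000));  // prints true
    }
}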
 

[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629839#action_12629839 ] 

Raghu Angadi commented on HADOOP-4132:
--------------------------------------

If you have almost-full datanodes, you need HADOOP-3707. You are probably hitting HADOOP-3633 and HADOOP-3685. These are all committed in 0.17.2. The logs on some of the datanodes would show more info.

> I would have assumed that when writing to dfs, datanodes on the same rack with more space are preferred over datanodes with less space. Am I wrong?

Not really. That would result in a load imbalance. As long as a datanode has enough space, it gets load.



[jira] Commented: (HADOOP-4132) high rate of task failures because of bad or full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629933#action_12629933 ] 

Christian Kunz commented on HADOOP-4132:
----------------------------------------

Although I do not completely disagree (I can still imagine that some smart algorithm could leverage knowledge about space to ensure better distribution without overloading any particular datanode), I wonder whether there is a bug in assigning blocks.

I checked the block distribution before and after a failed job.
Of 1,623 nodes, 5% of the nodes received at least 27% of the new blocks.
Before the job started, the 5% of nodes with the highest number of blocks held 7% of all blocks;
at the end of the failed job the top 5% held 10% of all blocks.

Does this make sense? Is this something that was introduced going from 0.16 to 0.17.1 but then fixed in 0.17.2?
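
For illustration, a minimal sketch of the measurement being quoted, assuming per-datanode block counts have already been collected (how they are collected is left out):

import java.util.Arrays;

public class TopShare {
    // Fraction of all blocks held by the most heavily loaded 5% of nodes.
    static double topFivePercentShare(long[] blocksPerNode) {
        long[] sorted = blocksPerNode.clone();
        Arrays.sort(sorted);  // ascending, fullest nodes at the end
        int top = Math.max(1, (int) Math.ceil(sorted.length * 0.05));
        long total = 0;
        long topSum = 0;
        for (long b : sorted) {
            total += b;
        }
        for (int i = sorted.length - top; i < sorted.length; i++) {
            topSum += sorted[i];
        }
        return total == 0 ? 0.0 : (double) topSum / total;
    }

    public static void main(String[] args) {
        // A toy 4-node example: one hot node holding half of all blocks.
        System.out.println(topFivePercentShare(new long[] {100, 100, 100, 300}));  // prints 0.5
    }
}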


[jira] Updated: (HADOOP-4132) high rate of task failures because of bad of full datanodes

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-4132:
-----------------------------------

    Priority: Blocker  (was: Major)
     Summary: high rate of task failures because of bad of full datanodes  (was: high rate of task failures because of bad datanodes)

The problem occurs not only with bad datanodes but also with full datanodes.

During a long-running job, the distribution of blocks written to dfs is not uniform. With dfs 35% full, 5% of the datanodes are already completely full, typically holding double the average number of blocks.

I would have assumed that when writing to dfs, datanodes on the same rack with more space are preferred over datanodes with less space. Am I wrong?
