Posted to common-dev@hadoop.apache.org by "ZhuGuanyin (JIRA)" <ji...@apache.org> on 2008/09/26 15:19:44 UTC

[jira] Created: (HADOOP-4291) MapReduce Streaming job hangs when all replications of the input file are corrupted

MapReduce Streaming job hangs when all replications of the input file are corrupted
-----------------------------------------------------------------------------------

                 Key: HADOOP-4291
                 URL: https://issues.apache.org/jira/browse/HADOOP-4291
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.18.1
            Reporter: ZhuGuanyin
            Priority: Critical


In some special cases all replicas of a given file have been truncated to zero length while the namenode still holds the original size (we don't know why). A MapReduce streaming job will hang if mapred.task.timeout is not specified and the input files include such a corrupted file; even the dfs shell "cat" hangs when fetching data from the corrupted file.

We found that the job hangs in DFSInputStream.blockSeekTo() when choosing a datanode. The following test demonstrates it (a client-side sketch of steps 1, 3 and 4 follows the list):
1)	Copy a little file to HDFS.
2)	Find the file's blocks, log in to the datanodes holding them, and truncate those block files to zero length.
3)	Cat the file through the dfs shell "cat".
4)	The cat command enters an infinite loop.
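
Below is a minimal client-side sketch of steps 1, 3 and 4, using only the public FileSystem API. The local path and the cluster configuration are illustrative, and step 2 (truncating the block files) still has to be done by hand on each datanode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatCorruptedFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();               // fs.default.name must point at the test cluster
    FileSystem fs = FileSystem.get(conf);

    Path target = new Path("/1.txt");                       // same path as in the log below
    fs.copyFromLocalFile(new Path("/tmp/1.txt"), target);   // step 1: copy a small file to HDFS

    // step 2 (manual): truncate every replica's block file on the datanodes to zero length

    // steps 3/4: any client read of the file now loops forever in blockSeekTo()
    FSDataInputStream in = fs.open(target);
    IOUtils.copyBytes(in, System.out, conf, false);         // never returns on an affected cluster
    in.close();
  }
}

On an affected cluster the copyBytes() call never returns, matching the behaviour of the dfs shell "cat".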


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4291) MapReduce Streaming job hangs when all replications of the input file are corrupted

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhuGuanyin updated HADOOP-4291:
-------------------------------

    Description: 
In some special cases all replicas of a given file have been truncated to zero length while the namenode still holds the original size (we don't know why). A MapReduce streaming job will hang if mapred.task.timeout is not specified and the input files include such a corrupted file; even the dfs shell "cat" hangs when fetching data from the corrupted file.

We found that the job hangs in DFSInputStream.blockSeekTo() when choosing a datanode. The following test demonstrates it:
1)	Copy a small file to HDFS.
2)	Find the file's blocks, log in to the datanodes holding them, and truncate those block files to zero length.
3)	Cat the file through the dfs shell "cat".
4)	The cat command enters an infinite loop.


  was:
In some special cases all replicas of a given file have been truncated to zero length while the namenode still holds the original size (we don't know why). A MapReduce streaming job will hang if mapred.task.timeout is not specified and the input files include such a corrupted file; even the dfs shell "cat" hangs when fetching data from the corrupted file.

We found that the job hangs in DFSInputStream.blockSeekTo() when choosing a datanode. The following test demonstrates it:
1)	Copy a little file to HDFS.
2)	Find the file's blocks, log in to the datanodes holding them, and truncate those block files to zero length.
3)	Cat the file through the dfs shell "cat".
4)	The cat command enters an infinite loop.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4291) MapReduce Streaming job hangs when all replications of the input file are corrupted

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634852#action_12634852 ] 

ZhuGuanyin commented on HADOOP-4291:
------------------------------------

It seems that after trying all the datanodes, the client clears the dead-node list and retries, entering an infinite loop.

We added some debug code as follows:

 In DFSInputStream.blockSeekTo(): 
    
private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
    // excerpt: the declarations (s, targetBlock, chosenNode, blockReader) and the
    // socket/connect code are elided; only the node-selection retry loop and the
    // added LOG.info calls are shown
    while (s == null) {
      LOG.info("blockSeekTo step 1");
      DNAddrPair retval = chooseDataNode(targetBlock);
      LOG.info("blockSeekTo step 2");
      try {
        blockReader = BlockReader.newBlockReader(/* arguments elided */);
        return chosenNode;
      } catch (IOException ex) {
        LOG.info("blockSeekTo step 3");
        addToDeadNodes(chosenNode);
        if (s != null) {
          try {
            s.close();
          } catch (IOException iex) {
            LOG.info("blockSeekTo step 4");
          }
        }
        s = null;
        LOG.info("blockSeekTo step 5");
      }
      LOG.info("blockSeekTo step 6");
    }
    return chosenNode;
}



In DFSInputStream.chooseDataNode():
private DNAddrPair chooseDataNode(LocatedBlock block) throws IOException {
    LOG.info("chooseDataNode() step 1");
    while (true) {
      LOG.info("chooseDataNode() step 2");
      DatanodeInfo[] nodes = block.getLocations();
      try {
        LOG.info("chooseDataNode() step 3, failures = " + failures);
        DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
        LOG.info("chooseDataNode() step 4");
        InetSocketAddress targetAddr = DataNode.createSocketAddr(chosenNode.getName());
        LOG.info("chooseDataNode() step 5");
        return new DNAddrPair(chosenNode, targetAddr);
      } catch (IOException ie) {
        String blockInfo = block.getBlock() + " file=" + src;
        LOG.info("chooseDataNode() step 6, failures = " + failures);
        if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
          throw new IOException("Could not obtain block: " + blockInfo);
        }

        if (nodes == null || nodes.length == 0) {
          LOG.info("No node available for block: " + blockInfo);
        }
        LOG.info("Could not obtain block " + block.getBlock() + " from any node:  " + ie);
        try {
          Thread.sleep(3000);
        } catch (InterruptedException iex) {
        }
        LOG.info("chooseDataNode() step 7, failures = " + failures);
        deadNodes.clear(); // 2nd option is to remove only nodes[blockId]
        openInfo();
        failures++;
        LOG.info("chooseDataNode() step 8, failures = " + failures);
        continue;
      }
    }
}

After running ./hadoop dfs -cat /1.txt, we got the following output:

[test@test.baidu.com ~]$ ./hadoop fs -cat /1.txt
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: Could not obtain block blk_1225 from any node:  java.io.IOException: No live nodes contain current block

08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 7, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 8, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: Could not obtain block blk_1225 from any node:  java.io.IOException: No live nodes contain current block
.........................................................................................................
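
To make the loop in the trace easier to see, here is a minimal, self-contained sketch (plain Java, not the actual DFSClient code) of the pattern it suggests: a node is chosen, the read fails because the on-disk block is shorter than the length the namenode reports, the node is marked dead, and once every replica is dead the dead-node set is cleared and the cycle starts again. Where exactly the failure counter gets reset is an assumption; the log only shows it is back at 0 after each node choice, so MAX_BLOCK_ACQUIRE_FAILURES is never reached.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeadNodeRetrySketch {
  static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("dn1", "dn2", "dn3"); // the block's replica locations
    Set<String> deadNodes = new HashSet<String>();
    int failures = 0;

    for (int pass = 1; pass <= 12; pass++) {                 // bounded here; unbounded in the real client
      String chosen = null;
      for (String n : nodes) {                               // "bestNode": any node not in deadNodes
        if (!deadNodes.contains(n)) { chosen = n; break; }
      }
      if (chosen == null) {
        // chooseDataNode(): every replica is dead -> clear the list and retry
        failures++;
        if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
          System.out.println("would give up here, but the counter never gets this far");
          return;
        }
        System.out.println("pass " + pass + ": no live nodes, clearing deadNodes, failures=" + failures);
        deadNodes.clear();
      } else {
        // blockSeekTo(): the truncated block makes the block reader fail, so the node is marked dead
        System.out.println("pass " + pass + ": read from " + chosen + " failed, marking it dead");
        deadNodes.add(chosen);
        failures = 0;   // assumption: the counter reads 0 in the log after each node choice
      }
    }
    System.out.println("still looping after 12 passes; the real client never stops");
  }
}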






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.