Posted to mapreduce-issues@hadoop.apache.org by "Eli Collins (JIRA)" <ji...@apache.org> on 2011/08/11 20:12:31 UTC

[jira] [Moved] (MAPREDUCE-2813) Tasks freeze with "No live nodes contain current block", job takes long time to recover

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins moved HADOOP-5361 to MAPREDUCE-2813:
------------------------------------------------

    Affects Version/s:     (was: 0.21.0)
                       0.21.0
                  Key: MAPREDUCE-2813  (was: HADOOP-5361)
              Project: Hadoop Map/Reduce  (was: Hadoop Common)

> Tasks freeze with "No live nodes contain current block", job takes long time to recover
> ---------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2813
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2813
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>            Reporter: Matei Zaharia
>
> Running a recent version of trunk on 100 nodes, I occasionally see some tasks freeze at startup and hang the job. These tasks are not speculatively executed either. Here's sample output from one of them:
> {noformat}
> 2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
> 2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
> 2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
> 2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
> 2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
>     at java.io.DataInputStream.read(DataInputStream.java:83)
>     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:155)
> 2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
>     at java.io.DataInputStream.read(DataInputStream.java:83)
>     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:155)
> {noformat}
> Note how the DFS client fails multiple times to retrieve the block, with a 2-minute wait between attempts, without ever giving up. During this time, the task is *not* speculated. However, once this task finally failed, a new attempt ran successfully. Fetching the input file in question with bin/hadoop fs -get also worked fine.
> There is no mention of the task attempt in question in the NameNode logs, but my guess is that something related to RPC queues is causing its connection to be lost, and the DFSClient does not recover.
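
The behavior described above amounts to an unbounded retry loop: the client sleeps and retries indefinitely instead of surfacing the failure so the task can be rescheduled. As a rough sketch of the alternative, here is a bounded retry-with-backoff loop. This is illustrative only, not the DFSClient's actual code; the `BlockSource` interface, `fetchWithRetries` method, and all parameters are hypothetical stand-ins for the real read path.

```java
import java.io.IOException;

public class BoundedBlockFetch {
    // Hypothetical stand-in for the DFS block-read path (not a real Hadoop API).
    interface BlockSource {
        byte[] fetch(String blockId) throws IOException;
    }

    // Retry at most maxAttempts times, sleeping backoffMillis between attempts.
    // Unlike the behavior in the log above, this eventually gives up, so the
    // task fails fast and the framework can reschedule (or speculate) it.
    static byte[] fetchWithRetries(BlockSource src, String blockId,
                                   int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return src.fetch(blockId);
            } catch (IOException e) {
                last = e; // remember the most recent failure
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis); // back off before retrying
                }
            }
        }
        // Surface the failure instead of looping forever.
        throw new IOException("Could not obtain block: " + blockId, last);
    }
}
```

With a cap like this, a lost connection would cost a few bounded retries rather than hanging the task for many minutes with no speculation.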

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira