You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/05/01 19:43:47 UTC

[jira] Updated: (HADOOP-147) MapTask removed mapout files before the reduce tasks copy them

     [ http://issues.apache.org/jira/browse/HADOOP-147?page=all ]

Doug Cutting updated HADOOP-147:
--------------------------------

    Component: mapred

> MapTask removed mapout files before the reduce tasks copy them
> --------------------------------------------------------------
>
>          Key: HADOOP-147
>          URL: http://issues.apache.org/jira/browse/HADOOP-147
>      Project: Hadoop
>         Type: Bug

>   Components: mapred
>     Reporter: Runping Qi

>
> I was running a job on a cluster of 138 nodes. The job had 1050 map tasks and 128 reduce tasks. It stucked at the reduce stage.
> All the reduce tasks were trying to copy file from a map task with the following status show on the web interface:
> reduce > copy > task_m_ehz5q1@node1262.foo.com:60040
> However, the log on the machine node1262 (where the map task task_m_ehz5q1 ran) showed that the map task finished even before the 
> reduce tasks copied the map output files:
> 060417 103554 Server connection on port 60050 from 72.30.117.220: starting
> 060417 103554 task_m_ehz5q1  Client connection to 0.0.0.0:60050: starting
> 060417 103554 task_m_ehz5q1 1.0% /user/runping/runping/proj/part-00039:0+71
> 060417 103554 Task task_m_ehz5q1 is done.
> 060417 103554 parsing file:/local/hadoop/conf2/hadoop-default.xml
> ......................
> 060417 103613 parsing file:/local/hadoop/conf2/hadoop-site.xml
> 060417 103623 task_m_ehz5q1 done; removing files.
> 060417 103633 parsing file:/local/hadoop/conf2/hadoop-default.xml
> 060417 103633 parsing file:/local/hadoop/conf2/mapred-default.xml
> 060417 103633 parsing file:/local/hadoop/conf2/hadoop-site.xml
> ...........................................
> 060417 190241 SEVERE Can't open map output:/local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
> java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
>         at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
>         at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
>         at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
>         at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
>         at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
>         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
>         at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
> 060417 190241 Unknown child with bad map output: task_m_ehz5q1. Ignored.
> 060417 190241 Server handler 2 on 60040 caught: java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
> java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
>         at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
>         at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
>         at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
>         at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
>         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
>         at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
> 060417 190241 parsing file:/local/hadoop/conf2/hadoop-default.xml
> 060417 190241 parsing file:/local/hadoop/conf2/mapred-default.xml
> And the above exceptions repeated for many (not sure whether it is tru for all the reduce task) other reduce tasks.
> Another strange thing noticed from the logs.
> On another machine's log, I saw:
> 060417 190528 parsing file:/local/hadoop/conf2/hadoop-site.xml
> 060417 190528 task_r_24d8k4 copy failed: task_m_ehz5q1 from node1262.foo.com/72.30.117.220:60040
> java.io.IOException: timed out waiting for response
>         at org.apache.hadoop.ipc.Client.call(Client.java:305)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
>         at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
>         at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:110)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
> 060417 190528 task_r_24d8k4 0.11523809% reduce > copy > task_m_ehz5q1@node1262.foo.com:60040
> 060417 190528 task_r_24d8k4 Copying task_m_epatk8 output from node1387.foo.com.
>                                
> which is expected. However, before this line, 
> I saw another copy activity in the log:
> 060417 103608 parsing file:/local/hadoop/conf2/hadoop-site.xml
> 060417 103608 task_r_a4yl3t Copying task_m_ehz5q1 output from node1262.foo.com.
> 060417 103608 parsing file:/local/hadoop/conf2/hadoop-default.xml
> And the task task_r_a4yl3t does not belong to the concerned job, 
> according to the Web interface. That is strange.
> And I checked a few other machines where some reduce tasks ran, 
> and I saw the same thing.
> I suspect there was a conflict in job ID. If two jobs had the same ID, 
> when one closes, it may also mark the other as "closed" too, thus trggering map tasks
> to clean up prematurely.
> A simple way to avoid potential jobid conflict is to use sequential numbers.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira