Posted to user@hadoop.apache.org by xeon Mailinglist <xe...@gmail.com> on 2016/08/11 16:38:23 UTC

Where is the temporary output data of map or reduce tasks?

With MapReduce v2 (YARN), the output data of a map or reduce task is
saved to the local disk or to HDFS only when all the tasks finish.

Since tasks end at different times, I was expecting the data to be
written as each task finishes. For example, task 0 finishes and its output
is written, while task 1 and task 2 are still running. Then task 2 finishes
and its output is written, while task 1 is still running. Finally, task 1
finishes and the last output is written. But this does not happen. The
outputs only appear on the local disk or in HDFS when all the tasks finish.

I want to access the task output as the data is being produced. Where is
the output data before all the tasks finish?


After setting these parameters in `mapred-site.xml`:


<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>

<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>

I still can't find where the intermediate output or the final output is
saved as it is produced by the tasks.

I have listed all directories with `hdfs dfs -ls -R /`, and in the `/tmp`
dir I have only found the job configuration files:

    drwx------   - root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
    -rw-r--r--  10 root supergroup     112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
    -rw-r--r--  10 root supergroup       6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
    -rw-r--r--   1 root supergroup        797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
    -rw-r--r--   1 root supergroup      88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
    -rw-r--r--   1 root supergroup     439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
    -rw-r--r--   1 root supergroup     105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
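
To check repeated listings while the job is still running, the output of
`hdfs dfs -ls -R /` could be parsed with a small script. This is only a
rough sketch: `Entry`, `parse_listing` and `outside_prefix` are hypothetical
helper names, and it assumes the column layout of the listing above
(permissions, replication, owner, group, size, date, time, path). It also
tolerates a path that a mail client has wrapped onto a second line.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    is_dir: bool
    size: int
    modified: str  # "YYYY-MM-DD HH:MM", kept as text
    path: str


# Assumed layout of one entry: perms, replication, owner, group, size, date, time, path.
LINE = re.compile(
    r"^([d-][rwxsStT-]{9}\+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+"
    r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2})\s+(\S.*)$"
)


def parse_listing(text: str) -> List[Entry]:
    """Parse `hdfs dfs -ls -R` text into entries."""
    # A line that starts with "/" is a wrapped path: join it to the previous line.
    joined = re.sub(r"\s*\n\s*(?=/)", " ", text)
    entries = []
    for line in joined.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue  # skip blank lines and anything not in the assumed layout
        perms, _repl, _owner, _group, size, date, time, path = m.groups()
        entries.append(Entry(perms.startswith("d"), int(size), f"{date} {time}", path))
    return entries


def outside_prefix(entries: List[Entry], prefix: str) -> List[Entry]:
    """Return the entries whose path does not start with `prefix`."""
    return [e for e in entries if not e.path.startswith(prefix)]
```

Saving the listing to a file during the job and calling
`outside_prefix(parse_listing(text), "/tmp/hadoop-yarn/staging/")` would show
whether anything besides the staging files exists in HDFS at that moment.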

Where is the output saved? I am talking about the output that is stored
as it is being produced by the tasks, not the final output that appears
when all map or reduce tasks finish.