Posted to user@hadoop.apache.org by xeon Mailinglist <xe...@gmail.com> on 2016/08/11 16:38:23 UTC
Where is the temp output data of map or reduce tasks?
With MapReduce v2 (YARN), the output data that comes out of a map or a
reduce task is saved on the local disk or in HDFS only when all the tasks
finish.
Since tasks end at different times, I was expecting the data to be
written as each task finishes. For example, task 0 finishes and its output is
written, while task 1 and task 2 are still running. Then task 2 finishes and its
output is written, while task 1 is still running. Finally, task 1 finishes and
the last output is written. But this does not happen. The outputs only
appear on the local disk or in HDFS when all the tasks have finished.
I want to access the task output as the data is being produced. Where is
the output data before all the tasks finish?
Even after setting these params in `mapred-site.xml`
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
I still can't find where the intermediate output or the final output is
saved as it is produced by the tasks.
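To rule out a typo in the configuration, one could check that both properties are really present in the file. Below is a minimal Python sketch, assuming the usual `<configuration>`/`<property>` layout of `mapred-site.xml`. The helper name `read_properties` and the inline sample string are only illustrative, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return a dict of name -> value for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

# Illustrative sample; in practice pass the text of the real mapred-site.xml.
SAMPLE = """<configuration>
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
</configuration>"""

props = read_properties(SAMPLE)
for key in ("mapreduce.task.files.preserve.failedtasks",
            "mapreduce.task.files.preserve.filepattern"):
    # Print the value found in the file, or a marker if the property is missing.
    print(key, "=", props.get(key, "<not set>"))
```

This only confirms what is written in the file; it does not show which values the running job actually picked up.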
I have listed all directories with `hdfs dfs -ls -R /`, and in the `tmp` dir I
have found only the job configuration files.
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r-- 10 root supergroup 6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r-- 1 root supergroup 797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r-- 1 root supergroup 88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r-- 1 root supergroup 439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r-- 1 root supergroup 105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
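A recursive listing like this gets long quickly, so filtering it with a script may help when looking for anything that is not a staging file. Below is a minimal Python sketch, assuming each entry is on one line in the eight-column form that `hdfs dfs -ls` prints (permissions, replication, owner, group, size, date, time, path). The helper name `parse_ls_line` is hypothetical, and the sample holds two entries copied from the listing above.

```python
def parse_ls_line(line):
    """Split one `hdfs dfs -ls` entry into its columns; return None for other lines."""
    parts = line.strip().split(None, 7)
    if len(parts) < 8:
        return None  # e.g. a "Found N items" header or a blank line
    perms, repl, owner, group, size, date, time, path = parts
    return {
        "is_dir": perms.startswith("d"),
        "size": int(size),
        "modified": date + " " + time,
        "path": path,
    }

# Two entries taken from the listing; in practice feed the full command output.
LISTING = """\
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
"""

entries = [e for e in map(parse_ls_line, LISTING.splitlines()) if e is not None]

# Keep only entries that are not under a .staging directory.
outside_staging = [e for e in entries if "/.staging/" not in e["path"]]
for e in outside_staging:
    print(e["modified"], e["size"], e["path"])
print(len(entries), "entries parsed,", len(outside_staging), "outside .staging")
```

Running the same filter on listings taken repeatedly while the job is still running would show whether any new paths appear before the tasks finish.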
Where is the output saved? I am talking about the output that is stored
as it is being produced by the tasks, not the final output that appears
when all map or reduce tasks have finished.