Posted to mapreduce-issues@hadoop.apache.org by "Dick King (JIRA)" <ji...@apache.org> on 2010/07/22 22:56:54 UTC

[jira] Updated: (MAPREDUCE-1073) Progress reported for pipes tasks is incorrect.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick King updated MAPREDUCE-1073:
---------------------------------

    Attachment: MAPREDUCE-1073--yhadoop20--2010-07-22.patch

The previous versions of this attachment missed one point.

The basic problem is that, in the existing code base, progress is based on the records read from the input split, but there is buffering in the way pipes delivers records to the C++ application.  This makes tasks appear to have made more progress than they actually have, in jobs where the input splits are small.

To make speculation work under pipes with small input splits, two conditions have to be met:

1: The pipes code has to have an API to report progress, and has to use it.  The old patch met this goal.  You incant {{context.setProgress(float)}} within {{HadoopPipes::Mapper::map(HadoopPipes::MapContext& context)}} (a sketch follows point 2 below).  This does require that you have a way of measuring progress, which I consider likely, because this is only needed when the input splits are small, which implies that the "input data" is really a signal to get the real data from somewhere else [or to generate it].

2: The job has to be able to declare that the progress that would otherwise be inferred from input split reads should be ignored.  This newest version of the patch does that: you can either call {{JobConf.setRecordReaderProgressDisabled(true)}} or set the attribute {{mapred.job.disable.record.reader.progress}} to {{true}}.
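
For concreteness, here is a minimal sketch of what point 1 looks like on the C++ side, assuming the {{setProgress(float)}} method that this patch adds to the pipes context; {{SelfReportingMapper}}, {{fetchRecordCount}}, and {{fetchRecord}} are hypothetical stand-ins for however the application actually locates or generates its real data:

{code}
#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

// Hypothetical mapper for the "small input split" case: the single input
// record is only a pointer to where the real data lives, so the mapper
// fetches (or generates) that data itself and reports its own progress.
class SelfReportingMapper : public HadoopPipes::Mapper {
public:
  SelfReportingMapper(HadoopPipes::TaskContext& /*context*/) {}

  void map(HadoopPipes::MapContext& context) {
    const std::string pointer = context.getInputValue();
    const int total = fetchRecordCount(pointer);   // placeholder helper
    for (int i = 0; i < total; ++i) {
      context.emit(pointer, fetchRecord(pointer, i));
      // setProgress(float) is the mapper-side progress API this patch adds:
      // tell the framework how far through the real work we are, instead of
      // letting it infer progress from input-split reads (see point 2).
      context.setProgress((i + 1) / (float) total);
    }
  }

private:
  // Placeholders for however the application locates its real input.
  int fetchRecordCount(const std::string& /*pointer*/) { return 1000; }
  std::string fetchRecord(const std::string& pointer, int /*i*/) {
    return pointer;
  }
};

// Trivial pass-through reducer, just so the factory has something concrete.
class PassThroughReducer : public HadoopPipes::Reducer {
public:
  PassThroughReducer(HadoopPipes::TaskContext& /*context*/) {}

  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      context.emit(context.getInputKey(), context.getInputValue());
    }
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<SelfReportingMapper,
                                   PassThroughReducer>());
}
{code}

A job built around such a mapper would then also satisfy point 2 by calling {{JobConf.setRecordReaderProgressDisabled(true)}}, or by setting {{mapred.job.disable.record.reader.progress}} to {{true}}, so that the framework ignores the record-reader-based progress.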

This patch addresses the second point.  I did not mark the issue Patch Available because the patch needs a forward port.  I attached it to this issue for comments, and for the record.

> Progress reported for pipes tasks is incorrect.
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1073
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1073
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: pipes
>    Affects Versions: 0.20.1
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Dick King
>         Attachments: mapreduce-1073--2010-03-31.patch, mapreduce-1073--2010-04-06.patch, MAPREDUCE-1073--yhadoop20--2010-07-22.patch, MAPREDUCE-1073_yhadoop20.patch
>
>
> Currently in pipes, in {{org.apache.hadoop.mapred.pipes.PipesMapRunner.run(RecordReader<K1, V1>, OutputCollector<K2, V2>, Reporter)}}, we do the following:
> {code}
>         while (input.next(key, value)) {
>           downlink.mapItem(key, value);
>           if(skipping) {
>             downlink.flush();
>           }
>         }
> {code}
> This would result in consumption of all the records for the current task and taking the task's progress to 100%, whereas the actual pipes application would be trailing behind.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.