Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/05/30 03:04:15 UTC

[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499999 ] 

Runping Qi commented on HADOOP-1440:
------------------------------------


To fix the output file name problem with the -reducer NONE option, the only change needed
is the value of finalName in the constructor of class DirectMapOutputCollector in MapTask.java:

    public DirectMapOutputCollector(TaskUmbilicalProtocol umbilical,
        JobConf job, Reporter reporter) throws IOException {
      this.umbilical = umbilical;
      this.job = job;
      this.reporter = reporter;
-     String finalName = getTipId();
+     String finalName = job.get("map.input.file") + "_" + getTipId();
      FileSystem fs = FileSystem.get(this.job);

      out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);
    }

This way, the output file names will sort in the same order as the input file names.
Of course, you will then run into the problem that the file names become longer and longer.
So you actually want to control it with something like:

    String finalName = getTipId();
    // keepInputOrder and inputWasSplit are placeholders for checks that do not exist yet
    if (keepInputOrder && !inputWasSplit) {
        finalName = job.get("map.input.file");
    } else if (keepInputOrder) {
        finalName = job.get("map.input.file") + "_" + getTipId();
    }
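
The conditional above can be tried outside of MapTask with a small stand-alone sketch. Everything here is an assumption made for illustration: the class OutputNames, the method finalName, and the two boolean flags are hypothetical, and the input file name and task id are passed in as plain strings instead of coming from job.get("map.input.file") and getTipId(). The sample values in main are made up.

```java
public class OutputNames {
    /**
     * Picks the output file name for a map task that runs without a reducer.
     * keepInputOrder and inputWasSplit are hypothetical flags, not existing
     * Hadoop configuration options.
     */
    public static String finalName(String inputFile, String tipId,
                                   boolean keepInputOrder, boolean inputWasSplit) {
        if (!keepInputOrder) {
            return tipId;                 // unchanged behavior: name by task id only
        }
        if (!inputWasSplit) {
            return inputFile;             // one map per file: the input name is enough
        }
        return inputFile + "_" + tipId;   // several maps per file: append the task id
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        System.out.println(finalName("part-00003", "tip_7", true, false));
        System.out.println(finalName("part-00003", "tip_7", true, true));
        System.out.println(finalName("part-00003", "tip_7", false, false));
    }
}
```

The unsplit case is the one that keeps names from growing, since nothing is appended to the input name.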


> JobClient should not sort input-splits
> --------------------------------------
>
>                 Key: HADOOP-1440
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1440
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>         Environment: All
>            Reporter: Milind Bhandarkar
>             Fix For: 0.14.0
>
>
> Currently, the JobClient sorts the InputSplits returned by InputFormat in descending order of size, so that the map tasks corresponding to larger input-splits are scheduled for execution before smaller ones. However, this causes problems in applications that produce data-sets partitioned similarly to the input partition with -reducer NONE.
> With -reducer NONE, map task i produces part-i. However, in the typical applications that use -reducer NONE it should produce a partition that has the same index as the input partition.
> (Of course, this requires that each partition should be fed in its entirety to a map, rather than splitting it into blocks, but that is a separate issue.)
> Thus, sorting input splits should be either controllable via a configuration variable, or the FileInputFormat should sort the splits and JobClient should honor the order of splits.
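
To illustrate the mismatch described in the issue, here is a minimal sketch that does not use any Hadoop classes. The class SplitOrderDemo and the method taskOrder are hypothetical, and the split names and sizes are made up. It only shows how a descending sort by size changes which input a given map task index ends up reading.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SplitOrderDemo {
    /**
     * Returns the input names in the order in which map tasks would be
     * numbered if the splits are sorted by length, largest first.
     */
    public static List<String> taskOrder(String[] names, long[] lengths) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            indices.add(i);
        }
        // Sort split indices by descending length (stable for equal lengths).
        indices.sort(Comparator.comparingLong((Integer i) -> lengths[i]).reversed());

        List<String> ordered = new ArrayList<>();
        for (int i : indices) {
            ordered.add(names[i]);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Made-up input partitions and sizes.
        String[] names = {"part-0", "part-1", "part-2"};
        long[] lengths = {10L, 30L, 20L};

        List<String> ordered = taskOrder(names, lengths);
        for (int task = 0; task < ordered.size(); task++) {
            // With -reducer NONE, map task i writes part-i regardless of what it read.
            System.out.println("map task " + task + " reads " + ordered.get(task)
                    + " and writes part-" + task);
        }
    }
}
```

With these sizes, map task 0 reads part-1, so the output named part-0 holds the data of input part-1.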

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.