Posted to user@mahout.apache.org by Alan Said <Al...@dai-labor.de> on 2010/11/29 18:54:52 UTC

Job with multiple input paths and mappers

Hi all,
I'm trying to get https://issues.apache.org/jira/browse/MAHOUT-106 running on Mahout 0.4 and Hadoop 0.20.2. However, I'm stuck at the point where a job with multiple input paths and mappers is created, as shown in the code below.

    MultipleInputs.addInputPath(psz, new Path(sumSUQStarPath).makeQualified(fsPsz), SequenceFileInputFormat.class, Psz.PszSumSUQStarMapper.class);
    MultipleInputs.addInputPath(psz, new Path(sumUQStarPath).makeQualified(fsPsz), SequenceFileInputFormat.class, Psz.PszSumUQStarMapper.class);

    prepareJobConfWithMultipleInputs(psz,
                                         pszNextPath,
                                         VarIntWritable.class,
                                         LongFloatWritable.class,
                                         Psz.PszReducer.class,
                                         VarLongWritable.class,
                                         IntFloatWritable.class,
                                         SequenceFileOutputFormat.class);
    JobClient.runJob(psz);

I'm not quite sure how this should be written against the current APIs.
AbstractJob's current prepareJob method can handle multiple input paths via org.apache.hadoop.fs.Path, but I'm not sure how to handle the extra mapper.

Any help would be appreciated.

Thanks,
Alan



Re: Job with multiple input paths and mappers

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Alan,

AFAIK, there's no elegant solution: you have to create a single mapper that can somehow differentiate the inputs (you may need to add some kind of identifier to your data) and apply different mapping logic accordingly.
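The tag-and-dispatch idea can be sketched without any Hadoop dependencies. Everything below (the TAG_SUQ/TAG_UQ prefixes and the per-input logic) is hypothetical and only illustrates how one mapper could branch on an identifier embedded in each record, not how Mahout actually does it:

```java
public class TaggedDispatch {

    // Hypothetical tags, assumed to have been prepended to each
    // value when the two inputs were originally written out.
    static final String TAG_SUQ = "SUQ\t";
    static final String TAG_UQ  = "UQ\t";

    // Branch on the tag, then apply the input-specific mapping
    // logic to the untagged payload.
    static String map(String value) {
        if (value.startsWith(TAG_SUQ)) {
            return mapSumSUQStar(value.substring(TAG_SUQ.length()));
        } else if (value.startsWith(TAG_UQ)) {
            return mapSumUQStar(value.substring(TAG_UQ.length()));
        }
        throw new IllegalArgumentException("untagged record: " + value);
    }

    // Placeholder per-input logic; in a real job each branch would
    // emit the key/value pairs its original mapper produced.
    static String mapSumSUQStar(String payload) { return "suq:" + payload; }
    static String mapSumUQStar(String payload)  { return "uq:" + payload; }

    public static void main(String[] args) {
        System.out.println(map("SUQ\t42"));
        System.out.println(map("UQ\t7"));
    }
}
```

Inside a real Mapper you would do the same branching in map(), so a single mapper class can serve a combined input path.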

You can have a look at RecommenderJob to see how to safely build the combined input paths:

       /* necessary to make this job (having a combined input path) work on Amazon S3 */
       Configuration partialMultiplyConf = partialMultiply.getConfiguration();
       FileSystem fs = FileSystem.get(tempDirPath.toUri(), partialMultiplyConf);
       prePartialMultiplyPath1 = prePartialMultiplyPath1.makeQualified(fs);
       prePartialMultiplyPath2 = prePartialMultiplyPath2.makeQualified(fs);
       FileInputFormat.setInputPaths(partialMultiply, prePartialMultiplyPath1, prePartialMultiplyPath2);
       partialMultiply.waitForCompletion(true);

--sebastian

