Posted to dev@mahout.apache.org by "Han Hui Wen (JIRA)" <ji...@apache.org> on 2010/08/13 12:17:19 UTC

[jira] Updated: (MAHOUT-475) Replace Mapper with MultithreadedMapper to run job pairwiseSimilarity

     [ https://issues.apache.org/jira/browse/MAHOUT-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen updated MAHOUT-475:
--------------------------------

        Summary: Replace Mapper with MultithreadedMapper to run job pairwiseSimilarity (was: Replace Mapper with MultithreadedMapper to implement org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.CooccurrencesMapper)
    Description: 
CooccurrencesMapper is computationally expensive, so we could run it through
MultithreadedMapper instead of as a plain single-threaded Mapper.
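
To illustrate the idea, here is a minimal plain-Java sketch of the pattern a multithreaded mapper follows: several worker threads pull records from one shared input, run the expensive map function concurrently, and write to one shared output. It is not Hadoop code. The class MultithreadedMapSketch and the method runMap are hypothetical names made up for this example, and the squaring function only stands in for the real co-occurrence computation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical stand-in for the multithreaded-mapper pattern; not Hadoop code. */
public class MultithreadedMapSketch {

  /**
   * Applies mapFn to every record using numThreads worker threads.
   * Reads from the shared iterator and writes to the shared output list are
   * synchronized, so only the map computation itself runs in parallel.
   */
  public static <I, O> List<O> runMap(Iterator<I> records, Function<I, O> mapFn, int numThreads)
      throws InterruptedException {
    List<O> output = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int t = 0; t < numThreads; t++) {
      pool.execute(() -> {
        while (true) {
          I record;
          synchronized (records) {         // one reader at a time
            if (!records.hasNext()) {
              return;                      // input exhausted, worker exits
            }
            record = records.next();
          }
          O result = mapFn.apply(record);  // CPU-heavy work, runs concurrently
          synchronized (output) {          // one writer at a time
            output.add(result);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    return output;
  }

  public static void main(String[] args) throws InterruptedException {
    List<Integer> rows = List.of(1, 2, 3, 4, 5, 6, 7, 8);
    List<Integer> squares = runMap(rows.iterator(), x -> x * x, 4);
    System.out.println(squares.size() + " results (order is not deterministic)");
  }
}
```

Two consequences of this pattern matter for the proposal: the map function must be thread-safe, and the order in which results are emitted is no longer deterministic.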



The job setup then changes as follows.
original:
{code}
    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job pairwiseSimilarity = prepareJob(weightsPath,
                               pairwiseSimilarityPath,
                               SequenceFileInputFormat.class,
                               CooccurrencesMapper.class,
                               WeightedRowPair.class,
                               Cooccurrence.class,
                               SimilarityReducer.class,
                               SimilarityMatrixEntryKey.class,
                               MatrixEntryWritable.class,
                               SequenceFileOutputFormat.class);

      Configuration pairwiseConf = pairwiseSimilarity.getConfiguration();
      pairwiseConf.set(DISTRIBUTED_SIMILARITY_CLASSNAME, distributedSimilarityClassname);
      pairwiseConf.setInt(NUMBER_OF_COLUMNS, numberOfColumns);
      pairwiseSimilarity.waitForCompletion(true);
    }
{code}

new:
{code}
    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job pairwiseSimilarity = prepareJob(weightsPath,
                               pairwiseSimilarityPath,
                               SequenceFileInputFormat.class,
                               CooccurrencesMapper.class,
                               WeightedRowPair.class,
                               Cooccurrence.class,
                               SimilarityReducer.class,
                               SimilarityMatrixEntryKey.class,
                               MatrixEntryWritable.class,
                               SequenceFileOutputFormat.class);

      
      Configuration pairwiseConf = pairwiseSimilarity.getConfiguration();
      pairwiseConf.set(DISTRIBUTED_SIMILARITY_CLASSNAME, distributedSimilarityClassname);
      pairwiseConf.setInt(NUMBER_OF_COLUMNS, numberOfColumns);
      // MultithreadedMapper must be the job's mapper; it delegates each record to CooccurrencesMapper.
      pairwiseSimilarity.setMapperClass(MultithreadedMapper.class);
      MultithreadedMapper.setMapperClass(pairwiseSimilarity, CooccurrencesMapper.class);
      MultithreadedMapper.setNumberOfThreads(pairwiseSimilarity, numMapThreads);
      // Write the job output block-compressed with gzip.
      SequenceFileOutputFormat.setCompressOutput(pairwiseSimilarity, true);
      SequenceFileOutputFormat.setOutputCompressorClass(pairwiseSimilarity, GzipCodec.class);
      SequenceFileOutputFormat.setOutputCompressionType(pairwiseSimilarity, CompressionType.BLOCK);

      pairwiseSimilarity.waitForCompletion(true);
    }
{code}
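
The snippet above does not say where numMapThreads comes from. One possible way to choose it, sketched below, is to cap a requested value at the number of cores visible to the JVM. The class MapThreadCount and the method chooseNumMapThreads are hypothetical names for illustration only, and the requested value of 8 is an arbitrary example. On a real cluster a task JVM usually shares its node with other map slots, so a smaller value may be more appropriate.

```java
/** Hypothetical helper for picking a mapper thread count; not part of Mahout or Hadoop. */
public class MapThreadCount {

  /** Caps the requested thread count at the available cores, with a minimum of 1. */
  public static int chooseNumMapThreads(int requested, int availableCores) {
    return Math.max(1, Math.min(requested, availableCores));
  }

  public static void main(String[] args) {
    // Cores visible to this JVM; other tasks on the same node compete for them.
    int cores = Runtime.getRuntime().availableProcessors();
    int numMapThreads = chooseNumMapThreads(8, cores);
    System.out.println("cores=" + cores + ", numMapThreads=" + numMapThreads);
  }
}
```

The resulting value would then be passed to MultithreadedMapper.setNumberOfThreads as in the job setup.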

  was:
CooccurrencesMapper is computationally expensive, so we could run it through
MultithreadedMapper instead of as a plain single-threaded Mapper.

Original:
{code}
 public static class CooccurrencesMapper
      extends Mapper<VarIntWritable,WeightedOccurrenceArray,WeightedRowPair,Cooccurrence>
{code}

new:
{code}
 public static class CooccurrencesMapper
      extends MultithreadedMapper<VarIntWritable,WeightedOccurrenceArray,WeightedRowPair,Cooccurrence>
{code}

The job setup then changes as follows.
original:
{code}
    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job pairwiseSimilarity = prepareJob(weightsPath,
                               pairwiseSimilarityPath,
                               SequenceFileInputFormat.class,
                               CooccurrencesMapper.class,
                               WeightedRowPair.class,
                               Cooccurrence.class,
                               SimilarityReducer.class,
                               SimilarityMatrixEntryKey.class,
                               MatrixEntryWritable.class,
                               SequenceFileOutputFormat.class);

      Configuration pairwiseConf = pairwiseSimilarity.getConfiguration();
      pairwiseConf.set(DISTRIBUTED_SIMILARITY_CLASSNAME, distributedSimilarityClassname);
      pairwiseConf.setInt(NUMBER_OF_COLUMNS, numberOfColumns);
      pairwiseSimilarity.waitForCompletion(true);
    }
{code}

new:
{code}
    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job pairwiseSimilarity = prepareJob(weightsPath,
                               pairwiseSimilarityPath,
                               SequenceFileInputFormat.class,
                               CooccurrencesMapper.class,
                               WeightedRowPair.class,
                               Cooccurrence.class,
                               SimilarityReducer.class,
                               SimilarityMatrixEntryKey.class,
                               MatrixEntryWritable.class,
                               SequenceFileOutputFormat.class);

      Configuration pairwiseConf = pairwiseSimilarity.getConfiguration();
      pairwiseConf.set(DISTRIBUTED_SIMILARITY_CLASSNAME, distributedSimilarityClassname);
      pairwiseConf.setInt(NUMBER_OF_COLUMNS, numberOfColumns);
      CooccurrencesMapper.setNumberOfThreads(n); // n should be roughly no more than the number of cores on the machine.
      pairwiseSimilarity.waitForCompletion(true);
    }
{code}


> Replace Mapper with MultithreadedMapper to run job pairwiseSimilarity
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-475
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-475
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Han Hui Wen 
>             Fix For: 0.4
