
Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2013/06/18 22:27:20 UTC

Does RowSimilarity job support down-sampling

I was reading the RowSimilarityJob and it doesn't appear that it does
down-sampling on the original data to minimize the performance impact of
perversely prolific users.

The issue is that if a single user has 100,000 items in their history, we
learn nothing more than if we picked 300 of those, yet the former would
result in processing 10 billion cooccurrences and the latter would result
in about 100,000 (300^2 = 90,000).  This factor of roughly 100,000 is so
large that it can make a big difference in performance.
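
The counts above follow from the number of item pairs growing with the
square of a user's history length. A minimal sketch of that arithmetic,
assuming every ordered pair of items in one user's history counts as one
cooccurrence (the class and method names are made up for illustration and
are not part of Mahout):

```java
public class CooccurrenceCost {
    // Ordered item pairs (self-pairs included) generated by one user
    // whose history holds n items: n * n.
    static long pairs(long n) {
        return n * n;
    }

    public static void main(String[] args) {
        long full = pairs(100_000);  // full history of a prolific user
        long sampled = pairs(300);   // history down-sampled to 300 items
        System.out.println("full history:    " + full);
        System.out.println("sampled history: " + sampled);
        System.out.println("ratio:           " + (full / sampled));
    }
}
```

Because the cost is quadratic, capping the history length at k bounds the
per-user work at k^2 no matter how prolific the user is.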

I had thought that the code had this down-sampling in place.

If not, I can add row based down-sampling quite easily.

Re: Does RowSimilarity job support down-sampling

Posted by Dan Filimon <da...@gmail.com>.

I think you can get what you need through the --maxPrefsPerUser flag.
Any user with more preferences than that will keep only a random sample of that size.



On Jun 18, 2013, at 23:27, Ted Dunning <te...@gmail.com> wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
> 
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in about 100,000.  This factor of roughly 100,000 is so large that it can make a big
> difference in performance.
> 
> I had thought that the code had this down-sampling in place.
> 
> If not, I can add row based down-sampling quite easily.

Re: Does RowSimilarity job support down-sampling

Posted by Sebastian Schelter <ss...@apache.org>.

On 19.06.2013 01:29, Ted Dunning wrote:
> On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> We could also move the sampling directly to RowSimilarityJob if people
>> consider this more useful.
> 
> It will have a large effect on the time for the RowSimilarityJob for some
> data.

I put the sampling into PreparePreferenceMatrixJob because I considered
it to be use-case specific to recommendations.

> Does anybody have an idea about how much of the total time is in
> RowSimilarityJob?

What do you mean by total time? Compared to the rest of the jobs in
ItemSimilarityJob and RecommenderJob?

-sebastian

Re: Does RowSimilarity job support down-sampling

Posted by Ted Dunning <te...@gmail.com>.

On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter <ss...@apache.org> wrote:

> We could also move the sampling directly to RowSimilarityJob if people
> consider this more useful.
>

It will have a large effect on the time for the RowSimilarityJob for some
data.

Does anybody have an idea about how much of the total time is in
RowSimilarityJob?

Re: Does RowSimilarity job support down-sampling

Posted by Sebastian Schelter <ss...@apache.org>.

Hi,

RowSimilarityJob by itself does not do down-sampling.

The down-sampling is done by the ToItemVectorsMapper in the
PreparePreferenceMatrixJob which is responsible for preparing the inputs
(the matrix of interactions between users and items) for
ItemSimilarityJob and RecommenderJob. As Sean noted, the option
"maxPrefsPerUser" controls the sampling. By default, we use 1,000
samples per user.
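
To illustrate the kind of per-user down-sampling described above, here is
a minimal sketch using reservoir sampling to keep a fixed-size uniform
random sample of one user's preferences. It is not the code in
ToItemVectorsMapper; the class name, method name, and the use of a plain
List are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PrefSampler {
    // Hypothetical helper, not Mahout code: keeps at most maxPrefs entries
    // from one user's history, chosen uniformly at random (reservoir sampling).
    static <T> List<T> sample(List<T> prefs, int maxPrefs, Random rng) {
        if (prefs.size() <= maxPrefs) {
            return new ArrayList<>(prefs);  // small histories pass through unchanged
        }
        // Fill the reservoir with the first maxPrefs entries.
        List<T> reservoir = new ArrayList<>(prefs.subList(0, maxPrefs));
        // Entry i replaces a random reservoir slot with probability maxPrefs / (i + 1).
        for (int i = maxPrefs; i < prefs.size(); i++) {
            int j = rng.nextInt(i + 1);
            if (j < maxPrefs) {
                reservoir.set(j, prefs.get(i));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> history = new ArrayList<>();
        for (int item = 0; item < 100_000; item++) {
            history.add(item);
        }
        List<Integer> kept = sample(history, 1000, new Random(42));
        System.out.println(kept.size());  // prints 1000
    }
}
```

The same routine could be applied to a row wherever the sampling ends up
living, since it needs only one pass over the entries and memory
proportional to the sample size.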

We could also move the sampling directly to RowSimilarityJob if people
consider this more useful.

Best,
Sebastian


On 18.06.2013 22:50, Ted Dunning wrote:
> But RecommenderJob seems to call RowSimilarityJob first.  That is where
> sampling needs to be done.
> 
>       // calculate the co-occurrence matrix
>       ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
>         "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
>         "--output", similarityMatrixPath.toString(),
>         "--numberOfColumns", String.valueOf(numberOfUsers),
>         "--similarityClassname", similarityClassname,
>         "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
>         "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
>         "--threshold", String.valueOf(threshold),
>         "--tempDir", getTempPath().toString(),
>       });
> 
>       // write out the similarity matrix if the user specified that behavior
>       if (hasOption("outputPathForSimilarityMatrix")) {
>         Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));
> 
>         Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
>             SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class,
>             EntityEntityWritable.class, DoubleWritable.class,
>             ItemSimilarityJob.MostSimilarItemPairsReducer.class,
>             EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class);
> 
>         Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
>         mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
>             new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
>         mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
>         outputSimilarityMatrix.waitForCompletion(true);
>       }
>     }
> 
> 
> 
> 
> On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen <sr...@gmail.com> wrote:
> 
>> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
>> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>>
>> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>> Ahh... only effective in RecommenderJob.
>>
>

Re: Does RowSimilarity job support down-sampling

Posted by Ted Dunning <te...@gmail.com>.

But RecommenderJob seems to call RowSimilarityJob first.  That is where
sampling needs to be done.

      // calculate the co-occurrence matrix
      ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
        "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
        "--output", similarityMatrixPath.toString(),
        "--numberOfColumns", String.valueOf(numberOfUsers),
        "--similarityClassname", similarityClassname,
        "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
        "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
        "--threshold", String.valueOf(threshold),
        "--tempDir", getTempPath().toString(),
      });

      // write out the similarity matrix if the user specified that behavior
      if (hasOption("outputPathForSimilarityMatrix")) {
        Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));

        Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
            SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class,
            EntityEntityWritable.class, DoubleWritable.class,
            ItemSimilarityJob.MostSimilarItemPairsReducer.class,
            EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class);

        Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
        mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
            new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
        mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
        outputSimilarityMatrix.waitForCompletion(true);
      }
    }




On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen <sr...@gmail.com> wrote:

> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>
> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Ahh... only effective in RecommenderJob.
>

Re: Does RowSimilarity job support down-sampling

Posted by Sean Owen <sr...@gmail.com>.

No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning <te...@gmail.com> wrote:
> Ahh... only effective in RecommenderJob.

Re: Does RowSimilarity job support down-sampling

Posted by Ted Dunning <te...@gmail.com>.

Ahh... only effective in RecommenderJob.




On Tue, Jun 18, 2013 at 10:40 PM, Ted Dunning <te...@gmail.com> wrote:

> My recollection as well.
>
> I will read the code again.  Didn't see where that happens.
>
>
> On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> This is the "maxPrefsPerUser" option IIRC.
>>
>> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > I was reading the RowSimilarityJob and it doesn't appear that it does
>> > down-sampling on the original data to minimize the performance impact of
>> > perversely prolific users.
>> >
>> > The issue is that if a single user has 100,000 items in their history,
>> we
>> > learn nothing more than if we picked 300 of those while the former would
>> > result in processing 10 billion cooccurrences and the latter would
>> result
>> > in about 100,000.  This factor of roughly 100,000 is so large that it can make a big
>> > difference in performance.
>> >
>> > I had thought that the code had this down-sampling in place.
>> >
>> > If not, I can add row based down-sampling quite easily.
>>
>
>

Re: Does RowSimilarity job support down-sampling

Posted by Ted Dunning <te...@gmail.com>.

My recollection as well.

I will read the code again.  Didn't see where that happens.


On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen <sr...@gmail.com> wrote:

> This is the "maxPrefsPerUser" option IIRC.
>
> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > I was reading the RowSimilarityJob and it doesn't appear that it does
> > down-sampling on the original data to minimize the performance impact of
> > perversely prolific users.
> >
> > The issue is that if a single user has 100,000 items in their history, we
> > learn nothing more than if we picked 300 of those while the former would
> > result in processing 10 billion cooccurrences and the latter would result
> > in about 100,000.  This factor of roughly 100,000 is so large that it can make a big
> > difference in performance.
> >
> > I had thought that the code had this down-sampling in place.
> >
> > If not, I can add row based down-sampling quite easily.
>

Re: Does RowSimilarity job support down-sampling

Posted by Sean Owen <sr...@gmail.com>.

This is the "maxPrefsPerUser" option IIRC.

On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning <te...@gmail.com> wrote:
> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
>
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in about 100,000.  This factor of roughly 100,000 is so large that it can make a big
> difference in performance.
>
> I had thought that the code had this down-sampling in place.
>
> If not, I can add row based down-sampling quite easily.