Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2011/08/13 11:31:45 UTC
RowSimilarityJob improvements
Hi there,
I've invested a lot of work into improving RowSimilarityJob recently and I'm
writing to get some feedback on my intermediate results. I also dug out
an old mailing list discussion about the matrix form of item-based
recommendation that helped me a lot; I recommend reading it to anyone
who wants to dive into the details here:
http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
I managed to rework the algorithm as proposed in that mailing list
discussion. The main goal was to drastically reduce the amount of
data that needs to be shipped over the network in the cooccurrence
computation phase. This is accomplished by making heavy use of a
combiner for all supported similarity measures, by not attaching the
"norms" of the rows to the cooccurrence pairs anymore and by emitting
cooccurrence "stripes" instead of cooccurrence pairs.
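To illustrate what I mean by "stripes", here is a simplified in-memory sketch (plain Java, not the actual MapReduce code; the class and method names are made up for illustration). For each item in a user vector we emit one map of partial aggregates over its cooccurring items instead of one record per pair, and stripes for the same item can be merged by element-wise summation, which is exactly the property a Combiner needs:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, in-memory illustration of the "stripes" pattern.
class Stripes {

  // Build the stripes for one user vector (parallel arrays of item ids and
  // rating values); the aggregate here is the product, as used for cosine.
  static Map<Integer, Map<Integer, Double>> stripesFor(int[] items, double[] values) {
    Map<Integer, Map<Integer, Double>> stripes = new HashMap<>();
    for (int a = 0; a < items.length; a++) {
      Map<Integer, Double> stripe = stripes.computeIfAbsent(items[a], k -> new HashMap<>());
      for (int b = 0; b < items.length; b++) {
        if (a != b) {
          stripe.merge(items[b], values[a] * values[b], Double::sum);
        }
      }
    }
    return stripes;
  }

  // Combiner step: merge two stripes for the same item by summing per dimension.
  static Map<Integer, Double> combine(Map<Integer, Double> s1, Map<Integer, Double> s2) {
    Map<Integer, Double> merged = new HashMap<>(s1);
    s2.forEach((item, value) -> merged.merge(item, value, Double::sum));
    return merged;
  }
}
```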
A similarity measure must now be expressed by four simple functions that
are invoked in different stages of the algorithm:
normalize(row) (invoked once at the beginning, can transform the input
row vectors)
norm(row) (invoked once at the beginning, maps an input vector to a double)
aggregate(nonZeroValueInDimensionXFromRowA,nonZeroValueInDimensionXFromRowB)
(invoked for each pair of cooccurring elements of two input vectors, can
map them to a double, these doubles are summed then)
similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
(invoked at the end with the results of the other functions, should
compute the final similarity)
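To make the contract concrete, here's a rough sketch as a Java interface, with the simplest measure (cooccurrence count / overlap) filled in. The names are illustrative only, not the actual Mahout classes:

```java
// Sketch of the four-function contract described above; illustrative names.
interface VectorSimilarityMeasure {
  double[] normalize(double[] row);               // may transform the input row
  double norm(double[] row);                      // condenses a row to one double
  double aggregate(double valueA, double valueB); // per cooccurring non-zero pair
  double similarity(double summedAggregations, double normA,
                    double normB, int numColumns);
}

// Cooccurrence count / overlap: aggregate just counts, and similarity
// passes the summed count through unchanged.
class Overlap implements VectorSimilarityMeasure {
  public double[] normalize(double[] row) { return row; }
  public double norm(double[] row) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0) nonZeros++;
    }
    return nonZeros;
  }
  public double aggregate(double valueA, double valueB) { return 1.0; }
  public double similarity(double summedAggregations, double normA,
                           double normB, int numColumns) {
    return summedAggregations;
  }
}
```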
A short list of how our commonly used measures can be expressed this way:
cosine:
normalize(...) normalizes the vectors to unit length
aggregate(...) multiplies valueA and valueB
similarity(...) just needs to return the summedAggregations (which is
the dot-product)
pearson:
same as cosine, but normalize(...) centers the vector elements first
(without touching 0s as this would make the input matrix dense)
cooccurrence-count / overlap:
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations only
tanimoto/jaccard:
norm(...) returns the l0 norm (number of non-zero elements)
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations / (normOfRowA + normOfRowB -
summedAggregations);
loglikelihood:
same as tanimoto, but we compute LLR in similarity(...) from
summedAggregations, normOfRowA, normOfRowB, numColumns
cityblock/manhattan:
same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB
- 2 * summedAggregations);
euclidean distance:
norm(...) returns the summed squares of the input vector
aggregate(...) multiplies valueA and valueB
similarity(...) returns 1.0 - 1.0 / (1.0 + Math.sqrt(normA - 2 *
summedAggregations + normB))
Unfortunately, we can't use the formula from
o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which
also incorporates the number of "overlapping" dimensions.
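As an illustration, the two count-based measures above can be sketched in plain Java (illustrative names, not actual Mahout classes). The loglikelihood part uses the standard 2x2 contingency-table formulation, reconstructing the table from summedAggregations, the two norms, and numColumns:

```java
// Sketch of tanimoto/jaccard and loglikelihood in terms of the four functions.
class CountBasedMeasures {

  // norm(row): the "l0 norm", i.e. the number of non-zero elements.
  static double numNonZeros(double[] row) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0) nonZeros++;
    }
    return nonZeros;
  }

  // aggregate(a, b) for both measures just counts cooccurrences.
  static double aggregate(double valueA, double valueB) { return 1.0; }

  // tanimoto: |A intersect B| / |A union B|
  static double tanimoto(double summedAggregations, double normA, double normB) {
    return summedAggregations / (normA + normB - summedAggregations);
  }

  // loglikelihood: build the 2x2 contingency table from the same inputs.
  static double logLikelihood(double summed, double normA, double normB, int numColumns) {
    double k11 = summed;                               // dims where both rows are non-zero
    double k12 = normA - summed;                       // dims where only row A is non-zero
    double k21 = normB - summed;                       // dims where only row B is non-zero
    double k22 = numColumns - normA - normB + summed;  // dims where neither is non-zero
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
  }

  static double xLogX(double x) { return x == 0.0 ? 0.0 : x * Math.log(x); }

  static double entropy(double... counts) {
    double sum = 0.0;
    double sumOfXLogX = 0.0;
    for (double c : counts) {
      sum += c;
      sumOfXLogX += xLogX(c);
    }
    return xLogX(sum) - sumOfXLogX;
  }
}
```

Note that for statistically independent rows the LLR comes out as zero, which is a handy sanity check.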
I'm currently running tests on a small cluster using several rating
datasets (18M music ratings from lastfm, 100M movie ratings from
netflix, 250M music ratings from yahoo's KDD cup), results are pretty ok
so far, but I'm still searching for ways to improve the algorithm.
One thing I'm currently looking into is how to sample the input. Ted has
stated that you usually only need to look at a few hundred or thousand
ratings per item as you don't learn anything new from the rest. Would it
be sufficient to randomly sample the ratings of an item then? That's
what I'm currently doing but I wonder whether there are more clever ways
to do this.
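What I'm currently doing amounts to a standard per-item reservoir sample, roughly like this (in-memory sketch; the maxPrefs cap is a parameter I made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative per-item reservoir sample: keep at most maxPrefs ratings per
// item, with every incoming rating having an equal chance of being retained.
class RatingSampler {
  static List<Double> sample(List<Double> ratings, int maxPrefs, Random random) {
    List<Double> reservoir = new ArrayList<>(maxPrefs);
    for (int i = 0; i < ratings.size(); i++) {
      if (reservoir.size() < maxPrefs) {
        reservoir.add(ratings.get(i));
      } else {
        // pick a uniform index over all ratings seen so far; replace if it
        // falls inside the reservoir, so each rating survives with equal odds
        int j = random.nextInt(i + 1);
        if (j < maxPrefs) {
          reservoir.set(j, ratings.get(i));
        }
      }
    }
    return reservoir;
  }
}
```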
There is a sampling heuristic in the distributed recommender that caught
my attention a long time ago; can anyone give any details on it? I
basically understand what it's doing, but not exactly why that's a good
way to sample. Here's the original implementation:
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
Another issue I'm looking into is whether it would help to specify a
similarity threshold. There is some work on "All Pairs Similarity
Search" which is closely related to cooccurrence computation that seems
to have some nice pruning strategies based on a threshold.
For example, imagine your similarity measure is cooccurrence count and
you only want pairs with a count of at least 5; then you would not need
to consider any pair where one of the vectors has fewer than 5 non-zero
entries.
Something like this could be a way to further improve the algorithm's
running time, especially as the resulting similarity values seem to
follow a heavy-tailed distribution. It should be possible to find
pruning heuristics for all measures, though I don't yet have an idea how
to find a good a-priori threshold.
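For the cooccurrence-count case the pruning rule is trivial to sketch (a hypothetical helper, just to make the a-priori bound explicit: a row with fewer than threshold non-zeros can never reach a count of threshold with any partner row):

```java
// Illustrative pruning for the cooccurrence-count measure: a row can cooccur
// with any other row at most (number of its non-zeros) times, so rows below
// the threshold can be skipped before pair generation.
class ThresholdPruning {
  static boolean canReachThreshold(double[] row, int threshold) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0 && ++nonZeros >= threshold) {
        return true; // early exit: the bound is already satisfiable
      }
    }
    return false;
  }
}
```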
I'm thinking about writing a small paper about my work on this (still
need to see if I can "sell" it to my professor :) ). I hope everyone who
contributed to the discussion here is ok with that, I don't want to
"steal" ideas.
--sebastian
On 20.07.2011 17:40, Ted Dunning wrote:
> Not if you are doing Euclidean distance.
>
> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <ji...@apache.org> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
>>
>> Grant Ingersoll commented on MAHOUT-767:
>> ----------------------------------------
>>
>> Why do we need the norms? Why can't we assume the user has already
>> normalized?
>>
>>> Improve RowSimilarityJob performance for count-based distance measures
>>> ----------------------------------------------------------------------
>>>
>>> Key: MAHOUT-767
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Reporter: Grant Ingersoll
>>> Fix For: 0.6
>>>
>>>
>>> (See
>> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>> Currently, the RowSimilarityJob defers the calculation of the similarity
>> metric until the reduce phase, while emitting many Cooccurrence objects.
>> For similarity metrics that are algebraic (
>> http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
>> be able to do much of the computation during the Mapper part of this phase
>> and also take advantage of a Combiner.
>>> We should use a marker interface to know whether a similarity metric is
>> algebraic and then make use of an appropriate Mapper implementation,
>> otherwise we can fall back on our existing implementation.
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>
>
Re: RowSimilarityJob improvements
Posted by Ted Dunning <te...@gmail.com>.
I have heard various arguments that favor retaining the most recent interactions, or a fair sample, or the earliest interactions. These can even be combined with biased samples. I haven't seen much difference between these approaches. I think the lack of difference is largely due to the fact that the sampling falls most heavily on items that we care very little about in recommendations, since they are the most popular items that are obviously getting plenty of traffic anyway.
Sent from my iPad
On Aug 13, 2011, at 2:31 AM, Sebastian Schelter <ss...@apache.org> wrote:
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing but I wonder whether there are more clever ways to do this.
Re: RowSimilarityJob improvements
Posted by Grant Ingersoll <gs...@apache.org>.
+1
On Aug 13, 2011, at 3:04 PM, Ted Dunning wrote:
> Frankly you are doing the work. You should take the credit. Ideas without elaboration and testing deserve a little attribution but authorship credit should go to the ones doing the real work.
>
> Sent from my iPad
>
> On Aug 13, 2011, at 3:25 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>>
>>> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>>>
>>
>> Well, you can always be the primary and add others. :-)
--------------------------
Grant Ingersoll
Re: RowSimilarityJob improvements
Posted by Ted Dunning <te...@gmail.com>.
Frankly you are doing the work. You should take the credit. Ideas without elaboration and testing deserve a little attribution but authorship credit should go to the ones doing the real work.
Sent from my iPad
On Aug 13, 2011, at 3:25 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>>
>
> Well, you can always be the primary and add others. :-)
Re: RowSimilarityJob improvements
Posted by Grant Ingersoll <gs...@apache.org>.
Awesome! Some thoughts inline.
On Aug 13, 2011, at 5:31 AM, Sebastian Schelter wrote:
> Hi there,
>
> I've invested a lot of work into improving RowSimilarityJob recently and I'm writing to get some feedback on my intermediate results. I also dug out an old mailing list discussion about the matrix form of item-based recommendation that helped me a lot; I recommend reading it to anyone who wants to dive into the details here:
>
> http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
>
> I managed to rework the algorithm as proposed in that mailing list discussion. The main goal was to drastically reduce the amount of data that needs to be shipped over the network in the cooccurrence computation phase. This is accomplished by making heavy use of a combiner for all supported similarity measures, by not attaching the "norms" of the rows to the cooccurrence pairs anymore and by emitting cooccurrence "stripes" instead of cooccurrence pairs.
>
> A similarity measure must now be expressed by four simple functions that are invoked in different stages of the algorithm:
>
> normalize(row) (invoked once at the beginning, can transform the input row vectors)
>
> norm(row) (invoked once at the beginning, maps an input vector to a double)
>
> aggregate(nonZeroValueInDimensionXFromRowA,nonZeroValueInDimensionXFromRowB) (invoked for each pair of cooccurring elements of two input vectors, can map them to a double, these doubles are summed then)
>
> similarity(summedAggregations,normOfRowA,normOfRowB,numColumns)
> (invoked at the end with the results of the other functions, should compute the final similarity)
>
> A short list of how our commonly used measures can be expressed this way:
>
> cosine:
>
> normalize(...) normalizes the vectors to unit length
> aggregate(...) multiplies valueA and valueB
> similarity(...) just needs to return the summedAggregations (which is the dot-product)
>
> pearson:
>
> same as cosine, but normalize(...) centers the vector elements first (without touching 0s as this would make the input matrix dense)
>
> cooccurrence-count / overlap:
>
> aggregate(...) returns 1 as we only need to count
> similarity(...) returns summedAggregations only
>
> tanimoto/jaccard:
>
> norm(...) returns the l0 norm (number of non-zero elements)
> aggregate(...) returns 1 as we only need to count
> similarity(...) returns summedAggregations / (normOfRowA + normOfRowB - summedAggregations);
>
> loglikelihood:
>
> same as tanimoto, but we compute LLR in similarity(...) from summedAggregations, normOfRowA, normOfRowB, numColumns
>
>
> cityblock/manhattan:
>
> same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB - 2 * summedAggregations);
>
> euclidean distance:
>
> norm(...) returns the summed squares of the input vector
> aggregate(...) multiplies valueA and valueB
> similarity(...) returns 1.0 - 1.0 / (1.0 + Math.sqrt(normA - 2 * summedAggregations + normB))
>
> Unfortunately, we can't use the formula from o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which also incorporates the number of "overlapping" dimensions.
>
>
> I'm currently running tests on a small cluster using several rating datasets (18M music ratings from lastfm, 100M movie ratings from netflix, 250M music ratings from yahoo's KDD cup), results are pretty ok so far, but I'm still searching for ways to improve the algorithm.
>
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing but I wonder whether there are more clever ways to do this.
I assume this will be an option, right? Row similarity obviously goes beyond just recommenders, and sampling shouldn't be required.
>
> There is a sampling heuristic in the distributed recommender that caught my attention a long time ago; can anyone give any details on it? I basically understand what it's doing, but not exactly why that's a good way to sample. Here's the original implementation:
>
> http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
>
> Another issue I'm looking into is whether it would help to specify a similarity threshold. There is some work on "All Pairs Similarity Search" which is closely related to cooccurrence computation that seems to have some nice pruning strategies based on a threshold.
+1
>
> For example, imagine your similarity measure is cooccurrence count and you only want pairs with a count of at least 5; then you would not need to consider any pair where one of the vectors has fewer than 5 non-zero entries.
>
> Something like this could be a way to further improve the algorithm's running time, especially as the resulting similarity values seem to follow a heavy-tailed distribution. It should be possible to find pruning heuristics for all measures, though I don't yet have an idea how to find a good a-priori threshold.
I could also see pruning at the very end on the final distance score, which would potentially save some IO.
>
> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>
Well, you can always be the primary and add others. :-)
>
> --sebastian
>
> On 20.07.2011 17:40, Ted Dunning wrote:
>> Not if you are doing Euclidean distance.
>>
>> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <ji...@apache.org> wrote:
>>
>>>
>>> [
>>> https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
>>>
>>> Grant Ingersoll commented on MAHOUT-767:
>>> ----------------------------------------
>>>
>>> Why do we need the norms? Why can't we assume the user has already
>>> normalized?
>>>
>>>> Improve RowSimilarityJob performance for count-based distance measures
>>>> ----------------------------------------------------------------------
>>>>
>>>> Key: MAHOUT-767
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>>> Project: Mahout
>>>> Issue Type: Improvement
>>>> Reporter: Grant Ingersoll
>>>> Fix For: 0.6
>>>>
>>>>
>>>> (See
>>> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>>> Currently, the RowSimilarityJob defers the calculation of the similarity
>>> metric until the reduce phase, while emitting many Cooccurrence objects.
>>> For similarity metrics that are algebraic (
>>> http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
>>> be able to do much of the computation during the Mapper part of this phase
>>> and also take advantage of a Combiner.
>>>> We should use a marker interface to know whether a similarity metric is
>>> algebraic and then make use of an appropriate Mapper implementation,
>>> otherwise we can fall back on our existing implementation.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>>
>>
>
--------------------------
Grant Ingersoll