Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2011/08/13 11:31:45 UTC
RowSimilarityJob improvements
Hi there,
I've invested a lot of work into improving RowSimilarityJob recently and I'm
writing to get some feedback on my intermediate results. I also dug out
an old mailing list discussion about the matrix form of item-based
recommendation that helped me a lot; I recommend reading it to anyone
who wants to dive into the details here:
http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
I managed to rework the algorithm as proposed in that mailing list
discussion. The main goal was to drastically reduce the amount of
data that needs to be shipped over the network in the cooccurrence
computation phase. This is accomplished by making heavy use of a
combiner for all supported similarity measures, by not attaching the
"norms" of the rows to the cooccurrence pairs anymore and by emitting
cooccurrence "stripes" instead of cooccurrence pairs.
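To illustrate what I mean by "stripes", here is a simplified in-memory sketch (plain Java, not the actual MapReduce code; the class and method names are made up for illustration). For each item in a user vector we emit one map of partial aggregates over its cooccurring items instead of one record per pair, and stripes for the same item can be merged by element-wise summation, which is exactly the property a Combiner needs:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, in-memory illustration of the "stripes" pattern.
class Stripes {

  // Build the stripes for one user vector (parallel arrays of item ids and
  // rating values); the aggregate here is the product, as used for cosine.
  static Map<Integer, Map<Integer, Double>> stripesFor(int[] items, double[] values) {
    Map<Integer, Map<Integer, Double>> stripes = new HashMap<>();
    for (int a = 0; a < items.length; a++) {
      Map<Integer, Double> stripe = stripes.computeIfAbsent(items[a], k -> new HashMap<>());
      for (int b = 0; b < items.length; b++) {
        if (a != b) {
          stripe.merge(items[b], values[a] * values[b], Double::sum);
        }
      }
    }
    return stripes;
  }

  // Combiner step: merge two stripes for the same item by summing per dimension.
  static Map<Integer, Double> combine(Map<Integer, Double> s1, Map<Integer, Double> s2) {
    Map<Integer, Double> merged = new HashMap<>(s1);
    s2.forEach((item, value) -> merged.merge(item, value, Double::sum));
    return merged;
  }
}
```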
A similarity measure must now be expressed by four simple functions that
are invoked in different stages of the algorithm:
normalize(row) (invoked once at the beginning, can transform the input
row vectors)
norm(row) (invoked once at the beginning, maps an input vector to a double)
aggregate(nonZeroValueInDimensionXFromRowA,nonZeroValueInDimensionXFromRowB)
(invoked for each pair of cooccurring elements of two input vectors, can
map them to a double, these doubles are summed then)
similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
(invoked at the end with the results of the other functions, should
compute the final similarity)
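To make the contract concrete, here's a rough sketch as a Java interface, with the simplest measure (cooccurrence count / overlap) filled in. The names are illustrative only, not the actual Mahout classes:

```java
// Sketch of the four-function contract described above; illustrative names.
interface VectorSimilarityMeasure {
  double[] normalize(double[] row);               // may transform the input row
  double norm(double[] row);                      // condenses a row to one double
  double aggregate(double valueA, double valueB); // per cooccurring non-zero pair
  double similarity(double summedAggregations, double normA,
                    double normB, int numColumns);
}

// Cooccurrence count / overlap: aggregate just counts, and similarity
// passes the summed count through unchanged.
class Overlap implements VectorSimilarityMeasure {
  public double[] normalize(double[] row) { return row; }
  public double norm(double[] row) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0) nonZeros++;
    }
    return nonZeros;
  }
  public double aggregate(double valueA, double valueB) { return 1.0; }
  public double similarity(double summedAggregations, double normA,
                           double normB, int numColumns) {
    return summedAggregations;
  }
}
```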
A short list of how our commonly used measures can be expressed this way:
cosine:
normalize(...) normalizes the vectors to unit length
aggregate(...) multiplies valueA and valueB
similarity(...) just needs to return the summedAggregations (which is
the dot-product)
pearson:
same as cosine, but normalize(...) centers the vector elements first
(without touching 0s as this would make the input matrix dense)
cooccurrence-count / overlap:
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations only
tanimoto/jaccard:
norm(...) returns the l0 norm (number of non-zero elements)
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations / (normOfRowA + normOfRowB -
summedAggregations);
loglikelihood:
same as tanimoto, but we compute LLR in similarity(...) from
summedAggregations, normOfRowA, normOfRowB, numColumns
cityblock/manhattan:
same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB
- 2 * summedAggregations);
euclidean distance:
norm(...) returns the summed squares of the input vector
aggregate(...) multiplies valueA and valueB
similarity(...) returns 1.0 - 1.0 / (1.0 + Math.sqrt(normA - 2 *
summedAggregations + normB))
Unfortunately, we can't use the formula from
o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which
also incorporates the number of "overlapping" dimensions.
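As an illustration, the two count-based measures above can be sketched in plain Java (illustrative names, not actual Mahout classes). The loglikelihood part uses the standard 2x2 contingency-table formulation, reconstructing the table from summedAggregations, the two norms, and numColumns:

```java
// Sketch of tanimoto/jaccard and loglikelihood in terms of the four functions.
class CountBasedMeasures {

  // norm(row): the "l0 norm", i.e. the number of non-zero elements.
  static double numNonZeros(double[] row) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0) nonZeros++;
    }
    return nonZeros;
  }

  // aggregate(a, b) for both measures just counts cooccurrences.
  static double aggregate(double valueA, double valueB) { return 1.0; }

  // tanimoto: |A intersect B| / |A union B|
  static double tanimoto(double summedAggregations, double normA, double normB) {
    return summedAggregations / (normA + normB - summedAggregations);
  }

  // loglikelihood: build the 2x2 contingency table from the same inputs.
  static double logLikelihood(double summed, double normA, double normB, int numColumns) {
    double k11 = summed;                               // dims where both rows are non-zero
    double k12 = normA - summed;                       // dims where only row A is non-zero
    double k21 = normB - summed;                       // dims where only row B is non-zero
    double k22 = numColumns - normA - normB + summed;  // dims where neither is non-zero
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
  }

  static double xLogX(double x) { return x == 0.0 ? 0.0 : x * Math.log(x); }

  static double entropy(double... counts) {
    double sum = 0.0;
    double sumOfXLogX = 0.0;
    for (double c : counts) {
      sum += c;
      sumOfXLogX += xLogX(c);
    }
    return xLogX(sum) - sumOfXLogX;
  }
}
```

Note that for statistically independent rows the LLR comes out as zero, which is a handy sanity check.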
I'm currently running tests on a small cluster using several rating
datasets (18M music ratings from lastfm, 100M movie ratings from
netflix, 250M music ratings from yahoo's KDD cup), results are pretty ok
so far, but I'm still searching for ways to improve the algorithm.
One thing I'm currently looking into is how to sample the input. Ted has
stated that you usually only need to look at a few hundred or thousand
ratings per item as you don't learn anything new from the rest. Would it
be sufficient to randomly sample the ratings of an item then? That's
what I'm currently doing but I wonder whether there are more clever ways
to do this.
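What I'm currently doing amounts to a standard per-item reservoir sample, roughly like this (in-memory sketch; the maxPrefs cap is a parameter I made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative per-item reservoir sample: keep at most maxPrefs ratings per
// item, with every incoming rating having an equal chance of being retained.
class RatingSampler {
  static List<Double> sample(List<Double> ratings, int maxPrefs, Random random) {
    List<Double> reservoir = new ArrayList<>(maxPrefs);
    for (int i = 0; i < ratings.size(); i++) {
      if (reservoir.size() < maxPrefs) {
        reservoir.add(ratings.get(i));
      } else {
        // pick a uniform index over all ratings seen so far; replace if it
        // falls inside the reservoir, so each rating survives with equal odds
        int j = random.nextInt(i + 1);
        if (j < maxPrefs) {
          reservoir.set(j, ratings.get(i));
        }
      }
    }
    return reservoir;
  }
}
```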
There is a sampling heuristic in the distributed recommender that caught
my attention a long time ago; can anyone give any details on it? I
basically understand what it's doing, but not exactly why that's a good
way to sample. Here's the original implementation:
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
Another issue I'm looking into is whether it would help to specify a
similarity threshold. There is some work on "All Pairs Similarity
Search" which is closely related to cooccurrence computation that seems
to have some nice pruning strategies based on a threshold.
For example, imagine your similarity measure is cooccurrence count and
you only want pairs with a count of at least 5; then you would not need
to consider any pair where one of the vectors has fewer than 5 non-zero
entries.
Something like this could be a way to further improve the algorithm's
running time, especially as the resulting similarity values seem to
follow a heavy-tailed distribution. It should be possible to find
pruning heuristics for all measures, though I don't yet have an idea how
to find a good a-priori threshold.
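For the cooccurrence-count case the pruning rule is trivial to sketch (a hypothetical helper, just to make the a-priori bound explicit: a row with fewer than threshold non-zeros can never reach a count of threshold with any partner row):

```java
// Illustrative pruning for the cooccurrence-count measure: a row can cooccur
// with any other row at most (number of its non-zeros) times, so rows below
// the threshold can be skipped before pair generation.
class ThresholdPruning {
  static boolean canReachThreshold(double[] row, int threshold) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0.0 && ++nonZeros >= threshold) {
        return true; // early exit: the bound is already satisfiable
      }
    }
    return false;
  }
}
```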
I'm thinking about writing a small paper about my work on this (still
need to see if I can "sell" it to my professor :) ). I hope everyone who
contributed to the discussion here is ok with that, I don't want to
"steal" ideas.
--sebastian
On 20.07.2011 17:40, Ted Dunning wrote:
> Not if you are doing Euclidean distance.
>
> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <ji...@apache.org> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
>>
>> Grant Ingersoll commented on MAHOUT-767:
>> ----------------------------------------
>>
>> Why do we need the norms? Why can't we assume the user has already
>> normalized?
>>
>>> Improve RowSimilarityJob performance for count-based distance measures
>>> ----------------------------------------------------------------------
>>>
>>> Key: MAHOUT-767
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Reporter: Grant Ingersoll
>>> Fix For: 0.6
>>>
>>>
>>> (See
>> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>> Currently, the RowSimilarityJob defers the calculation of the similarity
>> metric until the reduce phase, while emitting many Cooccurrence objects.
>> For similarity metrics that are algebraic (
>> http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
>> be able to do much of the computation during the Mapper part of this phase
>> and also take advantage of a Combiner.
>>> We should use a marker interface to know whether a similarity metric is
>> algebraic and then make use of an appropriate Mapper implementation,
>> otherwise we can fall back on our existing implementation.
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>
>
Re: RowSimilarityJob improvements
Posted by Ted Dunning <te...@gmail.com>.
I have heard various arguments that favor retaining the most recent interactions, or a fair sample, or the earliest interactions. These can even be combined with biased samples. I haven't seen much difference between these approaches. I think the lack of difference is largely due to the fact that the sampling falls most heavily on items that we care very little about in recommendations, since they are the most popular items that are obviously getting plenty of traffic anyway.
Sent from my iPad
On Aug 13, 2011, at 2:31 AM, Sebastian Schelter <ss...@apache.org> wrote:
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing but I wonder whether there are more clever ways to do this.
Re: RowSimilarityJob improvements
Posted by Grant Ingersoll <gs...@apache.org>.
+1
On Aug 13, 2011, at 3:04 PM, Ted Dunning wrote:
> Frankly you are doing the work. You should take the credit. Ideas without elaboration and testing deserve a little attribution but authorship credit should go to the ones doing the real work.
>
> Sent from my iPad
>
> On Aug 13, 2011, at 3:25 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>>
>>> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>>>
>>
>> Well, you can always be the primary and add others. :-)
--------------------------
Grant Ingersoll
Re: RowSimilarityJob improvements
Posted by Ted Dunning <te...@gmail.com>.
Frankly you are doing the work. You should take the credit. Ideas without elaboration and testing deserve a little attribution but authorship credit should go to the ones doing the real work.
Sent from my iPad
On Aug 13, 2011, at 3:25 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>>
>
> Well, you can always be the primary and add others. :-)
Re: RowSimilarityJob improvements
Posted by Grant Ingersoll <gs...@apache.org>.
Awesome! Some thoughts inline.
On Aug 13, 2011, at 5:31 AM, Sebastian Schelter wrote:
> Hi there,
>
> I've invested a lot of work into improving RowSimilarityJob recently and I'm writing to get some feedback on my intermediate results. I also dug out an old mailing list discussion about the matrix form of item-based recommendation that helped me a lot; I recommend reading it to anyone who wants to dive into the details here:
>
> http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
>
> I managed to rework the algorithm as proposed in that mailing list discussion. The main goal was to drastically reduce the amount of data that needs to be shipped over the network in the cooccurrence computation phase. This is accomplished by making heavy use of a combiner for all supported similarity measures, by not attaching the "norms" of the rows to the cooccurrence pairs anymore and by emitting cooccurrence "stripes" instead of cooccurrence pairs.
>
> A similarity measure must now be expressed by four simple functions that are invoked in different stages of the algorithm:
>
> normalize(row) (invoked once at the beginning, can transform the input row vectors)
>
> norm(row) (invoked once at the beginning, maps an input vector to a double)
>
> aggregate(nonZeroValueInDimensionXFromRowA,nonZeroValueInDimensionXFromRowB) (invoked for each pair of cooccurring elements of two input vectors, can map them to a double, these doubles are summed then)
>
> similarity(summedAggregations,normOfRowA,normOfRowB,numColumns)
> (invoked at the end with the results of the other functions, should compute the final similarity)
>
> A short list of how our commonly used measures can be expressed this way:
>
> cosine:
>
> normalize(...) normalizes the vectors to unit length
> aggregate(...) multiplies valueA and valueB
> similarity(...) just needs to return the summedAggregations (which is the dot-product)
>
> pearson:
>
> same as cosine, but normalize(...) centers the vector elements first (without touching 0s as this would make the input matrix dense)
>
> cooccurrence-count / overlap:
>
> aggregate(...) returns 1 as we only need to count
> similarity(...) returns summedAggregations only
>
> tanimoto/jaccard:
>
> norm(...) returns the l0 norm (number of non-zero elements)
> aggregate(...) returns 1 as we only need to count
> similarity(...) returns summedAggregations / (normOfRowA + normOfRowB - summedAggregations);
>
> loglikelihood:
>
> same as tanimoto, but we compute LLR in similarity(...) from summedAggregations, normOfRowA, normOfRowB, numColumns
>
>
> cityblock/manhattan:
>
> same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB - 2 * summedAggregations);
>
> euclidean distance:
>
> norm(...) returns the summed squares of the input vector
> aggregate(...) multiplies valueA and valueB
> similarity(...) returns 1.0 - 1.0 / (1.0 + Math.sqrt(normA - 2 * summedAggregations + normB))
>
> Unfortunately, we can't use the formula from o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which also incorporates the number of "overlapping" dimensions.
>
>
> I'm currently running tests on a small cluster using several rating datasets (18M music ratings from lastfm, 100M movie ratings from netflix, 250M music ratings from yahoo's KDD cup), results are pretty ok so far, but I'm still searching for ways to improve the algorithm.
>
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing but I wonder whether there are more clever ways to do this.
I assume this will be an option, right? Row similarity obviously goes beyond just recommenders, and sampling shouldn't be required.
>
> There is a sampling heuristic in the distributed recommender that caught my attention a long time ago; can anyone give any details on it? I basically understand what it's doing, but not exactly why that's a good way to sample. Here's the original implementation:
>
> http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
>
> Another issue I'm looking into is whether it would help to specify a similarity threshold. There is some work on "All Pairs Similarity Search" which is closely related to cooccurrence computation that seems to have some nice pruning strategies based on a threshold.
+1
>
> For example, imagine your similarity measure is cooccurrence count and you only want pairs with a count of at least 5; then you would not need to consider any pair where one of the vectors has fewer than 5 non-zero entries.
>
> Something like this could be a way to further improve the algorithm's running time, especially as the resulting similarity values seem to follow a heavy-tailed distribution. It should be possible to find pruning heuristics for all measures, though I don't yet have an idea how to find a good a-priori threshold.
I could also see pruning at the very end on the final distance score, which would potentially save some IO.
>
> I'm thinking about writing a small paper about my work on this (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that; I don't want to "steal" ideas.
>
Well, you can always be the primary and add others. :-)
>
> --sebastian
>
> On 20.07.2011 17:40, Ted Dunning wrote:
>> Not if you are doing Euclidean distance.
>>
>> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <ji...@apache.org> wrote:
>>
>>>
>>> [
>>> https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
>>>
>>> Grant Ingersoll commented on MAHOUT-767:
>>> ----------------------------------------
>>>
>>> Why do we need the norms? Why can't we assume the user has already
>>> normalized?
>>>
>>>> Improve RowSimilarityJob performance for count-based distance measures
>>>> ----------------------------------------------------------------------
>>>>
>>>> Key: MAHOUT-767
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>>> Project: Mahout
>>>> Issue Type: Improvement
>>>> Reporter: Grant Ingersoll
>>>> Fix For: 0.6
>>>>
>>>>
>>>> (See
>>> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>>> Currently, the RowSimilarityJob defers the calculation of the similarity
>>> metric until the reduce phase, while emitting many Cooccurrence objects.
>>> For similarity metrics that are algebraic (
>>> http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
>>> be able to do much of the computation during the Mapper part of this phase
>>> and also take advantage of a Combiner.
>>>> We should use a marker interface to know whether a similarity metric is
>>> algebraic and then make use of an appropriate Mapper implementation,
>>> otherwise we can fall back on our existing implementation.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>>
>>
>
--------------------------
Grant Ingersoll