Posted to user@mahout.apache.org by yamo93 <ya...@gmail.com> on 2012/10/01 10:09:21 UTC

Re: Need to reduce execution time of RowSimilarityJob

I tried your suggestion.

I generated a CSV file with (term, docId, distance) and I used the
method mostSimilarItems with UncenteredCosineSimilarity.
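For reference, a minimal sketch of that setup with Taste's in-memory classes (assuming the term and doc IDs in the CSV have already been mapped to longs, as FileDataModel requires; the file name is made up):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

class MostSimilarDocs {
  public static void main(String[] args) throws Exception {
    // terms play the role of "users", documents the role of "items"
    DataModel model = new FileDataModel(new File("term-doc-weight.csv"));
    ItemSimilarity similarity = new UncenteredCosineSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // the 10 documents most similar to document 42
    List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}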

But this seems to produce wrong results, and I don't understand why.

Any ideas?

On 09/20/2012 11:21 AM, Sean Owen wrote:
> Yes, what he means is that item-item similarity is just a subset of
> what a recommender does. You can use an in-memory recommender based on
> item-item similarity and just ask it for those similarities.
>
> On Thu, Sep 20, 2012 at 10:19 AM, yamo93 <ya...@gmail.com> wrote:
>> Exactly, but I need to improve performance and I thought that an in-memory
>> implementation would be a solution (as mentioned in my first post).
>>
>> So Seb suggested using a recommender for that, but I'm afraid he sent me
>> down the wrong path ...
>>


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
In MIA, it seems impossible to call setPreferenceInferrer for an
ItemSimilarity, doesn't it?
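For comparison, a minimal sketch of where setPreferenceInferrer does apply in Taste, namely on a UserSimilarity (the ItemSimilarity interface exposes no such hook, which matches the observation above):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class InferrerExample {
  // attach an inferrer that fills in missing preferences with a user's mean
  static UserSimilarity withInferrer(DataModel model) throws TasteException {
    UserSimilarity sim = new PearsonCorrelationSimilarity(model);
    sim.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
    return sim; // ItemSimilarity has no equivalent setter
  }
}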

On 10/01/2012 12:37 PM, Sean Owen wrote:
> Yes, this is one of the weaknesses of this particular flavor of this
> particular similarity metric. The more sparse, the worse the problem
> is in general. There are some band-aid solutions like applying some
> kind of weight against similarities based on small intersection size.
> Or you can pretend that missing values are 0 (PreferenceInferrer),
> which can introduce its own problems, or perhaps some mean value.
>
> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>> Thanks for replying.
>>
>> So, documents with only one word in common have a better chance of being
>> similar than documents with more words in common, right?
>>
>>
>>
>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>> Similar items, right? You should look at the vectors that have 1.0
>>> similarity and see if they are in fact collinear. This is still by far
>>> the most likely explanation. Remember that the vector similarity is
>>> computed over elements that exist in both vectors only. They just have
>>> to have 2 identical values for this to happen.
>>>
>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>> It sounds like a bug somewhere.
>>>>
>>>>
>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>> angle between them). It's possible there are several of these, and so
>>>>> their 1.0 similarities dominate the result.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> I saw something strange: all recommended items returned by
>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>> Is it normal?
>>


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.
This is not true.


On 01.10.2012 21:52, bangbig wrote:
> I think it's better to understand how the RowSimilarityJob gets the result.
> For two items,
> itemA, 0, 0,   a1, a2, a3, 0
> itemB, 0, b1, b2, b3, 0  , 0
> when computing, it just uses the overlapping parts of the vectors
> (a1 and a2 from itemA, b2 and b3 from itemB).
> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
> 1) if itemA and itemB have just one common word, the result is 1;
> 2) if the values of the vectors are almost the same, the value would also be nearly 1;
> and for the two cases above, I think you can consider using association rules to address the problem.
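A quick check of case 1) with made-up numbers: if exactly one index is nonzero in both vectors, the co-occurring parts are trivially collinear, so this flavor of cosine is always 1. A minimal self-contained sketch (plain Java, hypothetical values):

class OneCommonTermDemo {
  // cosine computed only over indices where both vectors are nonzero,
  // i.e. the matching-entries-only variant discussed in this thread
  static double cosineOverMatching(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0.0 && b[i] != 0.0) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] itemA = {0, 0, 3, 5, 2, 0};
    double[] itemB = {0, 4, 7, 0, 0, 0}; // only index 2 is shared
    System.out.println(cosineOverMatching(itemA, itemB)); // prints 1.0
  }
}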
> 
> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>> It seems that RowSimilarityJob does not have the same weakness, but I
>> also use CosineSimilarity. Why?
>>
>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>> Yes, this is one of the weaknesses of this particular flavor of this
>>> particular similarity metric. The more sparse, the worse the problem
>>> is in general. There are some band-aid solutions like applying some
>>> kind of weight against similarities based on small intersection size.
>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>> which can introduce its own problems, or perhaps some mean value.
>>>
>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>> Thanks for replying.
>>>>
>>>> So, documents with only one word in common have a better chance of being
>>>> similar than documents with more words in common, right?
>>>>
>>>>
>>>>
>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>> the most likely explanation. Remember that the vector similarity is
>>>>> computed over elements that exist in both vectors only. They just have
>>>>> to have 2 identical values for this to happen.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>> It sounds like a bug somewhere.
>>>>>>
>>>>>>
>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>> Is it normal?
>>>>
>>
> 


Re: Re: Need to reduce execution time of RowSimilarityJob

Posted by bangbig <li...@163.com>.
I think I will try to implement a version tonight!
I have written a package that works directly on Hadoop, dealing with data extracted from Hive.

At 2012-10-03 03:01:32,yamo93 <ya...@gmail.com> wrote:
>You'll find attached a class that implements cosine distance as in
>Hadoop. I've just implemented the core method: itemSimilarity.
>
>On 10/02/2012 02:59 PM, yamo93 wrote:
>> OK, I'll try this evening.
>>
>> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>>> Would you like to create a patch for this?
>>>
>>> On 02.10.2012 14:36, yamo93 wrote:
>>>> +1 for the implementation over all entries.
>>>>
>>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>>> I don't see why documents with only one word in common should have a
>>>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked 
>>>>> if you
>>>>> specify a threshold for the similarity.
>>>>>
>>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>>> problematic for documents, as empty entries have a meaning (0 term
>>>>> occurrences) as opposed to collaborative filtering data.
>>>>>
>>>>> Maybe we should remove UncenteredCosine and create another similarity
>>>>> implementation that computes the cosine correctly over all entries.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>>> Hello Seb,
>>>>>>
>>>>>> As I understand it, the algorithm is the same (except the
>>>>>> normalization part) as UncenteredCosine (with the drawback that
>>>>>> vectors with only one word in common have a distance of 1.0)...
>>>>>> but the results are quite different (is this just an effect of the
>>>>>> consider() method, which removes irrelevant values?) ...
>>>>>>
>>>>>> I looked at the code, but there is almost nothing in
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>>>> understand ...
>>>>>>
>>>>>> Thanks for helping,
>>>>>>
>>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>>>> similarity between the whole vectors.
>>>>>>>
>>>>>>> see
>>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>>> for details
>>>>>>>
>>>>>>> At first both vectors are scaled to unit length in normalize() and
>>>>>>> after
>>>>>>> this their dot product in similarity() (which can be computed from
>>>>>>> elements that exist in both vectors) gives the cosine between those.
>>>>>>>
>>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>>>>> result.
>>>>>>>> For two items,
>>>>>>>> itemA, 0, 0, a1, a2, a3, 0
>>>>>>>> itemB, 0, b1, b2, b3, 0 , 0
>>>>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>>>>> * sqrt(b2*b2 + b3*b3))
>>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>>> 2) if the values of the vectors are almost the same, the value 
>>>>>>>> would
>>>>>>>> also be nearly 1;
>>>>>>>> and for the two cases above, I think you can consider using
>>>>>>>> association rules to address the problem.
>>>>>>>>
>>>>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>>>>>>> but I also use CosineSimilarity. Why?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor 
>>>>>>>>>> of this
>>>>>>>>>> particular similarity metric. The more sparse, the worse the 
>>>>>>>>>> problem
>>>>>>>>>> is in general. There are some band-aid solutions like applying 
>>>>>>>>>> some
>>>>>>>>>> kind of weight against similarities based on small intersection
>>>>>>>>>> size.
>>>>>>>>>> Or you can pretend that missing values are 0 
>>>>>>>>>> (PreferenceInferrer),
>>>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> Thanks for replying.
>>>>>>>>>>>
>>>>>>>>>>> So, documents with only one word in common have a better chance
>>>>>>>>>>> of being similar than documents with more words in common, right?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>>> Similar items, right? You should look at the vectors that 
>>>>>>>>>>>> have 1.0
>>>>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>>>>> by far
>>>>>>>>>>>> the most likely explanation. Remember that the vector
>>>>>>>>>>>> similarity is
>>>>>>>>>>>> computed over elements that exist in both vectors only. They 
>>>>>>>>>>>> just
>>>>>>>>>>>> have
>>>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> For each item, I have 10 recommended items with a value of
>>>>>>>>>>>>> 1.0.
>>>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum 
>>>>>>>>>>>>>> similarity and
>>>>>>>>>>>>>> occurs when two vectors are just scalar multiples of each
>>>>>>>>>>>>>> other (0
>>>>>>>>>>>>>> angle between them). It's possible there are several of 
>>>>>>>>>>>>>> these,
>>>>>>>>>>>>>> and so
>>>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>>>> Is it normal?
>>
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.
Please don't send patches to the mailing list. Here's a guide that tells
you how to contribute to Mahout; it involves opening a JIRA ticket:

https://cwiki.apache.org/MAHOUT/how-to-contribute.html

Your patch uses Java 7, while Mahout is based on Java 6. Furthermore, you
can't simply throw UnsupportedOperationException from most of the methods.

--sebastian


On 02.10.2012 21:01, yamo93 wrote:
> You'll find attached a class that implements cosine distance as in
> Hadoop. I've just implemented the core method: itemSimilarity.
> 
> On 10/02/2012 02:59 PM, yamo93 wrote:
>> OK, I'll try this evening.
>>
>> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>>> Would you like to create a patch for this?
>>>
>>> On 02.10.2012 14:36, yamo93 wrote:
>>>> +1 for the implementation over all entries.
>>>>
>>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>>> I don't see why documents with only one word in common should have a
>>>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked
>>>>> if you
>>>>> specify a threshold for the similarity.
>>>>>
>>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>>> problematic for documents, as empty entries have a meaning (0 term
>>>>> occurrences) as opposed to collaborative filtering data.
>>>>>
>>>>> Maybe we should remove UncenteredCosine and create another similarity
>>>>> implementation that computes the cosine correctly over all entries.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>>> Hello Seb,
>>>>>>
>>>>>> As I understand it, the algorithm is the same (except the
>>>>>> normalization part) as UncenteredCosine (with the drawback that
>>>>>> vectors with only one word in common have a distance of 1.0)...
>>>>>> but the results are quite different (is this just an effect of the
>>>>>> consider() method, which removes irrelevant values?) ...
>>>>>>
>>>>>> I looked at the code, but there is almost nothing in
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>>>> understand ...
>>>>>>
>>>>>> Thanks for helping,
>>>>>>
>>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>>>> similarity between the whole vectors.
>>>>>>>
>>>>>>> see
>>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>>> for details
>>>>>>>
>>>>>>> At first both vectors are scaled to unit length in normalize() and
>>>>>>> after
>>>>>>> this their dot product in similarity() (which can be computed from
>>>>>>> elements that exist in both vectors) gives the cosine between those.
>>>>>>>
>>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>>>>> result.
>>>>>>>> For two items,
>>>>>>>> itemA, 0, 0, a1, a2, a3, 0
>>>>>>>> itemB, 0, b1, b2, b3, 0 , 0
>>>>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>>>>> * sqrt(b2*b2 + b3*b3))
>>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>>> 2) if the values of the vectors are almost the same, the value
>>>>>>>> would
>>>>>>>> also be nearly 1;
>>>>>>>> and for the two cases above, I think you can consider using
>>>>>>>> association rules to address the problem.
>>>>>>>>
>>>>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>>>>>>> but I also use CosineSimilarity. Why?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor
>>>>>>>>>> of this
>>>>>>>>>> particular similarity metric. The more sparse, the worse the
>>>>>>>>>> problem
>>>>>>>>>> is in general. There are some band-aid solutions like applying
>>>>>>>>>> some
>>>>>>>>>> kind of weight against similarities based on small intersection
>>>>>>>>>> size.
>>>>>>>>>> Or you can pretend that missing values are 0
>>>>>>>>>> (PreferenceInferrer),
>>>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> Thanks for replying.
>>>>>>>>>>>
>>>>>>>>>>> So, documents with only one word in common have a better chance
>>>>>>>>>>> of being similar than documents with more words in common, right?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>>> Similar items, right? You should look at the vectors that
>>>>>>>>>>>> have 1.0
>>>>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>>>>> by far
>>>>>>>>>>>> the most likely explanation. Remember that the vector
>>>>>>>>>>>> similarity is
>>>>>>>>>>>> computed over elements that exist in both vectors only. They
>>>>>>>>>>>> just
>>>>>>>>>>>> have
>>>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> For each item, I have 10 recommended items with a value of
>>>>>>>>>>>>> 1.0.
>>>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum
>>>>>>>>>>>>>> similarity and
>>>>>>>>>>>>>> occurs when two vectors are just scalar multiples of each
>>>>>>>>>>>>>> other (0
>>>>>>>>>>>>>> angle between them). It's possible there are several of
>>>>>>>>>>>>>> these,
>>>>>>>>>>>>>> and so
>>>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>>>> Is it normal?
>>
> 


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
You'll find attached a class that implements cosine distance as in
Hadoop. I've just implemented the core method: itemSimilarity.
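The attachment itself is not preserved in this archive. A minimal sketch of what such a class could look like, assuming Taste's DataModel as input and computing the norms over all entries of each item vector, as RowSimilarityJob's cosine effectively does (the class name is made up; this is not the actual attached patch):

import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public final class WholeVectorCosineSimilarity implements ItemSimilarity {

  private final DataModel model;

  public WholeVectorCosineSimilarity(DataModel model) {
    this.model = model;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    PreferenceArray prefs1 = model.getPreferencesForItem(itemID1);
    PreferenceArray prefs2 = model.getPreferencesForItem(itemID2);
    // index the second item's vector by user ID for the dot product
    FastByIDMap<Float> values2 = new FastByIDMap<Float>();
    double norm2 = 0.0;
    for (Preference p : prefs2) {
      values2.put(p.getUserID(), p.getValue());
      norm2 += p.getValue() * (double) p.getValue();
    }
    double dot = 0.0;
    double norm1 = 0.0;
    for (Preference p : prefs1) {
      norm1 += p.getValue() * (double) p.getValue();
      Float v2 = values2.get(p.getUserID());
      if (v2 != null) {
        dot += p.getValue() * v2; // entries missing on either side add 0
      }
    }
    // norms run over ALL entries, unlike UncenteredCosineSimilarity
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public long[] allSimilarItemIDs(long itemID) throws TasteException {
    FastIDSet ids = new FastIDSet();
    LongPrimitiveIterator it = model.getItemIDs();
    while (it.hasNext()) {
      long other = it.nextLong();
      if (other != itemID) {
        ids.add(other); // cosine is defined against every other item
      }
    }
    return ids.toArray();
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // nothing cached here, nothing to refresh
  }
}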

On 10/02/2012 02:59 PM, yamo93 wrote:
> OK, I'll try this evening.
>
> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>> Would you like to create a patch for this?
>>
>> On 02.10.2012 14:36, yamo93 wrote:
>>> +1 for the implementation over all entries.
>>>
>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>> I don't see why documents with only one word in common should have a
>>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked 
>>>> if you
>>>> specify a threshold for the similarity.
>>>>
>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>> problematic for documents, as empty entries have a meaning (0 term
>>>> occurrences) as opposed to collaborative filtering data.
>>>>
>>>> Maybe we should remove UncenteredCosine and create another similarity
>>>> implementation that computes the cosine correctly over all entries.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>> Hello Seb,
>>>>>
>>>>> As I understand it, the algorithm is the same (except the
>>>>> normalization part) as UncenteredCosine (with the drawback that
>>>>> vectors with only one word in common have a distance of 1.0)...
>>>>> but the results are quite different (is this just an effect of the
>>>>> consider() method, which removes irrelevant values?) ...
>>>>>
>>>>> I looked at the code, but there is almost nothing in
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>>> understand ...
>>>>>
>>>>> Thanks for helping,
>>>>>
>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>>> similarity between the whole vectors.
>>>>>>
>>>>>> see
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>> for details
>>>>>>
>>>>>> At first both vectors are scaled to unit length in normalize() and
>>>>>> after
>>>>>> this their dot product in similarity() (which can be computed from
>>>>>> elements that exist in both vectors) gives the cosine between those.
>>>>>>
>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>>>> result.
>>>>>>> For two items,
>>>>>>> itemA, 0, 0, a1, a2, a3, 0
>>>>>>> itemB, 0, b1, b2, b3, 0 , 0
>>>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>>>> * sqrt(b2*b2 + b3*b3))
>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>> 2) if the values of the vectors are almost the same, the value 
>>>>>>> would
>>>>>>> also be nearly 1;
>>>>>>> and for the two cases above, I think you can consider using
>>>>>>> association rules to address the problem.
>>>>>>>
>>>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>>>>>> but I also use CosineSimilarity. Why?
>>>>>>>>
>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor 
>>>>>>>>> of this
>>>>>>>>> particular similarity metric. The more sparse, the worse the 
>>>>>>>>> problem
>>>>>>>>> is in general. There are some band-aid solutions like applying 
>>>>>>>>> some
>>>>>>>>> kind of weight against similarities based on small intersection
>>>>>>>>> size.
>>>>>>>>> Or you can pretend that missing values are 0 
>>>>>>>>> (PreferenceInferrer),
>>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>> Thanks for replying.
>>>>>>>>>>
>>>>>>>>>> So, documents with only one word in common have a better chance
>>>>>>>>>> of being similar than documents with more words in common, right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>> Similar items, right? You should look at the vectors that 
>>>>>>>>>>> have 1.0
>>>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>>>> by far
>>>>>>>>>>> the most likely explanation. Remember that the vector
>>>>>>>>>>> similarity is
>>>>>>>>>>> computed over elements that exist in both vectors only. They 
>>>>>>>>>>> just
>>>>>>>>>>> have
>>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> For each item, I have 10 recommended items with a value of
>>>>>>>>>>>> 1.0.
>>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum 
>>>>>>>>>>>>> similarity and
>>>>>>>>>>>>> occurs when two vectors are just scalar multiples of each
>>>>>>>>>>>>> other (0
>>>>>>>>>>>>> angle between them). It's possible there are several of 
>>>>>>>>>>>>> these,
>>>>>>>>>>>>> and so
>>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>>> Is it normal?
>


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
OK, I'll try this evening.

On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
> Would you like to create a patch for this?
>
> On 02.10.2012 14:36, yamo93 wrote:
>> +1 for the implementation over all entries.
>>
>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>> I don't see why documents with only one word in common should have a
>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>>> specify a threshold for the similarity.
>>>
>>> UncenteredCosineSimilarity works on matching entries only, which is
>>> problematic for documents, as empty entries have a meaning (0 term
>>> occurrences) as opposed to collaborative filtering data.
>>>
>>> Maybe we should remove UncenteredCosine and create another similarity
>>> implementation that computes the cosine correctly over all entries.
>>>
>>> --sebastian
>>>
>>>
>>> On 02.10.2012 10:08, yamo93 wrote:
>>>> Hello Seb,
>>>>
>>>> As I understand it, the algorithm is the same (except the normalization
>>>> part) as UncenteredCosine (with the drawback that vectors with only one
>>>> word in common have a distance of 1.0)... but the results are quite
>>>> different (is this just an effect of the consider() method, which removes
>>>> irrelevant values?) ...
>>>>
>>>> I looked at the code, but there is almost nothing in
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>> understand ...
>>>>
>>>> Thanks for helping,
>>>>
>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>> similarity between the whole vectors.
>>>>>
>>>>> see
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>> for details
>>>>>
>>>>> At first both vectors are scaled to unit length in normalize() and
>>>>> after
>>>>> this their dot product in similarity() (which can be computed from
>>>>> elements that exist in both vectors) gives the cosine between those.
>>>>>
>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>>> result.
>>>>>> For two items,
>>>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>>> * sqrt(b2*b2 + b3*b3))
>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>>> also be nearly 1;
>>>>>> and for the two cases above, I think you can consider using
>>>>>> association rules to address the problem.
>>>>>>
>>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>>> also use CosineSimilarity. Why?
>>>>>>>
>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>>>> kind of weight against similarities based on small intersection
>>>>>>>> size.
>>>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> Thanks for replying.
>>>>>>>>>
>>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>>> by far
>>>>>>>>>> the most likely explanation. Remember that the vector
>>>>>>>>>> similarity is
>>>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>>>> have
>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>>> occurs when two vectors are just scalar multiples of each
>>>>>>>>>>>> other (0
>>>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>>>> and so
>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>> Is it normal?


Re: Re: Need to reduce execution time of RowSimilarityJob

Posted by bangbig <li...@163.com>.
Yes, you are right!

At 2012-10-02 04:25:09,"Sebastian Schelter" <ss...@apache.org> wrote:
>The cosine similarity as computed by RowSimilarityJob is the cosine
>similarity between the whole vectors.
>
>see
>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>for details
>
>At first both vectors are scaled to unit length in normalize() and after
>this their dot product in similarity() (which can be computed from
>elements that exist in both vectors) gives the cosine between those.
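A small self-contained sketch of that two-step computation, with made-up dense vectors, showing why the dot product over co-occurring entries of unit-length vectors equals the cosine over the whole vectors:

class NormalizeThenDot {
  public static void main(String[] args) {
    double[] a = {0, 0, 2, 4, 4, 0};
    double[] b = {0, 3, 6, 12, 0, 0};
    // step 1: scale both vectors to unit length, as in normalize()
    double na = norm(a), nb = norm(b);
    // step 2: dot product of the scaled vectors, as in similarity();
    // indices where either entry is 0 contribute nothing, so iterating
    // only over co-occurring entries yields the same value
    double dot = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += (a[i] / na) * (b[i] / nb);
    }
    System.out.println(dot); // the cosine over the WHOLE vectors
  }

  static double norm(double[] v) {
    double s = 0.0;
    for (double x : v) {
      s += x * x;
    }
    return Math.sqrt(s);
  }
}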
>
>On 01.10.2012 21:52, bangbig wrote:
>> I think it's better to understand how the RowSimilarityJob gets the result.
>> For two items, 
>> itemA, 0, 0,   a1, a2, a3, 0
>> itemB, 0, b1, b2, b3, 0  , 0
>> when computing, it just uses the overlapping parts of the vectors
>> (a1 and a2 from itemA, b2 and b3 from itemB).
>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>> 1) if itemA and itemB have just one common word, the result is 1;
>> 2) if the values of the vectors are almost the same, the value would also be nearly 1;
>> and for the two cases above, I think you can consider using association rules to address the problem.
>> 
>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>> also use CosineSimilarity. Why?
>>>
>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>> particular similarity metric. The more sparse, the worse the problem
>>>> is in general. There are some band-aid solutions like applying some
>>>> kind of weight against similarities based on small intersection size.
>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>> which can introduce its own problems, or perhaps some mean value.
>>>>
>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>> Thanks for replying.
>>>>>
>>>>> So, documents with only one word in common have a better chance of being
>>>>> similar than documents with more words in common, right?
>>>>>
>>>>>
>>>>>
>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>> computed over elements that exist in both vectors only. They just have
>>>>>> to have 2 identical values for this to happen.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>> It sounds like a bug somewhere.
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>> Is it normal?
>>>>>
>>>
>> 
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.
Would you like to create a patch for this?

On 02.10.2012 14:36, yamo93 wrote:
> +1 for the implementation over all entries.
> 
> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>> I don't see why documents with only one word in common should have a
>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>> specify a threshold for the similarity.
>>
>> UncenteredCosineSimilarity works on matching entries only, which is
>> problematic for documents, as empty entries have a meaning (0 term
>> occurrences) as opposed to collaborative filtering data.
>>
>> Maybe we should remove UncenteredCosine and create another similarity
>> implementation that computes the cosine correctly over all entries.
>>
>> --sebastian
>>
>>
>> On 02.10.2012 10:08, yamo93 wrote:
>>> Hello Seb,
>>>
>>> As I understand it, the algorithm is the same (except the normalization
>>> part) as UncenteredCosine (with the drawback that vectors with only one
>>> word in common have a distance of 1.0)... but the results are quite
>>> different (is this just an effect of the consider() method, which removes
>>> irrelevant values?) ...
>>>
>>> I looked at the code, but there is almost nothing in
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>> the code seems to be in SimilarityReducer, which is not so simple to
>>> understand ...
>>>
>>> Thanks for helping,
>>>
>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>> similarity between the whole vectors.
>>>>
>>>> see
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>> for details
>>>>
>>>> At first both vectors are scaled to unit length in normalize() and
>>>> after
>>>> this their dot product in similarity() (which can be computed from
>>>> elements that exist in both vectors) gives the cosine between those.
>>>>
>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>> result.
>>>>> For two items,
>>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>> * sqrt(b2*b2 + b3*b3))
>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>> also be nearly 1;
>>>>> and for the two cases above, I think you can consider using
>>>>> association rules to address the problem.
>>>>>
>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>> also use CosineSimilarity. Why?
>>>>>>
>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>>> kind of weight against similarities based on small intersection
>>>>>>> size.
>>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>> Thanks for replying.
>>>>>>>>
>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>> by far
>>>>>>>>> the most likely explanation. Remember that the vector
>>>>>>>>> similarity is
>>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>>> have
>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>> occurs when two vectors are just scalar multiples of each
>>>>>>>>>>> other (0
>>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>>> and so
>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>> Is it normal?
> 


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
+1 for the implementation over all entries.

On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
> I don't see why documents with only one word in common should have a
> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
> specify a threshold for the similarity.
>
> UncenteredCosineSimilarity works on matching entries only, which is
> problematic for documents, as empty entries have a meaning (0 term
> occurrences) as opposed to collaborative filtering data.
>
> Maybe we should remove UncenteredCosine and create another similarity
> implementation that computes the cosine correctly over all entries.
>
> --sebastian
>
>
> On 02.10.2012 10:08, yamo93 wrote:
>> Hello Seb,
>>
>> As I understand it, the algorithm is the same (except the normalization
>> part) as UncenteredCosine (with the drawback that vectors with only one
>> word in common have a distance of 1.0)... but the results are quite
>> different (is this just an effect of the consider() method, which removes
>> irrelevant values?) ...
>>
>> I looked at the code, but there is almost nothing in
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>> the code seems to be in SimilarityReducer, which is not so simple to
>> understand ...
>>
>> Thanks for helping,
>>
>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>> similarity between the whole vectors.
>>>
>>> see
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>> for details
>>>
>>> At first both vectors are scaled to unit length in normalize() and after
>>> this their dot product in similarity() (which can be computed from
>>> elements that exist in both vectors) gives the cosine between those.
>>>
>>> On 01.10.2012 21:52, bangbig wrote:
>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>> result.
>>>> For two items,
>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>> when computing, it just uses the overlapping parts of the vectors
>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>> * sqrt(b2*b2 + b3*b3))
>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>> 2) if the values of the vectors are almost the same, the value would
>>>> also be nearly 1;
>>>> and for the two cases above, I think you can consider using
>>>> association rules to address the problem.
>>>>
>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>> also use CosineSimilarity. Why?
>>>>>
>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>> kind of weight against similarities based on small intersection size.
>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>> Thanks for replying.
>>>>>>>
>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>> being similar than documents with more words in common, right?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>> by far
>>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>> have
>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>> and so
>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>> Is it normal?


Re: Re: Re: Need to reduce execution time of RowSimilarityJob

Posted by bangbig <li...@163.com>.
By the way, note that distance is not the same thing as similarity.
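(For cosine, the usual convention is distance = 1 - similarity; the values discussed in this thread, such as 1 and 1/sqrt(3) in the example below, are similarities, i.e. the cosine itself.)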

At 2012-10-02 19:43:58,bangbig <li...@163.com> wrote:
>Yes, you get it.
>I thought RowSimilarityJob was from Taste when I wrote the previous email.
>
>At 2012-10-02 19:26:48,yamo93 <ya...@gmail.com> wrote:
>>OK, I think I understood.
>>
>>Let's take an example with two vectors (1,1,1) and (0,1,0).
>>With UncenteredCosineSimilarity (as implemented in Taste), the distance is 1.
>>With Cosine (as implemented in RowSimilarityJob), the distance is 1/sqrt(3).
>>
>>OK?
>>
>>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>> I don't see why documents with only one word in common should have a
>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>>> specify a threshold for the similarity.
>>>
>>> UncenteredCosineSimilarity works on matching entries only, which is
>>> problematic for documents, as empty entries have a meaning (0 term
>>> occurrences) as opposed to collaborative filtering data.
>>>
>>> Maybe we should remove UncenteredCosine and create another similarity
>>> implementation that computes the cosine correctly over all entries.
>>>
>>> --sebastian
>>>
>>>
>>> On 02.10.2012 10:08, yamo93 wrote:
>>>> Hello Seb,
>>>>
>>>> As I understand it, the algorithm is the same (except the normalization
>>>> part) as UncenteredCosine (with the drawback that vectors with only one
>>>> word in common have a distance of 1.0)... but the results are quite
>>>> different (is this just an effect of the consider() method, which removes
>>>> irrelevant values?) ...
>>>>
>>>> I looked at the code, but there is almost nothing in
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>> understand ...
>>>>
>>>> Thanks for helping,
>>>>
>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>> similarity between the whole vectors.
>>>>>
>>>>> see
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>> for details
>>>>>
>>>>> At first both vectors are scaled to unit length in normalize() and after
>>>>> this their dot product in similarity() (which can be computed from
>>>>> elements that exist in both vectors) gives the cosine between those.
>>>>>
>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>>> result.
>>>>>> For two items,
>>>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>>> * sqrt(b2*b2 + b3*b3))
>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>>> also be nearly 1;
>>>>>> and for the two cases above, I think you can consider using
>>>>>> association rules to address the problem.
>>>>>>
>>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>>> also use CosineSimilarity. Why?
>>>>>>>
>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>>>> kind of weight against similarities based on small intersection size.
>>>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> Thanks for replying.
>>>>>>>>>
>>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>>> by far
>>>>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>>>> have
>>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>>>> and so
>>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>>> Is it normal?
>>

Re: Re: Need to reduce execution time of RowSimilarityJob

Posted by bangbig <li...@163.com>.
Yes, you get it.
I thought RowSimilarityJob was from Taste when I wrote the previous email.

At 2012-10-02 19:26:48,yamo93 <ya...@gmail.com> wrote:
>OK, I think I understood.
>
>Let's take an example with two vectors (1,1,1) and (0,1,0).
>With UncenteredCosineSimilarity (as implemented in Taste), the distance is 1.
>With Cosine (as implemented in RowSimilarityJob), the distance is 1/sqrt(3).
>
>OK?
>
>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>> I don't see why documents with only one word in common should have a
>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>> specify a threshold for the similarity.
>>
>> UncenteredCosineSimilarity works on matching entries only, which is
>> problematic for documents, as empty entries have a meaning (0 term
>> occurrences) as opposed to collaborative filtering data.
>>
>> Maybe we should remove UncenteredCosine and create another similarity
>> implementation that computes the cosine correctly over all entries.
>>
>> --sebastian
>>
>>
>> On 02.10.2012 10:08, yamo93 wrote:
>>> Hello Seb,
>>>
>>> As I understand it, the algorithm is the same (except the normalization
>>> part) as UncenteredCosine (with the drawback that vectors with only one
>>> word in common have a distance of 1.0)... but the results are quite
>>> different (is this just an effect of the consider() method, which removes
>>> irrelevant values?) ...
>>>
>>> I looked at the code, but there is almost nothing in
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>> the code seems to be in SimilarityReducer, which is not so simple to
>>> understand ...
>>>
>>> Thanks for helping,
>>>
>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>> similarity between the whole vectors.
>>>>
>>>> see
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>> for details
>>>>
>>>> At first both vectors are scaled to unit length in normalize() and after
>>>> this their dot product in similarity() (which can be computed from
>>>> elements that exist in both vectors) gives the cosine between those.
>>>>
>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>>> result.
>>>>> For two items,
>>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>>> when computing, it just uses the overlapping parts of the vectors
>>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>>> * sqrt(b2*b2 + b3*b3))
>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>> also be nearly 1;
>>>>> and for the two cases above, I think you can consider using
>>>>> association rules to address the problem.
>>>>>
>>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>> also use CosineSimilarity. Why?
>>>>>>
>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>>> kind of weight against similarities based on small intersection size.
>>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>> Thanks for replying.
>>>>>>>>
>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>>> by far
>>>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>>> have
>>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>> occurs when two vectors are just scalar multiples of each other (0
>>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>>> and so
>>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>>>>> Is it normal?
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
OK, I think I understood.

Let's take an example with two vectors (1,1,1) and (0,1,0).
With UncenteredCosineSimilarity (as implemented in Taste), the distance is 1.
With Cosine (as implemented in RowSimilarityJob), the distance is 1/sqrt(3).

OK?
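(Checking the arithmetic: only the middle entry is nonzero in both vectors, so the matching-entries-only computation sees (1) against (1) and gives 1*1/(1*1) = 1. Over the whole vectors, the dot product is 0+1+0 = 1 and the norms are sqrt(3) and 1, so the cosine is 1/(sqrt(3)*1) = 1/sqrt(3) ≈ 0.577.)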

On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
> I don't see why documents with only one word in common should have a
> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
> specify a threshold for the similarity.
>
> UncenteredCosineSimilarity works on matching entries only, which is
> problematic for documents, as empty entries have a meaning (0 term
> occurrences) as opposed to collaborative filtering data.
>
> Maybe we should remove UncenteredCosine and create another similarity
> implementation that computes the cosine correctly over all entries.
>
> --sebastian
>
>
> On 02.10.2012 10:08, yamo93 wrote:
>> Hello Seb,
>>
>> As I understand it, the algorithm is the same (except the normalization
>> part) as UncenteredCosine (with the drawback that vectors with only one
>> word in common have a distance of 1.0)... but the results are quite
>> different (is this just an effect of the consider() method, which removes
>> irrelevant values?) ...
>>
>> I looked at the code, but there is almost nothing in
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>> the code seems to be in SimilarityReducer, which is not so simple to
>> understand ...
>>
>> Thanks for helping,
>>
>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>> similarity between the whole vectors.
>>>
>>> see
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>> for details
>>>
>>> At first both vectors are scaled to unit length in normalize() and after
>>> this their dot product in similarity() (which can be computed from
>>> elements that exist in both vectors) gives the cosine between those.
>>>
>>> On 01.10.2012 21:52, bangbig wrote:
>>>> I think it's better to understand how the RowSimilarityJob gets the
>>>> result.
>>>> For two items,
>>>> itemA, 0, 0,   a1, a2, a3, 0
>>>> itemB, 0, b1, b2, b3, 0  , 0
>>>> when computing, it just uses the overlapping parts of the vectors
>>>> (a1 and a2 from itemA, b2 and b3 from itemB).
>>>> The cosine similarity thus is (a1*b2 + a2*b3)/(sqrt(a1*a1 + a2*a2)
>>>> * sqrt(b2*b2 + b3*b3))
>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>> 2) if the values of the vectors are almost the same, the value would
>>>> also be nearly 1;
>>>> and for the two cases above, I think you can consider using
>>>> association rules to address the problem.
>>>>
>>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>> also use CosineSimilarity. Why?
>>>>>
>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>>> is in general. There are some band-aid solutions like applying some
>>>>>> kind of weight against similarities based on small intersection size.
>>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>> Thanks for replying.
>>>>>>>
>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>> being similar than documents with more words in common, right?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>>> by far
>>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>>> have
>>>>>>>> to have 2 identical values for this to happen.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>>> and so
>>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>> Is it normal?


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.
I don't see why documents with only one word in common should have a
similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
specify a threshold for the similarity.

UncenteredCosineSimilarity works on matching entries only, which is
problematic for documents, as empty entries have a meaning (0 term
occurrences) as opposed to collaborative filtering data.

Maybe we should remove UncenteredCosine and create another similarity
implementation that computes the cosine correctly over all entries.
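
To illustrate the difference, here is a minimal sketch with plain double[]
vectors (cosineAllEntries and cosineMatchingOnly are made-up names, not
Mahout's actual code):

    // Cosine over all entries: a zero in either vector still counts in
    // that vector's norm, so a document is penalized for terms it lacks.
    static double cosineAllEntries(double[] a, double[] b) {
      double dot = 0, normA = 0, normB = 0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine over matching entries only, as UncenteredCosineSimilarity
    // does: positions where either vector is zero are skipped entirely.
    static double cosineMatchingOnly(double[] a, double[] b) {
      double dot = 0, normA = 0, normB = 0;
      for (int i = 0; i < a.length; i++) {
        if (a[i] != 0 && b[i] != 0) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

With a single shared term and positive weights, cosineMatchingOnly always
returns 1.0, while cosineAllEntries does not.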

--sebastian


On 02.10.2012 10:08, yamo93 wrote:
> Hello Seb,
> 
> In my understanding, the algorithm is the same (except for the normalization
> part) as UncenteredCosine (with the drawback that vectors with only one
> word in common have a similarity of 1.0)... but the results are quite
> different (is this just an effect of the consider() method, which removes
> irrelevant values?) ...
> 
> I looked at the code but there is almost nothing in
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity,
> the code seems to be in SimilarityReducer which is not so simple to
> understand ...
> 
> Thanks for helping,
> 
> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>> The cosine similarity as computed by RowSimilarityJob is the cosine
>> similarity between the whole vectors.
>>
>> see
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>
>> for details
>>
>> First, both vectors are scaled to unit length in normalize(), and then
>> their dot product in similarity() (which can be computed from the
>> elements that exist in both vectors) gives the cosine between them.
>>
>> On 01.10.2012 21:52, bangbig wrote:
>>> I think it's better to understand how the RowSimilarityJob gets the
>>> result.
>>> For two items,
>>> itemA, 0, 0,   a1, a2, a3, 0
>>> itemB, 0, b1, b2, b3, 0  , 0
>>> when computing, it just uses the overlapping parts of the vectors (a1,
>>> a2 and b2, b3).
>>> the cosine similarity thus is
>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>>> 1) if itemA and itemB have just one common word, the result is 1;
>>> 2) if the values of the vectors are almost the same, the value would
>>> also be nearly 1;
>>> and for the two cases above, I think you could consider using
>>> association rules to address the problem.
>>>
>>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>> also use CosineSimilarity. Why?
>>>>
>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>> particular similarity metric. The more sparse, the worse the problem
>>>>> is in general. There are some band-aid solutions like applying some
>>>>> kind of weight against similarities based on small intersection size.
>>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>>> which can introduce its own problems, or perhaps some mean value.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> Thanks for replying.
>>>>>>
>>>>>> So, documents with only one word in common have a better chance of
>>>>>> being similar than documents with more words in common, right?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>> similarity and see if they are in fact collinear. This is still
>>>>>>> by far
>>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>>> computed over elements that exist in both vectors only. They just
>>>>>>> have
>>>>>>> to have 2 identical values for this to happen.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>>>> angle between them). It's possible there are several of these,
>>>>>>>>> and so
>>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>> Is it normal?
> 


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
Hello Seb,

In my understanding, the algorithm is the same (except for the normalization
part) as UncenteredCosine (with the drawback that vectors with only one
word in common have a similarity of 1.0)... but the results are quite
different (is this just an effect of the consider() method, which removes
irrelevant values?) ...

I looked at the code but there is almost nothing in
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity, 
the code seems to be in SimilarityReducer which is not so simple to 
understand ...

Thanks for helping,

On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
> The cosine similarity as computed by RowSimilarityJob is the cosine
> similarity between the whole vectors.
>
> see
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
> for details
>
> First, both vectors are scaled to unit length in normalize(), and then
> their dot product in similarity() (which can be computed from the
> elements that exist in both vectors) gives the cosine between them.
>
> On 01.10.2012 21:52, bangbig wrote:
>> I think it's better to understand how the RowSimilarityJob gets the result.
>> For two items,
>> itemA, 0, 0,   a1, a2, a3, 0
>> itemB, 0, b1, b2, b3, 0  , 0
>> when computing, it just uses the overlapping parts of the vectors (a1, a2 and b2, b3).
>> the cosine similarity thus is (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>> 1) if itemA and itemB have just one common word, the result is 1;
>> 2) if the values of the vectors are almost the same, the value would also be nearly 1;
>> and for the two cases above, I think you could consider using association rules to address the problem.
>>
>> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>> also use CosineSimilarity. Why?
>>>
>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>> particular similarity metric. The more sparse, the worse the problem
>>>> is in general. There are some band-aid solutions like applying some
>>>> kind of weight against similarities based on small intersection size.
>>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>>> which can introduce its own problems, or perhaps some mean value.
>>>>
>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>>> Thanks for replying.
>>>>>
>>>>> So, documents with only one word in common have a better chance of being similar
>>>>> than documents with more words in common, right?
>>>>>
>>>>>
>>>>>
>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>>> the most likely explanation. Remember that the vector similarity is
>>>>>> computed over elements that exist in both vectors only. They just have
>>>>>> to have 2 identical values for this to happen.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>> It sounds like a bug somewhere.
>>>>>>>
>>>>>>>
>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>> Is it normal?


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.
The cosine similarity as computed by RowSimilarityJob is the cosine
similarity between the whole vectors.

see
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
for details

First, both vectors are scaled to unit length in normalize(), and then
their dot product in similarity() (which can be computed from the
elements that exist in both vectors) gives the cosine between them.
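
A minimal sketch of that two-step scheme (using simple term-index-to-weight
maps instead of Mahout's Vector classes; normalize and similarity here only
mirror the roles of the methods named above):

    import java.util.HashMap;
    import java.util.Map;

    // normalize(): scale a row to unit length, using the norm over ALL of
    // its entries, including those the other row does not share.
    static Map<Integer, Double> normalize(Map<Integer, Double> row) {
      double norm = 0;
      for (double v : row.values()) {
        norm += v * v;
      }
      double length = Math.sqrt(norm);
      Map<Integer, Double> scaled = new HashMap<Integer, Double>();
      for (Map.Entry<Integer, Double> e : row.entrySet()) {
        scaled.put(e.getKey(), e.getValue() / length);
      }
      return scaled;
    }

    // similarity(): dot product of two normalized rows. Only dimensions
    // present in both rows contribute to the sum, yet because each norm
    // already covered the whole vector, the result is the true cosine.
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
      double dot = 0;
      for (Map.Entry<Integer, Double> e : a.entrySet()) {
        Double other = b.get(e.getKey());
        if (other != null) {
          dot += e.getValue() * other;
        }
      }
      return dot;
    }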

On 01.10.2012 21:52, bangbig wrote:
> I think it's better to understand how the RowSimilarityJob gets the result.
> For two items, 
> itemA, 0, 0,   a1, a2, a3, 0
> itemB, 0, b1, b2, b3, 0  , 0
> when computing, it just uses the overlapping parts of the vectors (a1, a2 and b2, b3).
> the cosine similarity thus is (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
> 1) if itemA and itemB have just one common word, the result is 1;
> 2) if the values of the vectors are almost the same, the value would also be nearly 1;
> and for the two cases above, I think you could consider using association rules to address the problem.
> 
> At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>> It seems that RowSimilarityJob does not have the same weakness, but I
>> also use CosineSimilarity. Why?
>>
>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>> Yes, this is one of the weaknesses of this particular flavor of this
>>> particular similarity metric. The more sparse, the worse the problem
>>> is in general. There are some band-aid solutions like applying some
>>> kind of weight against similarities based on small intersection size.
>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>> which can introduce its own problems, or perhaps some mean value.
>>>
>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>>> Thanks for replying.
>>>>
>>>> So, documents with only one word in common have a better chance of being similar
>>>> than documents with more words in common, right?
>>>>
>>>>
>>>>
>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>> the most likely explanation. Remember that the vector similarity is
>>>>> computed over elements that exist in both vectors only. They just have
>>>>> to have 2 identical values for this to happen.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>> It sounds like a bug somewhere.
>>>>>>
>>>>>>
>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>> Is it normal?
>>>>
>>
> 


Re:Re: Need to reduce execution time of RowSimilarityJob

Posted by bangbig <li...@163.com>.
I think it's better to understand how the RowSimilarityJob gets the result.
For two items, 
itemA, 0, 0,   a1, a2, a3, 0
itemB, 0, b1, b2, b3, 0  , 0
when computing, it just uses the overlapping parts of the vectors (a1, a2 and b2, b3).
the cosine similarity thus is (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
1) if itemA and itemB have just one common word, the result is 1;
2) if the values of the vectors are almost the same, the value would also be nearly 1;
and for the two cases above, I think you could consider using association rules to address the problem.
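
A quick worked example with made-up numbers for case 1): under this
matching-entries-only formula, if the single overlapping pair is a1 = 3 and
b2 = 5, the similarity is (3*5) / (sqrt(3*3) * sqrt(5*5)) = 15 / 15 = 1,
whatever the two values actually are.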

At 2012-10-01 20:53:16,yamo93 <ya...@gmail.com> wrote:
>It seems that RowSimilarityJob does not have the same weakness, but I
>also use CosineSimilarity. Why?
>
>On 10/01/2012 12:37 PM, Sean Owen wrote:
>> Yes, this is one of the weaknesses of this particular flavor of this
>> particular similarity metric. The more sparse, the worse the problem
>> is in general. There are some band-aid solutions like applying some
>> kind of weight against similarities based on small intersection size.
>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>> which can introduce its own problems, or perhaps some mean value.
>>
>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>>> Thanks for replying.
>>>
>>> So, documents with only one word in common have a better chance of being similar
>>> than documents with more words in common, right?
>>>
>>>
>>>
>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>> Similar items, right? You should look at the vectors that have 1.0
>>>> similarity and see if they are in fact collinear. This is still by far
>>>> the most likely explanation. Remember that the vector similarity is
>>>> computed over elements that exist in both vectors only. They just have
>>>> to have 2 identical values for this to happen.
>>>>
>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>> It sounds like a bug somewhere.
>>>>>
>>>>>
>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>> angle between them). It's possible there are several of these, and so
>>>>>> their 1.0 similarities dominate the result.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>> Is it normal?
>>>
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
It seems that RowSimilarityJob does not have the same weakness, but I
also use CosineSimilarity. Why?

On 10/01/2012 12:37 PM, Sean Owen wrote:
> Yes, this is one of the weaknesses of this particular flavor of this
> particular similarity metric. The more sparse, the worse the problem
> is in general. There are some band-aid solutions like applying some
> kind of weight against similarities based on small intersection size.
> Or you can pretend that missing values are 0 (PreferenceInferrer),
> which can introduce its own problems, or perhaps some mean value.
>
> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>> Thanks for replying.
>>
>> So, documents with only one word in common have a better chance of being similar
>> than documents with more words in common, right?
>>
>>
>>
>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>> Similar items, right? You should look at the vectors that have 1.0
>>> similarity and see if they are in fact collinear. This is still by far
>>> the most likely explanation. Remember that the vector similarity is
>>> computed over elements that exist in both vectors only. They just have
>>> to have 2 identical values for this to happen.
>>>
>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>> It sounds like a bug somewhere.
>>>>
>>>>
>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>> angle between them). It's possible there are several of these, and so
>>>>> their 1.0 similarities dominate the result.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> I saw something strange: all recommended items, returned by
>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>> Is it normal?
>>


Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
OK, you're right: I have some sparseness in my data.

Would you advise another similarity algorithm for text-based data?

On 10/01/2012 12:37 PM, Sean Owen wrote:
> Yes, this is one of the weaknesses of this particular flavor of this
> particular similarity metric. The more sparse, the worse the problem
> is in general. There are some band-aid solutions like applying some
> kind of weight against similarities based on small intersection size.
> Or you can pretend that missing values are 0 (PreferenceInferrer),
> which can introduce its own problems, or perhaps some mean value.
>
> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
>> Thanks for replying.
>>
>> So, documents with only one word in common have a better chance of being similar
>> than documents with more words in common, right?
>>
>>
>>
>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>> Similar items, right? You should look at the vectors that have 1.0
>>> similarity and see if they are in fact collinear. This is still by far
>>> the most likely explanation. Remember that the vector similarity is
>>> computed over elements that exist in both vectors only. They just have
>>> to have 2 identical values for this to happen.
>>>
>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>> It sounds like a bug somewhere.
>>>>
>>>>
>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>> angle between them). It's possible there are several of these, and so
>>>>> their 1.0 similarities dominate the result.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>> I saw something strange: all recommended items, returned by
>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>> Is it normal?
>>


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sean Owen <sr...@gmail.com>.
Yes, this is one of the weaknesses of this particular flavor of this
particular similarity metric. The more sparse, the worse the problem
is in general. There are some band-aid solutions like applying some
kind of weight against similarities based on small intersection size.
Or you can pretend that missing values are 0 (PreferenceInferrer),
which can introduce its own problems, or perhaps some mean value.
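
A rough sketch of the first band-aid (this is a plain helper function with a
made-up damping constant, not a Mahout API):

    // Damp a raw similarity by the number n of co-occurring entries:
    // with k = 50, a pair sharing 1 entry keeps about 2% of its score,
    // while a pair sharing 200 entries keeps 80%.
    static double dampedSimilarity(double rawSimilarity, int n) {
      double k = 50.0; // made-up constant; tune for your data
      return rawSimilarity * (n / (n + k));
    }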

On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <ya...@gmail.com> wrote:
> Thanks for replying.
>
> So, documents with only one word in common have a better chance of being similar
> than documents with more words in common, right?
>
>
>
> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>
>> Similar items, right? You should look at the vectors that have 1.0
>> similarity and see if they are in fact collinear. This is still by far
>> the most likely explanation. Remember that the vector similarity is
>> computed over elements that exist in both vectors only. They just have
>> to have 2 identical values for this to happen.
>>
>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>>>
>>> For each item, I have 10 recommended items with a value of 1.0.
>>> It sounds like a bug somewhere.
>>>
>>>
>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>
>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>> angle between them). It's possible there are several of these, and so
>>>> their 1.0 similarities dominate the result.
>>>>
>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>>>
>>>>> I saw something strange: all recommended items, returned by
>>>>> mostSimilarItems(), have a value of 1.0.
>>>>> Is it normal?
>
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
Thanks for replying.

So, documents with only one word in common have a better chance of being
similar than documents with more words in common, right?


On 10/01/2012 11:28 AM, Sean Owen wrote:
> Similar items, right? You should look at the vectors that have 1.0
> similarity and see if they are in fact collinear. This is still by far
> the most likely explanation. Remember that the vector similarity is
> computed over elements that exist in both vectors only. They just have
> to have 2 identical values for this to happen.
>
> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
>> For each item, I have 10 recommended items with a value of 1.0.
>> It sounds like a bug somewhere.
>>
>>
>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>> It's possible this is correct. 1.0 is the maximum similarity and
>>> occurs when two vectors are just a scalar multiple of each other (0
>>> angle between them). It's possible there are several of these, and so
>>> their 1.0 similarities dominate the result.
>>>
>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>> I saw something strange: all recommended items, returned by
>>>> mostSimilarItems(), have a value of 1.0.
>>>> Is it normal?


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sean Owen <sr...@gmail.com>.
Similar items, right? You should look at the vectors that have 1.0
similarity and see if they are in fact collinear. This is still by far
the most likely explanation. Remember that the vector similarity is
computed over elements that exist in both vectors only. They just have
to have 2 identical values for this to happen.
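
For instance (made-up numbers): if the only elements present in both vectors
are (4, 7) and (4, 7), the similarity is
(4*4 + 7*7) / (sqrt(4*4 + 7*7) * sqrt(4*4 + 7*7)) = 65 / 65 = 1.0.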

On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <ya...@gmail.com> wrote:
> For each item, I have 10 recommended items with a value of 1.0.
> It sounds like a bug somewhere.
>
>
> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>
>> It's possible this is correct. 1.0 is the maximum similarity and
>> occurs when two vectors are just a scalar multiple of each other (0
>> angle between them). It's possible there are several of these, and so
>> their 1.0 similarities dominate the result.
>>
>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>>>
>>> I saw something strange: all recommended items, returned by
>>> mostSimilarItems(), have a value of 1.0.
>>> Is it normal?

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
For each item, I have 10 recommended items with a value of 1.0.
It sounds like a bug somewhere.

On 10/01/2012 11:06 AM, Sean Owen wrote:
> It's possible this is correct. 1.0 is the maximum similarity and
> occurs when two vectors are just a scalar multiple of each other (0
> angle between them). It's possible there are several of these, and so
> their 1.0 similarities dominate the result.
>
> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
>> I saw something strange: all recommended items, returned by
>> mostSimilarItems(), have a value of 1.0.
>> Is it normal?
>>
>>
>> On 10/01/2012 10:39 AM, Sean Owen wrote:
>>> This is probably because the Hadoop job does some sampling and pruning
>>> whereas the non-Hadoop implementation generally doesn't.
>>


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sean Owen <sr...@gmail.com>.
It's possible this is correct. 1.0 is the maximum similarity and
occurs when two vectors are just a scalar multiple of each other (0
angle between them). It's possible there are several of these, and so
their 1.0 similarities dominate the result.
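
For instance (made-up numbers): (1, 2, 3) and (2, 4, 6) are scalar multiples
of each other, so their cosine is
(1*2 + 2*4 + 3*6) / (sqrt(1+4+9) * sqrt(4+16+36)) = 28 / 28 = 1.0.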

On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <ya...@gmail.com> wrote:
> I saw something strange: all recommended items, returned by
> mostSimilarItems(), have a value of 1.0.
> Is it normal?
>
>
> On 10/01/2012 10:39 AM, Sean Owen wrote:
>>
>> This is probably because the Hadoop job does some sampling and pruning
>> whereas the non-Hadoop implementation generally doesn't.
>
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
I saw something strange: all recommended items, returned by
mostSimilarItems(), have a value of 1.0.
Is it normal?

On 10/01/2012 10:39 AM, Sean Owen wrote:
> This is probably because the Hadoop job does some sampling and pruning
> whereas the non-Hadoop implementation generally doesn't.


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sean Owen <sr...@gmail.com>.
This is probably because the Hadoop job does some sampling and pruning
whereas the non-Hadoop implementation generally doesn't.

On Mon, Oct 1, 2012 at 9:38 AM, yamo93 <ya...@gmail.com> wrote:
> Because the results are not the same as with RowSimilarityJob and because
> similar documents have few words in common.
>
>
> On 10/01/2012 10:24 AM, Sean Owen wrote:
>>
>> There's not really any information here. Why do you think the result is
>> wrong?
>>
>> On Mon, Oct 1, 2012 at 9:09 AM, yamo93 <ya...@gmail.com> wrote:
>>>
>>> I tried your suggestion.
>>>
>>> I generated a CSV file with (term, docId, distance) and I used the method
>>> mostSimilarItems with UncenteredCosineSimilarity.
>>>
>>> But this seems to produce wrong results, and I don't understand why.
>>>
>>> Any idea?
>
>

Re: Need to reduce execution time of RowSimilarityJob

Posted by yamo93 <ya...@gmail.com>.
Because the results are not the same as with RowSimilarityJob and 
because similar documents have few words in common.

On 10/01/2012 10:24 AM, Sean Owen wrote:
> There's not really any information here. Why do you think the result is wrong?
>
> On Mon, Oct 1, 2012 at 9:09 AM, yamo93 <ya...@gmail.com> wrote:
>> I tried your suggestion.
>>
>> I generated a CSV file with (term, docId, distance) and I used the method
>> mostSimilarItems with UncenteredCosineSimilarity.
>>
>> But this seems to produce wrong results, and I don't understand why.
>>
>> Any idea?


Re: Need to reduce execution time of RowSimilarityJob

Posted by Sean Owen <sr...@gmail.com>.
There's not really any information here. Why do you think the result is wrong?

On Mon, Oct 1, 2012 at 9:09 AM, yamo93 <ya...@gmail.com> wrote:
> I tried your suggestion.
>
> I generated a CSV file with (term, docId, distance) and I used the method
> mostSimilarItems with UncenteredCosineSimilarity.
>
> But this seems to produce wrong results, and I don't understand why.
>
> Any idea?