Posted to user@mahout.apache.org by Greg H <gr...@gmail.com> on 2011/11/24 02:14:34 UTC

ItemSimilarityJob's results differ from non-distributed version

Hello,

I've been using Mahout's item-based recommender on several different
implicit datasets. At first I computed the item-item similarities by
passing an ItemSimilarity and a DataModel to the GenericItemSimilarity
class, but lately I've been using ItemSimilarityJob to calculate them on a
Hadoop cluster. However, I've found a significant difference in the
results of my experiments depending on which method I use to calculate
the similarities.

For example, when I use the public MovieLens 1M dataset (which I've
converted into an implicit dataset), calculating the similarities with:

ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel,
        new GenericItemSimilarity(new TanimotoCoefficientSimilarity(dataModel), dataModel));

gives 0.25 precision when splitting each user's data at a ratio of
80%/20% and then looking at only the top 5 recommended items. However,
when I compute the similarities with ItemSimilarityJob using the following
command:

hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output \
  --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
  --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true

I only get 0.23 precision. The results differ even more significantly on
some other datasets that I've been working with. I know that
ItemSimilarityJob prunes some items, so I've tried many different settings
for maxSimilaritiesPerItem and maxPrefsPerUser; although this improves the
results, it still never matches the non-distributed version. Shouldn't the
results be the same no matter which version I use to calculate the
similarities?

Thank you,
Greg

Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Greg,

Thank you for your time debugging this!

Maybe we should simply make TanimotoCoefficientSimilarity return
Double.NaN in case of no overlap?
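
The change would boil down to something like the sketch below. It uses
plain java.util sets rather than Mahout's actual internals, so treat it
only as an illustration of the intended behavior, not the real patch:

import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

  // Tanimoto coefficient over the sets of users who preferred each item:
  // |A ∩ B| / (|A| + |B| - |A ∩ B|)
  static double tanimoto(Set<Long> usersA, Set<Long> usersB) {
    Set<Long> common = new HashSet<Long>(usersA);
    common.retainAll(usersB);
    int overlap = common.size();
    if (overlap == 0) {
      // Proposed: report "undefined" instead of 0.0 when there is no
      // overlap, matching what lookups against ItemSimilarityJob's
      // pruned output effectively yield.
      return Double.NaN;
    }
    return (double) overlap / (usersA.size() + usersB.size() - overlap);
  }
}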

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
On 29.11.2011 11:52, Sean Owen wrote:
> Yeah the non-distributed implementation returns NaN in this case, which is
> a bit of an abuse, since it is defined to be 0. In practice I have always
> thought that is the right thing to do for consistency with other
> implementations, where no overlap means undefined similarity. You could
> argue it either way.

I created https://issues.apache.org/jira/browse/MAHOUT-902 for this. It
should make a nice starter issue to work on, so I labeled it
MAHOUT_INTRO_CONTRIBUTE.

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sean Owen <sr...@gmail.com>.
Yeah, the non-distributed implementation returns NaN in this case, which
is a bit of an abuse, since the coefficient is defined to be 0 there. In
practice I have always thought that is the right thing to do, for
consistency with the other implementations, where no overlap means
undefined similarity. You could argue it either way.


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Greg H <gr...@gmail.com>.
Sorry for taking so long to reply, but I think I found where the problem
is. After comparing the similarities more closely, I found that the
problem wasn't with ItemSimilarityJob. The difference in the number of
similarities being computed was simply because ItemSimilarityJob excludes
item-item similarities that equal zero. However, this causes a problem
when using GenericItemBasedRecommender to make recommendations.

The problem starts at line 345 of GenericItemBasedRecommender:

if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
  average.addDatum(estimate);
}

This works as expected if you use ItemSimilarityJob, because the estimate
for dissimilar items will be NaN, and that causes the final average to be
zero. However, if you use the non-distributed version to calculate the
similarities, the estimate for dissimilar items will be zero, which will
not necessarily cause the final average to be zero, thus still allowing
the item to be used even though it is not similar to all of the user's
preferred items. So either the non-distributed version needs to be
modified to not store similarities equal to zero, or the code above needs
to be changed to handle the case where the estimate is zero.
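
To make the effect concrete, here is a toy illustration with made-up
similarity values, run through the branch quoted above (assuming
excludeItemIfNotSimilarToAll is false):

import org.apache.mahout.cf.taste.impl.common.FullRunningAverage;
import org.apache.mahout.cf.taste.impl.common.RunningAverage;

public class AverageDemo {

  // Made-up similarities between one candidate item and a user's three
  // preferred items: one similar pair, two dissimilar pairs.
  static final double[] FROM_JOB = { 0.8, Double.NaN, Double.NaN }; // zeros pruned, lookups yield NaN
  static final double[] FROM_MEMORY = { 0.8, 0.0, 0.0 };            // zeros stored and returned

  public static void main(String[] args) {
    System.out.println(average(FROM_JOB));    // 0.8    -> NaNs are skipped
    System.out.println(average(FROM_MEMORY)); // 0.2666 -> zeros drag the score down
  }

  static double average(double[] estimates) {
    boolean excludeItemIfNotSimilarToAll = false;
    RunningAverage average = new FullRunningAverage();
    for (double estimate : estimates) {
      if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
        average.addDatum(estimate);
      }
    }
    return average.getAverage();
  }
}

The same candidate item ends up with two very different scores, which is
enough to change the rankings.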

After fixing this, calculating the similarities with the distributed and
non-distributed versions gives the same results in my experiments.

Thanks,
Greg


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Ted Dunning <te...@gmail.com>.
This is good advice, but in many cases the number of positive ratings far
outnumbers the number of negatives, so the negatives may have limited impact.

You absolutely should produce a histogram to see what kinds of ratings you
are getting and how many users are producing them.

You should also consider implicit feedback.  It tends to be much more
valuable than ratings, for two reasons:

1) there is usually 100x more of it

2) you generally care more about what people do than what they say.
Implicit feedback is based on what people do; ratings are what they say.
Inferring what people will do from what they say is harder than inferring
it from what they do.
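
A quick sketch of one way to produce such a histogram, assuming a
comma-separated userID,itemID,rating file (the file name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class RatingHistogram {
  public static void main(String[] args) throws IOException {
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader("ratings.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String rating = line.split(",")[2];   // third column holds the rating
      Integer old = counts.get(rating);
      counts.put(rating, old == null ? 1 : old + 1);
    }
    in.close();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}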


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Greg,

Can you give an example of two items that were similar in the
non-distributed case but did not appear in the distributed version?

A small tip on the side: for implicit data, you should also include
"negative" ratings, as those still contain a lot of information about the
user's taste and willingness to engage. There's no need to use only the 3+
ratings.

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Greg H <gr...@gmail.com>.
Hi Sebastian,

I converted the dataset by simply keeping all user/item pairs that had a
rating above 3. I'm also using GenericItemBasedRecommender's
mostSimilarItems method instead of the recommend method to make
recommendations.
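
The conversion amounts to something like the sketch below (the file names
here are made up; MovieLens 1M ships its ratings as "::"-separated
UserID::MovieID::Rating::Timestamp lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ImplicitConverter {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader("ml-1m/ratings.dat"));
    PrintWriter out = new PrintWriter(new FileWriter("ml1m.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("::");
      if (Integer.parseInt(parts[2]) > 3) {     // keep only ratings above 3
        out.println(parts[0] + "," + parts[1]); // boolean preference: user,item
      }
    }
    out.close();
    in.close();
  }
}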

I'm certainly open to suggestions on better evaluation metrics. I'm just
using the top 5 because it was easy to implement.
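
One option along those lines is Mahout's built-in precision/recall
evaluator. A rough sketch of how it might be wired up (the data file name
is made up, and its internal leave-some-out protocol isn't identical to a
per-user 80%/20% split):

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PrecisionAtFive {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ml1m.csv"));
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        return new GenericItemBasedRecommender(dataModel,
            new GenericItemSimilarity(new TanimotoCoefficientSimilarity(dataModel), dataModel));
      }
    };
    IRStatistics stats = new GenericRecommenderIRStatsEvaluator().evaluate(
        builder, null, model, null,
        5,                                                   // evaluate the top 5
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, // per-user relevance threshold
        1.0);                                                // use all users
    System.out.println("precision@5 = " + stats.getPrecision());
    System.out.println("recall@5    = " + stats.getRecall());
  }
}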

Thanks,
Greg


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Greg,

You should get the same results. Can you describe exactly how you
converted the dataset? I'd like to try this myself; maybe you found some
subtle bug.

I also have doubts about whether taking the precision of the top 5
recommended items is really a good quality measure.

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Greg H <gr...@gmail.com>.
Thanks for the replies, Sebastian and Sean. I looked at the similarity
values and they are the same, but ItemSimilarityJob is calculating fewer of
them, so it must still be doing some sort of sampling. I thought that I
could force it to use all of the data by setting maxPrefsPerUser
sufficiently large. Could there be another reason for it not to calculate
all of the similarity values?

I also tried using a smaller number of similarItemsPerItem, but this leads
to worse results.

Thanks again,
Greg

Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
Hmm, in the MovieLens 1M dataset the maximum number of ratings per user is
approximately 1,000, so with a maxPrefs=10000 setting there should be no
sampling involved. Sampling would also not explain the decline in quality
that Greg is seeing, as long as it is not too aggressive.

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sean Owen <sr...@gmail.com>.
Isn't that probably the difference, that the distributed job is sampling
the data? I would not expect an exact match.


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@googlemail.com>.
Another remark: using a small number of similarItemsPerItem (say 30-50)
should give better results than using all of them.

--sebastian


Re: ItemSimilarityJob's results differ from non-distributed version

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Greg,

Do you see a difference in the actual similarity values that are computed?

--sebastian
