Posted to user@mahout.apache.org by Davide Pozza <da...@gmail.com> on 2012/09/21 10:19:53 UTC

Correct way for merging different data sources

Dear all
this is probably a newbie question...

From a typical e-commerce scenario I can obtain the following kinds of data,
which can be used for recommending products:

1) Users bought items - without ratings (csv format: USER_ID,ITEM_ID)
2) User viewed items (csv format: USER_ID,ITEM_ID, RATING) where RATING
could represent the number of views
3) User likes (csv format: USER_ID,ITEM_ID, RATING) where RATING is a
number from 1 to 5
4) User wishlist - without ratings (csv format: USER_ID,ITEM_ID)

My question is: what is the right way to build my recommendations using
all of this available information, in order to show a generic section
"Other items you might be interested in"?

I suppose I should create a separate recommender for each kind of data and
then merge their results (the resulting score for a single recommended item
would be the sum of the scores assigned by the individual recommenders). Is
this the right way?

Thanks!

-- 
Davide Pozza

Re: Correct way for merging different data sources

Posted by Sean Owen <sr...@gmail.com>.

Agree. These different data points, intuitively, should be combined
and not treated as separate. I am pretty certain that splitting,
recommending independently, and recombining will yield a result that
is less than the sum of its parts, literally.

They are not ratings individually, but you could invent some scale
where a view is "0.1" and a purchase of $20 is a "20" or something.
Then the reasonable thing to do is sum them up over your data set to
get a user-item "strength" score. That can be used like a rating or
preference value, and you can put it through a rating-based recommender.
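
As a rough illustration of the summing step just described, here is a minimal Python sketch. The function name, the input tuple layouts, and the sample data are hypothetical; the 0.1-per-view weight and the use of the purchase price as the purchase score simply follow the example figures above and would need tuning on real data.

```python
from collections import defaultdict

# Assumed weight: each view counts 0.1, as in the example scale above.
VIEW_WEIGHT = 0.1


def strength_scores(views, purchases):
    """Sum all signals into one strength value per (user, item) pair.

    views:     iterable of (user_id, item_id, view_count)
    purchases: iterable of (user_id, item_id, price_in_dollars)
    """
    scores = defaultdict(float)
    for user, item, count in views:
        scores[(user, item)] += VIEW_WEIGHT * count
    for user, item, price in purchases:
        # Assumption: a purchase contributes its price, e.g. $20 -> 20.
        scores[(user, item)] += price
    return dict(scores)


# Hypothetical sample data.
views = [(1, 100, 3), (1, 101, 1), (2, 100, 5)]
purchases = [(1, 100, 20.0)]
scores = strength_scores(views, purchases)
```

Each resulting (user, item, strength) triple can then be written out as a USER_ID,ITEM_ID,RATING line and fed to a rating-based recommender.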

The result will probably be OK. However, this kind of input starts to
mismatch the assumptions that some of the common similarity metrics make,
like Pearson correlation.

If this is your input model, I strongly suggest you have a look at this paper:
http://www2.research.att.com/~yifanhu/PUB/cf.pdf

This is a simple model that treats the input not as ratings to be
predicted, but as weights in an approximation process. Heavily
preferred items are heavily weighted such that the model really tries
to predict that the user and item are connected. It's simple, almost
simplistic, but suits the nature of this kind of input better.
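
To make the "weights, not ratings" idea concrete, here is a small Python sketch of the input transformation as I understand it from the paper: each strength value r becomes a binary preference p (1 if r > 0, else 0) and a confidence c = 1 + alpha * r, and the model is fit by minimizing the confidence-weighted squared error between p and the predicted user-item score. The function names are hypothetical, and alpha is a tuning parameter whose value here is arbitrary.

```python
def to_preference_and_confidence(strength, alpha=2.0):
    """Map a raw strength value to (preference, confidence).

    preference: 1.0 if any interaction was observed, else 0.0
    confidence: grows linearly with the strength; unobserved pairs
                still get a minimal confidence of 1.0
    """
    preference = 1.0 if strength > 0 else 0.0
    confidence = 1.0 + alpha * strength
    return preference, confidence


def weighted_squared_error(strength, predicted, alpha=2.0):
    """Contribution of one user-item pair to the loss (without regularization)."""
    preference, confidence = to_preference_and_confidence(strength, alpha)
    return confidence * (preference - predicted) ** 2


# A pair with strength 3 is pushed toward a prediction of 1 with weight 7.
pair = to_preference_and_confidence(3.0)
unseen = to_preference_and_confidence(0.0)
loss = weighted_squared_error(3.0, predicted=0.5)
```

Note that the strength never appears as the target value: a heavily weighted pair only makes the model try harder to predict 1 for it.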

And it's a common type of input, much more than 'real' ratings. These
are among the reasons that this is what is implemented in Myrrix
(myrrix.com), which you may also want to experiment with. This
approach exists in Mahout too under the name "Parallel ALS", though
you'd have to modify it a bit to transform the input in the way
described in the paper.

Sean

On Fri, Sep 21, 2012 at 9:52 AM, Julian Ortega <jo...@gmail.com> wrote:
> One could argue that the rating is really just an indication of how strong
> the user's preference for the item is, so the stronger the preference, the
> higher the rating value should be.
>
> For instance, you could say that a purchase is the strongest indication of
> preference and give it a value of 10. Then you could say that adding to
> the wishlist is your second-strongest indicator of preference and give
> that a value of 5. A view would be the weakest indication, and you could
> give that a value of, say, 1. I wouldn't know how to go about representing
> the likes, since they already have their own scale, but if you just had a
> list of people who liked certain items (kind of like Facebook, where you
> either liked something or didn't do anything), you could say that this
> indicator of preference has a value of 3.
>
> Those are just example values; you would need to determine how much
> stronger you want the different indicators of preference to be in relation
> to one another.
>
> Cheers

Re: Correct way for merging different data sources

Posted by Julian Ortega <jo...@gmail.com>.

One could argue that the rating is really just an indication of how strong
the user's preference for the item is, so the stronger the preference, the
higher the rating value should be.

For instance, you could say that a purchase is the strongest indication of
preference and give it a value of 10. Then you could say that adding to
the wishlist is your second-strongest indicator of preference and give
that a value of 5. A view would be the weakest indication, and you could
give that a value of, say, 1. I wouldn't know how to go about representing
the likes, since they already have their own scale, but if you just had a
list of people who liked certain items (kind of like Facebook, where you
either liked something or didn't do anything), you could say that this
indicator of preference has a value of 3.

Those are just example values; you would need to determine how much
stronger you want the different indicators of preference to be in relation
to one another.
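
A minimal Python sketch of this kind of mapping, using the example values above, might look like the following. The function name, the event tuple layout, and the sample events are hypothetical. The message does not say how to combine several indicators for the same user and item, so keeping only the strongest one is an assumption made here for illustration; summing them would be another option.

```python
# Example values from the message above; they are placeholders to be tuned.
INDICATOR_WEIGHTS = {"purchase": 10.0, "wishlist": 5.0, "like": 3.0, "view": 1.0}


def to_preference_rows(events):
    """Turn (user_id, item_id, indicator) events into USER_ID,ITEM_ID,RATING rows.

    Assumption: when a user has several indicators for the same item,
    only the strongest one is kept.
    """
    best = {}
    for user, item, indicator in events:
        weight = INDICATOR_WEIGHTS[indicator]
        key = (user, item)
        best[key] = max(best.get(key, 0.0), weight)
    return [f"{user},{item},{value}" for (user, item), value in sorted(best.items())]


# Hypothetical sample events.
events = [
    (1, 100, "view"),
    (1, 100, "purchase"),
    (2, 100, "wishlist"),
    (2, 101, "like"),
]
rows = to_preference_rows(events)
```

The rows come out in the same USER_ID,ITEM_ID,RATING csv layout as the rated data sources in the original question.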

Cheers
