Posted to user@spark.apache.org by Hiroyuki Yamada <mo...@gmail.com> on 2016/02/25 12:20:29 UTC

Which is a more appropriate form of ratings?

Hello.

I just started working on collaborative filtering (CF) in MLlib.
I am using trainImplicit because I only have implicit ratings such as page
views.

I am wondering which form of ratings is more appropriate.
Let's assume that the view count is treated as the rating, and that
user 1 views page 1 three times, page 2 twice, and so on.

In this case, I think the ratings can be formatted in either of the following
two ways (in practice the data is an RDD, of course).

A:
user_id,page_id,rating(page view)
1,1,0.3
1,2,0.2
...

B:
user_id,page_id,rating(page view)
1,1,0.1
1,1,0.1
1,1,0.1
1,2,0.1
1,2,0.1
...

Is a format like B allowed?
If it is, which is better? (Is there any difference between them?)

Best,
Hiro

Re: Which is a more appropriate form of ratings?

Posted by Hiroyuki Yamada <mo...@gmail.com>.

Thanks very much, Nick and Sabarish.
That helps me a lot.

Regards,
Hiro


Re: Which is a more appropriate form of ratings?

Posted by Nick Pentreath <ni...@gmail.com>.

Yes, ALS requires the aggregated version (A). You can use decimal or whole
numbers for the rating, depending on your application, because for implicit
data the values are not "ratings" but rather "weights".
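
For illustration, format B can be collapsed into format A by summing the
weights of all rows that share the same (user, page) pair. The sketch below
uses plain Python so that it runs without a cluster; the function name
aggregate_ratings is hypothetical, and whole-number weights are assumed. On an
RDD keyed by (user, page), the same step would typically be a reduceByKey with
addition before the result is passed to trainImplicit.

```python
from collections import defaultdict


def aggregate_ratings(rows):
    """Sum the weights of rows sharing the same (user_id, page_id) pair.

    `rows` is an iterable of (user_id, page_id, weight) tuples (format B).
    The result has one tuple per pair (format A), sorted for stable output.
    """
    totals = defaultdict(float)
    for user_id, page_id, weight in rows:
        totals[(user_id, page_id)] += weight
    return sorted((u, p, w) for (u, p), w in totals.items())


# Format B from the question, with an assumed weight of 1.0 per page view.
format_b = [(1, 1, 1.0), (1, 1, 1.0), (1, 1, 1.0), (1, 2, 1.0), (1, 2, 1.0)]
format_a = aggregate_ratings(format_b)
print(format_a)  # [(1, 1, 3.0), (1, 2, 2.0)]
```

Each (user, page) pair now appears exactly once, and its weight is the total
view count.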

A common approach is to apply different weights to different user events
(such as 1.0 for a page view, 5.0 for a purchase, 2.0 for a like, etc.).
That allows all user event data to be aggregated together in a fairly
principled manner. The weights, however, need to be specified up front in
order to do that aggregation (they could be selected via cross-validation,
domain knowledge, or the relative frequency of each event within a dataset,
for example).
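
As a rough sketch of this weighting scheme, the snippet below maps each event
type to a fixed weight and sums the weights per (user, item) pair. The names
EVENT_WEIGHTS and weighted_ratings and the event labels are hypothetical; the
weight values are simply the examples given above, not recommended settings.

```python
from collections import defaultdict

# Hypothetical per-event weights, taken from the examples in this message.
EVENT_WEIGHTS = {"view": 1.0, "like": 2.0, "purchase": 5.0}


def weighted_ratings(events, weights=EVENT_WEIGHTS):
    """Aggregate (user_id, item_id, event_type) events into weighted rows.

    Returns one (user_id, item_id, total_weight) tuple per pair, sorted.
    An event type missing from `weights` raises KeyError.
    """
    totals = defaultdict(float)
    for user_id, item_id, event_type in events:
        totals[(user_id, item_id)] += weights[event_type]
    return sorted((u, i, w) for (u, i), w in totals.items())


events = [
    (1, 1, "view"), (1, 1, "view"), (1, 1, "purchase"),
    (1, 2, "view"), (1, 2, "like"),
]
ratings = weighted_ratings(events)
print(ratings)  # [(1, 1, 7.0), (1, 2, 3.0)]
```

Because the weights are fixed before aggregation, changing them (for example
during cross-validation) means re-running the aggregation on the raw events.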


Re: Which is a more appropriate form of ratings?

Posted by Sabarish Sasidharan <sa...@gmail.com>.

I believe the ALS algorithm expects the ratings to be aggregated (A). I don't
see why you have to use decimals for the rating.

Regards
Sab
