Posted to user@mahout.apache.org by Varnit Khanna <va...@gmail.com> on 2011/05/20 19:31:44 UTC

taking mahout into production

Hi,
I have been considering using Mahout for our recommendation engine
needs and have a couple of questions about using it in production.

Use Case:
We need to provide recommendations on video assets (similar to Hulu) to
a couple of million users, and we have over 100K assets. Since we are
experiencing growth in both users and assets, I am planning to use
Mahout on Hadoop.

Preference Data:
Currently we do not have a ratings system built into our video
player/page but we do have logs on user impressions on video assets
which I will be feeding into RecommenderJob. Until we build a ratings
system I am planning on using the following preference data:

Impressions | Rating
          1 | (empty)
          2 | 2
          3 | 3
          4 | 4
        >=5 | 5

Does this preference data make sense? I will be using the standard
RecommenderJob to generate recommendations until I get a better
understanding of Mahout.
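
For readers who want to try the mapping above, here is a minimal sketch in
Python. It assumes the impression log has already been reduced to
(user_id, item_id) events and that RecommenderJob is fed comma-separated
userID,itemID,rating lines. The function names are hypothetical, and reading
"(empty)" as "write no preference line at all" is an assumption.

```python
import csv
import io
from collections import Counter


def rating_from_impressions(count):
    """Map an impression count to a rating per the table in the post.

    Returns None for a single impression, which the table marks "(empty)".
    Treating that as "emit no preference" is an assumption.
    """
    if count < 2:
        return None
    return min(count, 5)


def to_preference_csv(impression_log):
    """Turn (user_id, item_id) impression events into userID,itemID,rating rows."""
    counts = Counter(impression_log)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for (user_id, item_id), n in sorted(counts.items()):
        rating = rating_from_impressions(n)
        if rating is not None:
            writer.writerow([user_id, item_id, rating])
    return out.getvalue()


# Toy log: user 1 saw item 101 twice, user 2 saw item 103 seven times.
log = [(1, 101), (1, 101), (1, 102), (2, 101)] + [(2, 103)] * 7
print(to_preference_csv(log), end="")
```

Note that under this reading, every user/item pair with exactly one
impression drops out of the input entirely.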

Questions:
1) What will be the best approach to deal with cold start on new
assets and users?
2) Is it typical to parse the entire dataset in production to generate
recommendations for new assets and users or can it be done
incrementally?
3) Which is the better approach for this use case: item-based or
user-based CF? Also, at some point in the future we would like to generate
recommendations on news assets, so a single system might be beneficial.

Thanks
-varnit

Re: taking mahout into production

Posted by Lance Norskog <go...@gmail.com>.

For using Mahout in production you need a feedback loop. The
implementers are drawn to sexy things like great algorithms, and can
print out a bunch of numbers and say, "ok, that looks right". I keep
hacking up ways to interpret and view what Mahout spits out, and I'm
not happy with any of them.

On Fri, May 20, 2011 at 7:11 PM, Ted Dunning <te...@gmail.com> wrote:
> Also, from a practical point of view, people rarely watch videos repeatedly,
> even if they like them and want to see more.
>
> (people - excluding two year olds who will watch something they like until
> it wears out)
>
> On Fri, May 20, 2011 at 7:04 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I agree that ratings contain relatively little data. Here you're not using
>> direct ratings, but inferring some notion of rating from impressions. Does
>> your scheme make sense? It's not illogical but not one I would choose. To
>> me, there is the most "information" in the jump from 0 impressions to 1.
>> There is a universe of things you don't look at; the fact that you look at
>> something at all is much more significant. Looking at something 2, 3, 10,
>> 100 times from there means something more, but not much more in comparison.
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: taking mahout into production

Posted by Lance Norskog <go...@gmail.com>.

The existence of a rating, no matter what it is, generates an
emotional engagement. "2.7? What idiots hate this? The kitten is a
genius!".

When I was involved in such a system, I wanted to randomly generate
ratings. There is no SLA in a consumer site where you watch videos for
free. You might get away with this if you only do a random sample of
your quality videos.

On Sat, May 21, 2011 at 12:37 PM, Ted Dunning <te...@gmail.com> wrote:
> I bet the name becomes very appropriate very quickly.
>
> The other category of repeated viewing is click-spamming.  They are very
> much worth ignoring as well.
>
> In any case, I have found that it is very important to almost entirely
> ignore the number of times that somebody interacts with a media item (music
> and video are what I have worked on) and instead look at the number of users
> who have done such interactions.  Recommendation quality goes up
> substantially with this step.
>
> On Sat, May 21, 2011 at 4:19 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>
>> On May 20, 2011, at 10:11 PM, Ted Dunning wrote:
>>
>> > Also, from a practical point of view, people rarely watch videos
>> repeatedly,
>> > even if they like them and want to see more.
>> >
>> > (people - excluding two year olds who will watch something they like
>> until
>> > it wears out)
>>
>> I would extend that from 2 y.o. to about 18 y.o, but for sure at least 9
>> y.o. based on first hand experience, esp. w/ ones that are popular in their
>> peer group.  I think every time my son has a friend over, he shows them the
>> "Annoying Orange" on YouTube ("Hey Apple!")
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: taking mahout into production

Posted by Ted Dunning <te...@gmail.com>.

I bet the name becomes very appropriate very quickly.

The other category of repeated viewing is click-spamming.  They are very
much worth ignoring as well.

In any case, I have found that it is very important to almost entirely
ignore the number of times that somebody interacts with a media item (music
and video are what I have worked on) and instead look at the number of users
who have done such interactions.  Recommendation quality goes up
substantially with this step.
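
One way to read the advice above is to collapse repeated interactions so
that each user counts once per item. The sketch below is only an
illustration of that idea in Python. The event layout and the function name
are assumptions, not anything from Mahout.

```python
from collections import defaultdict


def distinct_users_per_item(events):
    """Count how many different users touched each item, ignoring repeats."""
    users_by_item = defaultdict(set)
    for user_id, item_id in events:
        users_by_item[item_id].add(user_id)
    return {item: len(users) for item, users in users_by_item.items()}


# u1 replays v1 fifty times (or is click-spamming); it still counts once.
events = [("u1", "v1")] * 50 + [
    ("u2", "v1"),
    ("u2", "v2"),
    ("u3", "v2"),
    ("u4", "v2"),
]
counts = distinct_users_per_item(events)
print(counts)
```

With raw interaction counts v1 would look far more popular than v2. With
distinct users, v2 comes out ahead.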

On Sat, May 21, 2011 at 4:19 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On May 20, 2011, at 10:11 PM, Ted Dunning wrote:
>
> > Also, from a practical point of view, people rarely watch videos
> repeatedly,
> > even if they like them and want to see more.
> >
> > (people - excluding two year olds who will watch something they like
> until
> > it wears out)
>
> I would extend that from 2 y.o. to about 18 y.o, but for sure at least 9
> y.o. based on first hand experience, esp. w/ ones that are popular in their
> peer group.  I think every time my son has a friend over, he shows them the
> "Annoying Orange" on YouTube ("Hey Apple!")
>
>

Re: taking mahout into production

Posted by Grant Ingersoll <gs...@apache.org>.

On May 20, 2011, at 10:11 PM, Ted Dunning wrote:

> Also, from a practical point of view, people rarely watch videos repeatedly,
> even if they like them and want to see more.
> 
> (people - excluding two year olds who will watch something they like until
> it wears out)

I would extend that from 2 y.o. to about 18 y.o, but for sure at least 9 y.o. based on first hand experience, esp. w/ ones that are popular in their peer group.  I think every time my son has a friend over, he shows them the "Annoying Orange" on YouTube ("Hey Apple!")

Re: taking mahout into production

Posted by Ted Dunning <te...@gmail.com>.

Also, from a practical point of view, people rarely watch videos repeatedly,
even if they like them and want to see more.

(people - excluding two year olds who will watch something they like until
it wears out)

On Fri, May 20, 2011 at 7:04 PM, Sean Owen <sr...@gmail.com> wrote:

> I agree that ratings contain relatively little data. Here you're not using
> direct ratings, but inferring some notion of rating from impressions. Does
> your scheme make sense? It's not illogical but not one I would choose. To
> me, there is the most "information" in the jump from 0 impressions to 1.
> There is a universe of things you don't look at; the fact that you look at
> something at all is much more significant. Looking at something 2, 3, 10,
> 100 times from there means something more, but not much more in comparison.
>

Re: taking mahout into production

Posted by Sean Owen <sr...@gmail.com>.

I agree that ratings contain relatively little data. Here you're not using
direct ratings, but inferring some notion of rating from impressions. Does
your scheme make sense? It's not illogical but not one I would choose. To
me, there is the most "information" in the jump from 0 impressions to 1.
There is a universe of things you don't look at; the fact that you look at
something at all is much more significant. Looking at something 2, 3, 10,
100 times from there means something more, but not much more in comparison.

So, I might suggest using log_2(impressions) or something similar as a
starting point. But I also might try ignoring the impression count itself
entirely.
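
To make the two options concrete, here is a small Python sketch. Since
log_2(1) is 0, a plain logarithm would give a single impression no weight at
all, so this sketch uses 1 + log_2(impressions) as the "something similar".
That shift, and both function names, are assumptions for illustration only.

```python
import math


def log_preference(impressions):
    """Damped preference: the first impression counts most, later ones less."""
    if impressions < 1:
        return 0.0
    return 1.0 + math.log2(impressions)


def binary_preference(impressions):
    """Ignore the count entirely: seen at least once, or not seen."""
    return 1.0 if impressions >= 1 else 0.0


for n in (0, 1, 2, 10, 100):
    print(n, round(log_preference(n), 2), binary_preference(n))
```

Either way, going from 0 to 1 impressions changes the value more than any
later doubling does.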

Cold start: before you have any information at all about the user, there's
not much you can do but recommend some canned, fixed list of top items.
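
A minimal sketch of that fallback in Python, assuming a precomputed list of
top items and some personalized recommender behind a callable. All names
here (recommend, top_items, fake_personalized) are hypothetical.

```python
def recommend(user_id, history, top_items, personalized, n=3):
    """Serve the fixed top list until the user has any recorded history."""
    seen = history.get(user_id, set())
    if not seen:
        return top_items[:n]
    return personalized(user_id, n)


top_items = ["v7", "v2", "v9", "v4"]   # canned list, e.g. refreshed daily
history = {"alice": {"v2", "v4"}}      # items each known user has viewed


def fake_personalized(user_id, n):
    # Stand-in for a real recommender lookup.
    return ["v9", "v7"][:n]


print(recommend("bob", history, top_items, fake_personalized))
print(recommend("alice", history, top_items, fake_personalized))
```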

What do you mean by "parse the entire dataset"? Yes, it's normal to actually
use all your data. No, it's not at all a good idea to read it all every time
you do anything.

I think a recommender based on item-item similarity sounds like a better
starting point here, though either approach might have merit. You can
conceivably use user-user similarities from this domain to create
recommendations in another domain, yes.
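
As a toy illustration of the item-item idea, the Python sketch below
computes a Jaccard (Tanimoto) similarity between items from binary "user
viewed item" data and scores unseen items by summing their similarity to
what the user has already seen. The choice of similarity and scoring is an
assumption made for the example, not a description of what RecommenderJob
does internally, and the all-pairs loop is only suitable for small data.

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(user_items):
    """Jaccard (Tanimoto) similarity for item pairs sharing at least one user."""
    users_by_item = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            users_by_item[item].add(user)
    sims = defaultdict(dict)
    for a, b in combinations(sorted(users_by_item), 2):
        both = len(users_by_item[a] & users_by_item[b])
        if both:
            either = len(users_by_item[a] | users_by_item[b])
            sims[a][b] = sims[b][a] = both / either
    return sims


def recommend(user, user_items, sims, n=2):
    """Rank unseen items by summed similarity to the user's seen items."""
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims[item].items():
            if other not in seen:
                scores[other] += s
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"a"},
}
sims = item_similarities(user_items)
print(recommend("u4", user_items, sims))
```

The similarities depend only on item co-occurrence, so they can be computed
offline in batch and looked up cheaply at request time.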


Re: taking mahout into production

Posted by Sebastian Schelter <ss...@apache.org>.

I recently published an article on my blog at http://ssc.io that deals with
scaling recommender systems. I'm sure it has some ideas you could adapt.

--sebastian
On 20.05.2011 at 20:02, "Ted Dunning" <te...@gmail.com> wrote:
> Sean will be able to address scaling and configuration better than I, but I
> have built video recommendation systems before and found that
>
> a) ratings are nearly worthless, largely because so few people will rate
> things
>
> b) the best preference data we ever found was whether the user viewed the
> asset longer than 30 seconds. This is a binary preference and it helps to
> have it that way since you can make use of a number of economies.
>
> c) some randomization in recommendations is very important so that you
> preserve some exploratory behavior. I implemented this by adding small
> amounts of noise to recommendation scores to perturb the ranking.
>

Re: taking mahout into production

Posted by Ted Dunning <te...@gmail.com>.

Sean will be able to address scaling and configuration better than I, but I
have built video recommendation systems before and found that

a) ratings are nearly worthless, largely because so few people will rate
things

b) the best preference data we ever found was whether the user viewed the
asset longer than 30 seconds.  This is a binary preference and it helps to
have it that way since you can make use of a number of economies.

c) some randomization in recommendations is very important so that you
preserve some exploratory behavior.  I implemented this by adding small
amounts of noise to recommendation scores to perturb the ranking.
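
Points (b) and (c) can be sketched in a few lines of Python. The 30-second
threshold comes from the message above. The use of Gaussian noise, the
noise_scale value, and the function names are assumptions for illustration,
since the message does not say how the noise was drawn.

```python
import random

WATCH_THRESHOLD_SECONDS = 30


def liked(view_seconds):
    """Binary preference: did the user watch longer than the threshold?"""
    return view_seconds > WATCH_THRESHOLD_SECONDS


def perturb_ranking(scores, noise_scale=0.05, rng=None):
    """Add small random noise to each score, then rank by the noisy score."""
    rng = rng or random.Random()
    noisy = {item: s + rng.gauss(0.0, noise_scale) for item, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)


scores = {"a": 0.90, "b": 0.52, "c": 0.50, "d": 0.10}
print(perturb_ranking(scores, noise_scale=0.05, rng=random.Random(42)))
```

With a small noise_scale, items with nearly equal scores (b and c here) swap
places now and then, while clearly better or worse items mostly keep their
positions.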
