Posted to user@mahout.apache.org by Matt Mitchell <go...@gmail.com> on 2012/08/03 23:16:47 UTC

question about distributed recommendations

Hi,

I have a pretty neat, non-distributed recommender running here. In
doing some math on new-user growth rate, I thought it might be wise to
turn to a distributed approach now rather than later. Just to make sure
I'm on the right track: I calculated 1.5 GB for in-memory prefs. Is
that pushing it for a single-machine recommender?

Am I right in thinking that the distributed approach is a single
algorithm, unlike the many possible choices for non-distributed? Is it
possible to inject content-aware logic within a distributed
recommender? Would I inject that into the map/reduce phase, when the
recommendations are generated? I'm curious to know if there are any
examples out there.

Thanks!

Re: question about distributed recommendations

Posted by Sean Owen <sr...@gmail.com>.
Yes, or anywhere else you want to publish static results to, if you don't
want to expose HDFS. HDFS isn't good at small random reads, so it would be
a question of bulk-loading shards of results. The MapReduce workers are not
relevant to serving. They would have produced the results, offline, at some
point in the past and the results would be published. HDFS's redundancy can
handle issues of node failure.

*Hadoop* is utterly unsuitable for anything real-time. The idea is to
compute results in batch and serve them after it's done. *HDFS* is probably
OK to serve static data if you are loading chunks at a time, on demand. I
am not sure this is a great architecture but it will work OK, certainly to
moderate scale.
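
To make the bulk-loading idea concrete, here's a minimal sketch,
assuming the batch job wrote plain-text output shards of tab-separated
"userID<TAB>recommendations" lines -- the path and line format below
are placeholders, not anything Mahout guarantees:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecCache {

      private final Map<Long,String> recsByUser =
          new HashMap<Long,String>();

      // Bulk-load every output shard once, at startup or on a
      // refresh schedule -- never small random reads against HDFS.
      public void load(String dir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus shard : fs.listStatus(new Path(dir))) {
          if (!shard.getPath().getName().startsWith("part-")) {
            continue; // skip _SUCCESS, _logs, etc.
          }
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(shard.getPath())));
          String line;
          while ((line = in.readLine()) != null) {
            int tab = line.indexOf('\t');
            recsByUser.put(Long.valueOf(line.substring(0, tab)),
                           line.substring(tab + 1));
          }
          in.close();
        }
      }

      // Serving is a pure in-memory lookup.
      public String recsFor(long userID) {
        return recsByUser.get(userID);
      }
    }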

If you're interested in more on what I'm doing, let's follow up off list,
but start here:

http://myrrix.com/design/
http://code.google.com/p/myrrix-recommender/

Re: question about distributed recommendations

Posted by Matt Mitchell <go...@gmail.com>.
So, the front-end machine would need access to HDFS, and then query the
system in real-time? Each of the map-reduce nodes would need to be up
and running to produce results, right? Also, what happens if one of the
nodes goes down for some reason?

I haven't spent a lot of time with Hadoop, but I'm curious about the
performance/latency there vs. having the data in memory.

My system is really only a prototype and mainly a way for me to learn,
but Myrrix still looks interesting! I'd love to look under the hood and
see what you've got going :)

Thanks!
- Matt

Re: question about distributed recommendations

Posted by Matt Mitchell <go...@gmail.com>.
Hi Saikat! Thanks for the details on what you're doing. So your system
does not use Mahout when returning recommendations (db instead), but
it uses the similarity data Mahout generated? Interesting. Is your
key/value db running on your app instance, or is it remote? How many
user preferences and items do you have?

- Matt

RE: question about distributed recommendations

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Matt, I'm also deep in the midst of building out such a system.
Basically, I have most of a system in place that:

1) replenishes user ratings data directly into HDFS from analytics
2) performs Mahout item-similarity computations on this data and stores
   the result back into HDFS
3) uses Hive to transform the results of step 2 into a real-time,
   low-latency key/value database, in this case Cassandra
4) leverages a REST-based web service to query that database and serve
   up results into a UI

I am currently working on the second piece, which includes
clustering/classifying that data based on a set of dynamic features.
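
Step 2 is essentially a thin driver around Mahout's ItemSimilarityJob.
Roughly (the paths here are placeholders, not our real ones):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

    public class SimilarityStep {
      public static void main(String[] args) throws Exception {
        // Reads userID,itemID,rating lines from HDFS and writes
        // itemA,itemB,similarity triples back to HDFS for the Hive step.
        ToolRunner.run(new ItemSimilarityJob(), new String[] {
            "--input", "/ratings/current",
            "--output", "/similarities/latest",
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD"
        });
      }
    }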

Re: question about distributed recommendations

Posted by Sean Owen <sr...@gmail.com>.
Good question. One straightforward way to approach things is to
compute all recommendations offline, in batch, and publish them to some
location, and then simply read them as needed. Yes, your front-end would
need to access HDFS if the data were on HDFS. The downside is that you
can't update in real-time, and you spend CPU computing recs that may
never be needed.

The online implementations you've been playing with don't have those two
problems, but they have scale issues at some point.

But, I think one of these two approaches is probably 'just fine' for 80% of
use cases.


If not, the 'real' answer is a hybrid solution, using Hadoop to do periodic
model recomputation, offline, and using front-ends to do (at least
approximate) real-time updates and computation. This sort of system is what
I'm trying to build with Myrrix (myrrix.com), which you may be interested
in if you have this kind of problem.
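
The batch step I'm describing is essentially Mahout's distributed
RecommenderJob. A sketch of a driver, with placeholder paths and an
arbitrary similarity choice:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class NightlyRecs {
      public static void main(String[] args) throws Exception {
        // One MapReduce pipeline: preference triples in,
        // top-10 recommendations per user out.
        ToolRunner.run(new RecommenderJob(), new String[] {
            "--input", "/prefs/current",
            "--output", "/recs/latest",
            "--numRecommendations", "10",
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD"
        });
        // Publishing /recs/latest to wherever the front-end reads from
        // is a separate step; nothing here happens in real time.
      }
    }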


Re: question about distributed recommendations

Posted by Matt Mitchell <go...@gmail.com>.
Thanks Sean, that makes sense. I'll look into the source and see if I
can learn more.

Another question: I understand how the recommendations are created, and
I'd like to wrap this all up as a web service, but I'm not sure I
understand how one would go about doing that. How would an app fetch
recommendations for a user? Does my app need access to the HDFS file
system?

Thanks again.

Re: question about distributed recommendations

Posted by Sean Owen <sr...@gmail.com>.
Is it reasonable to use 1.5GB of heap for recs? Sure -- assuming you can
allow the JVM to use, say, 2GB or more of heap total.

There are more choices in Mahout for non-distributed recs. The primary
distributed version is an item-similarity-based approach, but you can
choose from several similarity metrics. There is also a
matrix-factorization-based approach in there.

You can make the distributed version do anything you want; it depends on
how much code you want to write. You'd have to replace the similarity
computation bits with your own logic, yes.
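
For reference, the non-distributed case is only a few lines of the
Taste API. A sketch -- the file name and similarity choice here are
placeholders, and a custom ItemSimilarity implementation is where
content-aware logic would plug in:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class SingleMachineRecs {
      public static void main(String[] args) throws Exception {
        // Run with e.g. "java -Xmx2g" so ~1.5GB of prefs fits comfortably.
        DataModel model =
            new FileDataModel(new File("prefs.csv")); // userID,itemID,value
        // Swap in your own ItemSimilarity here to inject custom logic.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        List<RecommendedItem> top = recommender.recommend(12345L, 10);
        for (RecommendedItem item : top) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }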

Sean
