Posted to user@mahout.apache.org by "Razon, Oren" <or...@intel.com> on 2012/03/22 12:35:10 UTC

Mahout beginner questions...

Hi,
As a data mining developer who needs to build a recommender engine POC (proof of concept) to support several future use cases, I've found the Mahout framework an appealing place to start. But as I'm new to Mahout and to Hadoop in general, I have a couple of questions...

1.      In "Mahout in Action", under section 3.2.5 (Database-based data), it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documentation and inside the code itself, but didn't find any reference to which calculations are pushed into the DB. Could you please explain what can be done inside the DB?
2.      My future use will include use cases with small-to-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that involve huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should come in handy. My question here is: because I will need to use both the distributed & non-distributed code, how could I build a good design here?
      Should I build two different solutions on different machines? Could I do part of the job distributed (for example the similarity calculation) and have the output used by the non-distributed code? Is it a BKM (best-known method)? Also, if I deploy the entire Mahout code on a Hadoop environment, what does it mean for the non-distributed code -- will it all run as a separate Java process on the name node?
3.      As of now, besides the Hadoop cluster we are building, we have some strong SQL machines (a Netezza appliance) that can handle big (structured) data and include good integration with third-party analytics providers or development on the Java platform, but don't include such a rich recommender framework as Mahout. I'm trying to understand how I could utilize both solutions (Netezza & Mahout) to handle big-data recommender system use cases. I thought maybe to move the data into Netezza, do all the data manipulation and transformation there, and in the end prepare a file that contains the classic data model structure needed by Mahout. But can you think of a better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)

Thanks,
Oren
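
A minimal sketch of the hybrid pattern the replies below converge on for question 2: precompute top-k item-item similarities offline with Mahout's Hadoop ItemSimilarityJob, then feed its output to a non-distributed recommender on a serving machine. The option names, paths and values here are assumptions to verify against the Mahout version in use, not something taken from this thread.

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

    public class OfflineSimilarityStep {
      public static void main(String[] args) throws Exception {
        // Offline (Hadoop) step: compute top-k item-item similarities from a
        // userID,itemID,rating CSV stored in HDFS.
        ToolRunner.run(new ItemSimilarityJob(), new String[] {
            "--input", "/data/ratings.csv",
            "--output", "/data/item-similarities",
            "--similarityClassname", "SIMILARITY_COSINE",
            "--maxSimilaritiesPerItem", "50"
        });
        // The output (itemA,itemB,similarity records) can then be loaded into a
        // non-distributed, in-memory recommender on a serving machine.
      }
    }

The same output would feed the in-memory serving sketch further down the thread.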









Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
It might or might not be interesting to comment on this discussion in
light of the new product/project I mentioned last night, Myrrix.

It's definitely an example of precisely this two-layered architecture
we've been discussing on this thread. http://myrrix.com/design/

The nice thing about a matrix-factorization-based approach is that
it's feasible to load this entire 'model' into memory -- the two
factored matrices. Everything can be done from these: recommendation,
most-similar, estimates, even fast approximate updates to the model
for new data. Being able to work in memory keeps it fast and simple.

If even those get too big for memory, you can shard across servers, by
user ID (and include only part of the user-feature matrix on each).
Sharding the item-feature matrix gets hard.

Sean
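
A minimal sketch of what serving purely from the two factored matrices looks like: with one feature vector per user and per item held in memory, an estimate is a dot product and a recommendation is the top-N dot products. Plain Java with hypothetical field names; no Myrrix or Mahout API is implied.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class FactorModelSketch {
      // Hypothetical in-memory model: one feature vector per user and per item.
      final Map<Long, double[]> userFeatures = new HashMap<Long, double[]>();
      final Map<Long, double[]> itemFeatures = new HashMap<Long, double[]>();

      static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
      }

      // Estimated preference is just the dot product of the two feature vectors.
      double estimatePreference(long userID, long itemID) {
        return dot(userFeatures.get(userID), itemFeatures.get(itemID));
      }

      // Recommend by scoring every item for the user and keeping the best N.
      List<Long> recommend(long userID, int howMany) {
        final double[] u = userFeatures.get(userID);
        List<Long> itemIDs = new ArrayList<Long>(itemFeatures.keySet());
        Collections.sort(itemIDs, new Comparator<Long>() {
          public int compare(Long a, Long b) {  // highest score first
            return Double.compare(dot(u, itemFeatures.get(b)), dot(u, itemFeatures.get(a)));
          }
        });
        return itemIDs.subList(0, Math.min(howMany, itemIDs.size()));
      }
    }

A real deployment would keep a bounded heap of best scores per request, or shard the user-feature map by user ID as described above, but the data access pattern is the same.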

On Thu, Apr 5, 2012 at 8:47 AM, Sebastian Schelter <ss...@apache.org> wrote:
> You don't have to hold the rating matrix in memory. When computing
> recommendations for a user, fetch all his ratings from some datastore
> (database, key-value-store, memcache...) with a single query and use the
> item similarities that are held in-memory to compute the recommendations.
>

Re: Mahout beginner questions...

Posted by Sebastian Schelter <ss...@apache.org>.
You don't have to hold the rating matrix in memory. When computing
recommendations for a user, fetch all his ratings from some datastore
(database, key-value-store, memcache...) with a single query and use the
item similarities that are held in-memory to compute the recommendations.

--sebastian
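
A minimal sketch of that serving pattern using Mahout's Taste classes as I recall them: GenericItemSimilarity holds the precomputed similarities in memory, and a tiny per-request DataModel is built from one user's ratings fetched from the datastore. Class names and constructors should be checked against your Mahout version; the hard-coded IDs and values are placeholders.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
    import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class ServingSketch {
      public static void main(String[] args) throws TasteException {
        // Precomputed item-item similarities, e.g. loaded once at startup from
        // the output of an offline job. Hard-coded here for illustration.
        List<GenericItemSimilarity.ItemItemSimilarity> sims = Arrays.asList(
            new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.9),
            new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.4),
            new GenericItemSimilarity.ItemItemSimilarity(102L, 103L, 0.7));
        GenericItemSimilarity inMemorySims = new GenericItemSimilarity(sims);

        // Per request: fetch just this user's ratings from whatever datastore
        // you use (a single query) and wrap them in a tiny DataModel.
        PreferenceArray prefs = new GenericUserPreferenceArray(2);
        prefs.setUserID(0, 1L); prefs.setItemID(0, 101L); prefs.setValue(0, 4.0f);
        prefs.setUserID(1, 1L); prefs.setItemID(1, 103L); prefs.setValue(1, 2.0f);
        FastByIDMap<PreferenceArray> data = new FastByIDMap<PreferenceArray>();
        data.put(1L, prefs);

        GenericItemBasedRecommender rec =
            new GenericItemBasedRecommender(new GenericDataModel(data), inMemorySims);
        for (RecommendedItem item : rec.recommend(1L, 5)) {
          System.out.println(item.getItemID() + " -> " + item.getValue());
        }
      }
    }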

On 05.04.2012 09:44, Razon, Oren wrote:
> Thanks for the answer, but still...
> I will need to keep in memory the rating matrix so I will be able to utilize the ranking a user gave to items together with the item similarity.
> 
> -----Original Message-----
> From: Sebastian Schelter [mailto:ssc@apache.org] 
> Sent: Thursday, April 05, 2012 10:34
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> Hi Oren,
> 
> If you use an item-based approach, its sufficient to use the top-k
> similar items per item (with k somewhere between 25 and 100). That means
> the data to hold in memory is num_items * k data points.
> 
> While this is a theoretical limitation, it should not be a problem in
> practical scenarios, as you can easily fit some hundred million of that
> datapoints in a few gigabytes of RAM.
> 
> --sebastian
> 
> 
> On 05.04.2012 09:27, Razon, Oren wrote:
>> Ok, so here is the point I still not getting.
>>
>> The architecture we are talking about is to push heavy computation for offline work, for that I could utilize Hadoop part.
>> Beside, having an online part, which will retrieve the recommendation from the pre-computed results or even will do some more computation online to try and adjust the recommendation to current user context. 
>> But as you said for the JDBC connector, in order to serve recommendations fast, the online recommender need to have all pre-computed results in-memory. So isn't it a limitation to scale up? It means that as long as my recommender service  is growing I will need more memory in order to hold it all in-memory in the online part...
>> Am I wrong here?  
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com] 
>> Sent: Thursday, March 22, 2012 17:57
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> A distributed and non-distributed recommender are really quite
>> separate. They perform the same task in quite different ways. I don't
>> think you would mix them per se.
>>
>> Depends on what you mean by a model-based recommender... I would call
>> the matrix-factorization-based and clustering-based approaches
>> "model-based" in the sense that they assume the existence of some
>> underlying structure and discover it. There's no Bayesian-style
>> approaches in the code.
>>
>> They scale in different ways; I am not sure they are unilaterally a
>> solution to scale, no. I do agree in general that these have good
>> scaling properties for real-world use cases, like the
>> matrix-factorization approaches.
>>
>>
>> A "real" scalable architecture would have a real-time component and a
>> big distributed computation component. Mahout has elements of both and
>> can be the basis for piecing that together, but it's not a question of
>> strapping together the distributed and non-distributed implementation.
>> It's a bit harder than that.
>>
>>
>> I am actually quite close to being ready to show off something in this
>> area -- I have been working separately on a more complete rec system
>> that has both the real-time element but integrated directly with a
>> distributed element to handle the large-scale computation. I think
>> this is typical of big data architectures. You have (at least) a
>> real-time distributed "Serving Layer" and a big distributed batch
>> "Computation Layer". More on this in about... 2 weeks.
>>
>>
>> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <or...@intel.com> wrote:
>>> Hi Sean,
>>> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
>>> Just to clear my second question...
>>> I want to build a recommender framework that will support different use cases.  So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>>>
>>> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
>>> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>>>
>>>
>>> -----Original Message-----
>>> From: Sean Owen [mailto:srowen@gmail.com]
>>> Sent: Thursday, March 22, 2012 13:51
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout beginner questions...
>>>
>>> 1. These are the JDBC-related classes. For example see
>>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>>>
>>> 2. The distributed and non-distributed code are quite separate. At
>>> this scale I don't think you can use the non-distributed code to a
>>> meaningful degree. For example you could pre-compute item-item
>>> similarities over this data and use a non-distributed item-based
>>> recommender but you probably have enough items that this will strain
>>> memory. You would probably be looking at pre-computing recommendations
>>> in batch.
>>>
>>> 3. I don't think Netezza will help much here. It's still not fast
>>> enough at this scale to use with a real-time recommender (nothing is).
>>> If it's just a place you store data to feed into Hadoop it's not
>>> adding value. All the JDBC-related integrations ultimately load data
>>> into memory and that's out of the question with 500M data points.
>>>
>>> I'd also suggest you have a think about whether you "really" have 500M
>>> data points. Often you can know that most of the data is noise or not
>>> useful, and can get useful recommendations on a fraction of the data
>>> (maybe 5M). That makes a lot of things easier.
>>>
>>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <or...@intel.com> wrote:
>>>> Hi,
>>>> As a data mining developer who need to build a recommender engine POC (Proof Of Concept) to support several future use cases, I've found Mahout framework as an appealing place to start with. But as I'm new to Mahout and Hadoop in general I've a couple of questions...
>>>>
>>>> 1.      In "Mahout in action" under section 3.2.5 (Database-based data) it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documents and inside the code itself, but didn't found anywhere a reference to what are those calculations that are pushed into the DB. Could you please explain what could be done inside the DB?
>>>> 2.      My future use will include use cases with small-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that include huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should be come handy. My question here is, because I will need to use both distributed & non-distributed how could I build a good design here?
>>>>      Should I build two different solutions on different machines? Could I do part of the job distributed (for example similarity calculation) and the output will be used for the non-distributed code? Is it a BKM? Also if I deploy entire mahout code on an Hadoop environment, what does it mean for the non-distributed code, will it all run as a different java process on the name node?
>>>> 3.      As for now, beside of the Hadoop cluster we are building we have some strong SQL machines (Netezza appliance) that can handle big (structure) data and include good integration with 3'rd party analytics providers or developing on java platform but don't include such reach recommender framework like Mahout. I'm trying to understand how could I utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases. Thought maybe to move data into Netezza, do there all data manipulation and transformation, and in the end to prepare a file that contain the classic data model structure needed for Mahout. But could you think on better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)
>>>>
>>>> Thanks,
>>>> Oren
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>


RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Thanks for the answer, but still...
I will still need to keep the rating matrix in memory so that I can use the ratings a user gave to items together with the item similarities.

-----Original Message-----
From: Sebastian Schelter [mailto:ssc@apache.org] 
Sent: Thursday, April 05, 2012 10:34
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

Hi Oren,

If you use an item-based approach, its sufficient to use the top-k
similar items per item (with k somewhere between 25 and 100). That means
the data to hold in memory is num_items * k data points.

While this is a theoretical limitation, it should not be a problem in
practical scenarios, as you can easily fit some hundred million of that
datapoints in a few gigabytes of RAM.

--sebastian


On 05.04.2012 09:27, Razon, Oren wrote:
> Ok, so here is the point I still not getting.
> 
> The architecture we are talking about is to push heavy computation for offline work, for that I could utilize Hadoop part.
> Beside, having an online part, which will retrieve the recommendation from the pre-computed results or even will do some more computation online to try and adjust the recommendation to current user context. 
> But as you said for the JDBC connector, in order to serve recommendations fast, the online recommender need to have all pre-computed results in-memory. So isn't it a limitation to scale up? It means that as long as my recommender service  is growing I will need more memory in order to hold it all in-memory in the online part...
> Am I wrong here?  
> 
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com] 
> Sent: Thursday, March 22, 2012 17:57
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> A distributed and non-distributed recommender are really quite
> separate. They perform the same task in quite different ways. I don't
> think you would mix them per se.
> 
> Depends on what you mean by a model-based recommender... I would call
> the matrix-factorization-based and clustering-based approaches
> "model-based" in the sense that they assume the existence of some
> underlying structure and discover it. There's no Bayesian-style
> approaches in the code.
> 
> They scale in different ways; I am not sure they are unilaterally a
> solution to scale, no. I do agree in general that these have good
> scaling properties for real-world use cases, like the
> matrix-factorization approaches.
> 
> 
> A "real" scalable architecture would have a real-time component and a
> big distributed computation component. Mahout has elements of both and
> can be the basis for piecing that together, but it's not a question of
> strapping together the distributed and non-distributed implementation.
> It's a bit harder than that.
> 
> 
> I am actually quite close to being ready to show off something in this
> area -- I have been working separately on a more complete rec system
> that has both the real-time element but integrated directly with a
> distributed element to handle the large-scale computation. I think
> this is typical of big data architectures. You have (at least) a
> real-time distributed "Serving Layer" and a big distributed batch
> "Computation Layer". More on this in about... 2 weeks.
> 
> 
> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <or...@intel.com> wrote:
>> Hi Sean,
>> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
>> Just to clear my second question...
>> I want to build a recommender framework that will support different use cases.  So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>>
>> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
>> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>>
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com]
>> Sent: Thursday, March 22, 2012 13:51
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> 1. These are the JDBC-related classes. For example see
>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>>
>> 2. The distributed and non-distributed code are quite separate. At
>> this scale I don't think you can use the non-distributed code to a
>> meaningful degree. For example you could pre-compute item-item
>> similarities over this data and use a non-distributed item-based
>> recommender but you probably have enough items that this will strain
>> memory. You would probably be looking at pre-computing recommendations
>> in batch.
>>
>> 3. I don't think Netezza will help much here. It's still not fast
>> enough at this scale to use with a real-time recommender (nothing is).
>> If it's just a place you store data to feed into Hadoop it's not
>> adding value. All the JDBC-related integrations ultimately load data
>> into memory and that's out of the question with 500M data points.
>>
>> I'd also suggest you have a think about whether you "really" have 500M
>> data points. Often you can know that most of the data is noise or not
>> useful, and can get useful recommendations on a fraction of the data
>> (maybe 5M). That makes a lot of things easier.
>>
>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <or...@intel.com> wrote:
>>> Hi,
>>> As a data mining developer who need to build a recommender engine POC (Proof Of Concept) to support several future use cases, I've found Mahout framework as an appealing place to start with. But as I'm new to Mahout and Hadoop in general I've a couple of questions...
>>>
>>> 1.      In "Mahout in action" under section 3.2.5 (Database-based data) it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documents and inside the code itself, but didn't found anywhere a reference to what are those calculations that are pushed into the DB. Could you please explain what could be done inside the DB?
>>> 2.      My future use will include use cases with small-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that include huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should be come handy. My question here is, because I will need to use both distributed & non-distributed how could I build a good design here?
>>>      Should I build two different solutions on different machines? Could I do part of the job distributed (for example similarity calculation) and the output will be used for the non-distributed code? Is it a BKM? Also if I deploy entire mahout code on an Hadoop environment, what does it mean for the non-distributed code, will it all run as a different java process on the name node?
>>> 3.      As for now, beside of the Hadoop cluster we are building we have some strong SQL machines (Netezza appliance) that can handle big (structure) data and include good integration with 3'rd party analytics providers or developing on java platform but don't include such reach recommender framework like Mahout. I'm trying to understand how could I utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases. Thought maybe to move data into Netezza, do there all data manipulation and transformation, and in the end to prepare a file that contain the classic data model structure needed for Mahout. But could you think on better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)
>>>
>>> Thanks,
>>> Oren
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>

Re: Mahout beginner questions...

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Oren,

If you use an item-based approach, it's sufficient to use the top-k
similar items per item (with k somewhere between 25 and 100). That means
the data to hold in memory is num_items * k data points.

While this is a theoretical limitation, it should not be a problem in
practical scenarios, as you can easily fit some hundred million such
data points in a few gigabytes of RAM.

--sebastian
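
A back-of-envelope version of that estimate, as plain arithmetic rather than anything from Mahout; the catalogue size, k, and the 24 bytes per entry (two item IDs plus a double similarity, ignoring JVM object overhead) are all assumptions.

    public class SimilarityMemoryEstimate {
      public static void main(String[] args) {
        long numItems = 1000000L;        // hypothetical catalogue size
        int k = 50;                      // top-k similar items kept per item
        long entries = numItems * k;     // num_items * k data points
        long bytesPerEntry = 8 + 8 + 8;  // itemID + itemID + double similarity
        double gb = entries * bytesPerEntry / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("%d entries ~ %.1f GB before JVM overhead%n", entries, gb);
      }
    }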


On 05.04.2012 09:27, Razon, Oren wrote:
> Ok, so here is the point I still not getting.
> 
> The architecture we are talking about is to push heavy computation for offline work, for that I could utilize Hadoop part.
> Beside, having an online part, which will retrieve the recommendation from the pre-computed results or even will do some more computation online to try and adjust the recommendation to current user context. 
> But as you said for the JDBC connector, in order to serve recommendations fast, the online recommender need to have all pre-computed results in-memory. So isn't it a limitation to scale up? It means that as long as my recommender service  is growing I will need more memory in order to hold it all in-memory in the online part...
> Am I wrong here?  
> 
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com] 
> Sent: Thursday, March 22, 2012 17:57
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> A distributed and non-distributed recommender are really quite
> separate. They perform the same task in quite different ways. I don't
> think you would mix them per se.
> 
> Depends on what you mean by a model-based recommender... I would call
> the matrix-factorization-based and clustering-based approaches
> "model-based" in the sense that they assume the existence of some
> underlying structure and discover it. There's no Bayesian-style
> approaches in the code.
> 
> They scale in different ways; I am not sure they are unilaterally a
> solution to scale, no. I do agree in general that these have good
> scaling properties for real-world use cases, like the
> matrix-factorization approaches.
> 
> 
> A "real" scalable architecture would have a real-time component and a
> big distributed computation component. Mahout has elements of both and
> can be the basis for piecing that together, but it's not a question of
> strapping together the distributed and non-distributed implementation.
> It's a bit harder than that.
> 
> 
> I am actually quite close to being ready to show off something in this
> area -- I have been working separately on a more complete rec system
> that has both the real-time element but integrated directly with a
> distributed element to handle the large-scale computation. I think
> this is typical of big data architectures. You have (at least) a
> real-time distributed "Serving Layer" and a big distributed batch
> "Computation Layer". More on this in about... 2 weeks.
> 
> 
> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <or...@intel.com> wrote:
>> Hi Sean,
>> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
>> Just to clear my second question...
>> I want to build a recommender framework that will support different use cases.  So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>>
>> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
>> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>>
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:srowen@gmail.com]
>> Sent: Thursday, March 22, 2012 13:51
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> 1. These are the JDBC-related classes. For example see
>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>>
>> 2. The distributed and non-distributed code are quite separate. At
>> this scale I don't think you can use the non-distributed code to a
>> meaningful degree. For example you could pre-compute item-item
>> similarities over this data and use a non-distributed item-based
>> recommender but you probably have enough items that this will strain
>> memory. You would probably be looking at pre-computing recommendations
>> in batch.
>>
>> 3. I don't think Netezza will help much here. It's still not fast
>> enough at this scale to use with a real-time recommender (nothing is).
>> If it's just a place you store data to feed into Hadoop it's not
>> adding value. All the JDBC-related integrations ultimately load data
>> into memory and that's out of the question with 500M data points.
>>
>> I'd also suggest you have a think about whether you "really" have 500M
>> data points. Often you can know that most of the data is noise or not
>> useful, and can get useful recommendations on a fraction of the data
>> (maybe 5M). That makes a lot of things easier.
>>
>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <or...@intel.com> wrote:
>>> Hi,
>>> As a data mining developer who need to build a recommender engine POC (Proof Of Concept) to support several future use cases, I've found Mahout framework as an appealing place to start with. But as I'm new to Mahout and Hadoop in general I've a couple of questions...
>>>
>>> 1.      In "Mahout in action" under section 3.2.5 (Database-based data) it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documents and inside the code itself, but didn't found anywhere a reference to what are those calculations that are pushed into the DB. Could you please explain what could be done inside the DB?
>>> 2.      My future use will include use cases with small-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that include huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should be come handy. My question here is, because I will need to use both distributed & non-distributed how could I build a good design here?
>>>      Should I build two different solutions on different machines? Could I do part of the job distributed (for example similarity calculation) and the output will be used for the non-distributed code? Is it a BKM? Also if I deploy entire mahout code on an Hadoop environment, what does it mean for the non-distributed code, will it all run as a different java process on the name node?
>>> 3.      As for now, beside of the Hadoop cluster we are building we have some strong SQL machines (Netezza appliance) that can handle big (structure) data and include good integration with 3'rd party analytics providers or developing on java platform but don't include such reach recommender framework like Mahout. I'm trying to understand how could I utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases. Thought maybe to move data into Netezza, do there all data manipulation and transformation, and in the end to prepare a file that contain the classic data model structure needed for Mahout. But could you think on better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)
>>>
>>> Thanks,
>>> Oren
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>


RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Ok, so here is the point I'm still not getting.

The architecture we are talking about is to push the heavy computation to offline work; for that I could utilize the Hadoop part.
Besides that, there is an online part, which will retrieve recommendations from the pre-computed results, or even do some more computation online to try to adjust the recommendation to the current user context.
But as you said for the JDBC connector, in order to serve recommendations fast, the online recommender needs to have all pre-computed results in memory. So isn't that a limitation to scaling up? It means that as my recommender service grows, I will need more memory in order to hold it all in memory in the online part...
Am I wrong here?

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Thursday, March 22, 2012 17:57
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

A distributed and non-distributed recommender are really quite
separate. They perform the same task in quite different ways. I don't
think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call
the matrix-factorization-based and clustering-based approaches
"model-based" in the sense that they assume the existence of some
underlying structure and discover it. There's no Bayesian-style
approaches in the code.

They scale in different ways; I am not sure they are unilaterally a
solution to scale, no. I do agree in general that these have good
scaling properties for real-world use cases, like the
matrix-factorization approaches.


A "real" scalable architecture would have a real-time component and a
big distributed computation component. Mahout has elements of both and
can be the basis for piecing that together, but it's not a question of
strapping together the distributed and non-distributed implementation.
It's a bit harder than that.


I am actually quite close to being ready to show off something in this
area -- I have been working separately on a more complete rec system
that has both the real-time element but integrated directly with a
distributed element to handle the large-scale computation. I think
this is typical of big data architectures. You have (at least) a
real-time distributed "Serving Layer" and a big distributed batch
"Computation Layer". More on this in about... 2 weeks.


On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <or...@intel.com> wrote:
> Hi Sean,
> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
> Just to clear my second question...
> I want to build a recommender framework that will support different use cases.  So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>
> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Thursday, March 22, 2012 13:51
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> 1. These are the JDBC-related classes. For example see
> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>
> 2. The distributed and non-distributed code are quite separate. At
> this scale I don't think you can use the non-distributed code to a
> meaningful degree. For example you could pre-compute item-item
> similarities over this data and use a non-distributed item-based
> recommender but you probably have enough items that this will strain
> memory. You would probably be looking at pre-computing recommendations
> in batch.
>
> 3. I don't think Netezza will help much here. It's still not fast
> enough at this scale to use with a real-time recommender (nothing is).
> If it's just a place you store data to feed into Hadoop it's not
> adding value. All the JDBC-related integrations ultimately load data
> into memory and that's out of the question with 500M data points.
>
> I'd also suggest you have a think about whether you "really" have 500M
> data points. Often you can know that most of the data is noise or not
> useful, and can get useful recommendations on a fraction of the data
> (maybe 5M). That makes a lot of things easier.
>
> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <or...@intel.com> wrote:
>> Hi,
>> As a data mining developer who need to build a recommender engine POC (Proof Of Concept) to support several future use cases, I've found Mahout framework as an appealing place to start with. But as I'm new to Mahout and Hadoop in general I've a couple of questions...
>>
>> 1.      In "Mahout in action" under section 3.2.5 (Database-based data) it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documents and inside the code itself, but didn't found anywhere a reference to what are those calculations that are pushed into the DB. Could you please explain what could be done inside the DB?
>> 2.      My future use will include use cases with small-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that include huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should be come handy. My question here is, because I will need to use both distributed & non-distributed how could I build a good design here?
>>      Should I build two different solutions on different machines? Could I do part of the job distributed (for example similarity calculation) and the output will be used for the non-distributed code? Is it a BKM? Also if I deploy entire mahout code on an Hadoop environment, what does it mean for the non-distributed code, will it all run as a different java process on the name node?
>> 3.      As for now, beside of the Hadoop cluster we are building we have some strong SQL machines (Netezza appliance) that can handle big (structure) data and include good integration with 3'rd party analytics providers or developing on java platform but don't include such reach recommender framework like Mahout. I'm trying to understand how could I utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases. Thought maybe to move data into Netezza, do there all data manipulation and transformation, and in the end to prepare a file that contain the classic data model structure needed for Mahout. But could you think on better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)
>>
>> Thanks,
>> Oren
>>
>>
>>
>>
>>
>>
>>
>>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
Caching recommendations is a good use of memory, sure. It doesn't decrease
memory requirements and doesn't speed up the initial recommendation though.

Yes pre-computing recommendations is also possible. This is more or less
what the Hadoop-based implementation is for. That scales just fine but is
not real-time. You're waiting X minutes/hours to see any reaction to new
data a user inputs. For some contexts, that's fine. For many it's not; I
expect my recs to change every time I rate a book on Amazon. That's almost
the fun of it.

Ted is right that you may more commonly pre-compute some big piece of the
puzzle like item-item similarities or a matrix factorization. Then you can
finish the rec computation quite quickly, and it can respond to new data
straight away (at least approximately).

This is the sort of setup I was alluding to earlier, and what a 'real' and
complete, scalable system resembles. It's more complex.

This does not exist per se in the project. The pieces are there, in fact
80% of it I'd say, but the stitching together is still mostly up to the
developer.
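
For the "caching recommendations" part of this exchange, a minimal sketch of wrapping a Taste recommender in CachingRecommender; the constructor details, similarity choice and CSV path are assumptions to verify against your Mahout version.

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class CachingSketch {
      public static void main(String[] args) throws Exception {
        // Underlying recommender; the CSV path is a placeholder.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        Recommender delegate =
            new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

        // CachingRecommender memoizes answers; it speeds up repeat requests but,
        // as noted above, not the first computation for a user.
        CachingRecommender cached = new CachingRecommender(delegate);
        System.out.println(cached.recommend(1L, 10));  // computed, then cached
        System.out.println(cached.recommend(1L, 10));  // served from the cache

        // Periodically (every X minutes/hours) pick up new data and drop caches.
        cached.refresh(null);
      }
    }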



On Sun, Mar 25, 2012 at 8:28 PM, Razon, Oren <or...@intel.com> wrote:

> Correct me if I'm wrong, but a good way to boost up speed could be to use a
> caching recommender, meaning computing the recommendations in advance
> (refresh it every X min\hours) and always recommend using the most updated
> recommendations, right?!
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Sunday, March 25, 2012 21:25
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to ben an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I stil
> believe that 100M is a crude but useful rule of thumb, as to the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it memory issue (and then if I will have a bigger
> > machine, meaning I could scale up), or is it because of the
> recommendation
> > time it takes?
> >
> >
>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
Yes, my position is that you need at least these two layers in the end.

To get straight to your point, no you don't have to load all item-item
pairs in memory, necessarily. At one extreme, if you completely
pre-computed recommendations and didn't calculate anything in real-time,
you wouldn't need any of that in memory. Even if you did load it in memory,
you could sample, and only retain the tiny fraction of similarities that
are significant.

The more you down-sample, the less accurate the results become of course.
One point I'm driving at is that in this hybrid model, where you
periodically recompute "best" results based on all data, off-line, you can
get away with much more approximate updates in real-time. A new datum ought
to have some effect, and some roughly correct effect, but it's not such a
big deal if it's not perfect, since the right-er answer is coming soon
anyway and will overwrite.

And of course the properties of an item-item similarity-based approach
aren't necessarily those of other approaches. For example with
matrix-factorization approaches there is a much more well-defined (and
faster) way to fold in new data. And the data that must live in memory is
also bounded and relatively smaller.
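
A crude illustration of the "fold in new data" idea for a factorization model, not Mahout's ALS code: the item factors stay fixed and only the new user's vector is re-fit with a few gradient passes over that user's ratings. All names and constants here are made up for the sketch.

    import java.util.HashMap;
    import java.util.Map;

    public class FoldInSketch {
      // Re-fit a single user's feature vector against fixed item features, using
      // a few SGD passes over that user's ratings only. This is the "approximate,
      // roughly correct" update; the periodic offline refactorization later
      // overwrites it with the full answer.
      static double[] foldInUser(Map<Long, Float> userRatings,
                                 Map<Long, double[]> itemFeatures,
                                 int numFeatures) {
        double[] u = new double[numFeatures];  // start from zero
        double learningRate = 0.05;
        double regularization = 0.02;
        for (int pass = 0; pass < 20; pass++) {
          for (Map.Entry<Long, Float> r : userRatings.entrySet()) {
            double[] v = itemFeatures.get(r.getKey());
            if (v == null) continue;           // item unknown to the model
            double err = r.getValue() - dot(u, v);
            for (int f = 0; f < numFeatures; f++) {
              u[f] += learningRate * (err * v[f] - regularization * u[f]);
            }
          }
        }
        return u;
      }

      static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
      }

      public static void main(String[] args) {
        Map<Long, double[]> itemFeatures = new HashMap<Long, double[]>();
        itemFeatures.put(101L, new double[] {0.9, 0.1});
        itemFeatures.put(102L, new double[] {0.2, 0.8});
        Map<Long, Float> ratings = new HashMap<Long, Float>();
        ratings.put(101L, 5.0f);
        ratings.put(102L, 1.0f);
        double[] u = foldInUser(ratings, itemFeatures, 2);
        System.out.printf("folded-in user vector: [%.2f, %.2f]%n", u[0], u[1]);
      }
    }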


On Mon, Mar 26, 2012 at 11:05 AM, Razon, Oren <or...@intel.com> wrote:

> Saying that, my conclusion so far (sorry if I'm a bit slow here :)) --> I
> need to have the 2 parts (offline and online) in place, If I plan to have a
> real scalable machine that could do some of the recommendation calculations
> in real time in order to interact with the user dynamically.
>
> But I'm still not quite sure I've understood how I can scale with that...
> As more as I'm pushing computation to offline I guess I'm less concerned
> with the retrieving time. From that perspective I could scale
>
> But I'm still not sure how it help me to scale from memory perspective...
> Even if I computed all similarities in advanced I still need to load the
> entire similarity result file into my memory in order that the online part
> will calculate his part. Maybe I'm wrong here, and I don't necessarily need
> to load the entire intermediate file (similarity results) into the memory?!
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Monday, March 26, 2012 11:48
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> I'm sure he's referring to the off-line model-building bit, not an online
> component.
>
> On Mon, Mar 26, 2012 at 9:27 AM, Razon, Oren <or...@intel.com> wrote:
>
> > By saying: "At Veoh, we built our models from several billion
> interactions
> > on a tiny cluster " you meant that you used the distributed code on your
> > cluster as an online recommender?
> > From what I've understood so far, I can't rely only on the Hadoop part if
> > I want a truly real time recommender that will modify his recommendations
> > and models per click of the user (because you need to rebuild the data in
> > the HDFS run you batch job, and return an answer)
> >
> >
>

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Saying that, my conclusion so far (sorry if I'm a bit slow here :)) --> I need to have the two parts (offline and online) in place if I plan to have a really scalable machine that can do some of the recommendation calculations in real time in order to interact with the user dynamically.

But I'm still not quite sure I've understood how I can scale with that...
The more I push computation offline, the less concerned I am with retrieval time, I guess. From that perspective I could scale.

But I'm still not sure how it helps me scale from a memory perspective...
Even if I computed all similarities in advance, I still need to load the entire similarity result file into memory so that the online part can do its calculation. Maybe I'm wrong here, and I don't necessarily need to load the entire intermediate file (similarity results) into memory?!

 
-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Monday, March 26, 2012 11:48
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

I'm sure he's referring to the off-line model-building bit, not an online
component.

On Mon, Mar 26, 2012 at 9:27 AM, Razon, Oren <or...@intel.com> wrote:

> By saying: "At Veoh, we built our models from several billion interactions
> on a tiny cluster " you meant that you used the distributed code on your
> cluster as an online recommender?
> From what I've understood so far, I can't rely only on the Hadoop part if
> I want a truly real time recommender that will modify his recommendations
> and models per click of the user (because you need to rebuild the data in
> the HDFS run you batch job, and return an answer)
>
>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
I'm sure he's referring to the off-line model-building bit, not an online
component.

On Mon, Mar 26, 2012 at 9:27 AM, Razon, Oren <or...@intel.com> wrote:

> By saying: "At Veoh, we built our models from several billion interactions
> on a tiny cluster " you meant that you used the distributed code on your
> cluster as an online recommender?
> From what I've understood so far, I can't rely only on the Hadoop part if
> I want a truly real time recommender that will modify his recommendations
> and models per click of the user (because you need to rebuild the data in
> the HDFS run you batch job, and return an answer)
>
>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
An SQL database doesn't have much role to play in this kind of system,
and that's no criticism of RDBMSes.

The algorithms operate on very simple, nearly unstructured data and
are essentially read-only. So the complexity of keys and transactions
is just overhead. The simple, non-distributed implementations need a
huge amount of random access to data. Even lean fast NoSQL stores
aren't really suitable; these are just going to be in-memory problems.

If you're just going to read into memory, well, it's certainly
possible and simple to read that out of an RDBMS. But it might as well
come from a file; there's no advantage to having bothered to put it in
a table.

(Of course at tiny scale, a DB can keep up fine. 100K data points? no
problem. That's why things like MySQLJDBCDataModel even exist I
suppose.)

Once you go to the trouble of parallelizing the algorithm, and
breaking it up so that every computation doesn't touch so much data
(and, this is often the hard, clever part) you can split it up using
MapReduce / Hadoop and those tiny workers can meaningfully crunch
through parts of the problem. There too, they are simple beasts and
have a simple sequential read-only input model. You could make them
too read out of an RDBMS, but at *best*, it's overkill; it might as
well have come from a dumber store like HDFS. At *worst* it will still
fall over when 1000 workers try to pull (unrelated) data out of the
same table and overwhelm the RDBMS machine, when the whole point of
parallelizing it was to be able to read in parallel chunks of
unrelated data from many storage servers -- a la HDFS.
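
A small sketch of the "might as well come from a file" point, assuming Mahout's FileDataModel and MySQLJDBCDataModel plus the Connector/J MysqlDataSource; the default-table-layout constructor and the connection settings shown are assumptions, not taken from this thread.

    import java.io.File;
    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class DataModelChoiceSketch {
      public static void main(String[] args) throws Exception {
        // Option A: read userID,itemID,value lines straight from a flat file.
        DataModel fromFile = new FileDataModel(new File("ratings.csv"));

        // Option B: pull the same data out of MySQL. The preferences still end
        // up in memory, so at serving time this buys little over the file.
        MysqlDataSource ds = new MysqlDataSource();
        ds.setServerName("localhost");
        ds.setDatabaseName("recs");
        ds.setUser("mahout");
        ds.setPassword("secret");
        // Assumes the default taste_preferences table layout; other
        // constructors take explicit table/column names.
        DataModel fromDb = new MySQLJDBCDataModel(ds);

        System.out.println(fromFile.getNumUsers() + " users from the file");
        System.out.println(fromDb.getNumUsers() + " users from the database");
      }
    }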



On Mon, Mar 26, 2012 at 4:42 PM, Razon, Oren <or...@intel.com> wrote:
> Another question that crossed my mind.
> Consider all you said below... I'm not quite sure when will I want to use a SQL machine at all as my data source?
> Response perspective --> You said it will take much more than reading from a file
> Memory perspective --> In the end you need to move the data from the DB into your memory
>
> So what is the pros in doing so? When should I consider it?
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Monday, March 26, 2012 15:52
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> No.  I meant that I used the same sort of combined offline and online processes that I have recommended to you.  The cluster did the offline part and a web tier did the online part.
>
> Sent from my iPhone
>
> On Mar 26, 2012, at 1:27 AM, "Razon, Oren" <or...@intel.com> wrote:
>
>> By saying: "At Veoh, we built our models from several billion interactions on a tiny cluster " you meant that you used the distributed code on your cluster as an online recommender?
>> From what I've understood so far, I can't rely only on the Hadoop part if I want a truly real time recommender that will modify his recommendations and models per click of the user (because you need to rebuild the data in the HDFS run you batch job, and return an answer)
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Monday, March 26, 2012 00:56
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:
>>
>>> ...
>>> The system I need should of course give the recommendation itself in no
>>> time.
>>> ...
>>
>> But because I'm talking about very large scales, I guess that I want to
>>> push much of my model computation to offline mode (which will be refreshed
>>> every X minutes).
>>>
>>
>> Actually, you aren't talking about all that large a scale.  At Veoh, we
>> built our models from several billion interactions on a tiny cluster.
>>
>>
>>> So my options are like that (considering I want to build a real scalable
>>> solution):
>>> Use the non-distributed \ distributed code to compute some of my model in
>>> advance (for example similarity between items \ KNN for each users) --> I
>>> guess that for that part, considering I'm offline, the mapreduce code is
>>> idle, because of his scalability.
>>>
>>
>> Repeating what I said earlier, the offline part produces item-item
>> information only.  It does not produce KNN data for any users.  There is no
>> reference to a user in the result.
>>
>>
>>> Than use a non-distributed online code to calculate the final
>>> recommendations based on the pre computed part and do some final
>>> computation (weighting the KNN ratings for items my user didn't experienced
>>> yet)
>>>
>>
>> All that happens here is that item => item* lists are combined.
>>
>>
>>> In order to be able to do so, I will probably need a machine that have
>>> high memory capacity to contain all the calculations inside the memory.
>>>
>>
>> Not really.
>>
>>
>>> I can even go further and prepare a cached recommender that will be
>>> refreshed whenever I really want my recommendations to be updated.
>>>
>>
>> This is correct.
>>
>>
>>> ...
>>> I know the "glue" between the 2 parts is not quite there (as Sean said),
>>> but my question is, how much does the current framework support this kind
>>> of architecture?
>>
>>
>> Yes.
>>
>>
>>> Meaning what kind of actions can I really prepare in advance before
>>> continuing to the final computation? If so, beside of co-occurrence matrix
>>> and matrix factorization what other computations are available to me to do
>>> in a mapreduce manner? Does it mean I will have 2 separate machines for
>>> that case, one as an Hadoop cluster for the offline computation and an
>>> online one that will use the distributed output to do final recommendations
>>> (but then it mean I need to move data between machines, which is not so
>>> idle...)?
>>>
>>
>> Yes.  You will need off-line and on-line machines if you want to have
>> serious guarantees about response times.  And yes, you will need to do some
>> copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
>> you can serve data directly out of the cluster with no copying because you
>> can access files via NFS.
>>
>>
>>>
>>> Also, as I mentioned earlier I might need to store my data in a SQL
>>> machine. If so, what drivers are currently supported? I saw only JDBC &
>>> PostgreSQL, is there anyone else?
>>>
>>
>> You don't need to store your data ONLY on an SQL machine and storing logs
>> in SQL is generally a bad mistake.
>>
>>
>>> As you said in the book, using a SQL machine will probably slow things
>>> down because of the data movement using the drivers... Could you estimate
>>> how much slower is it comparing to using a file?
>>
>>
>> 100x, roughly.  SQL is generally not usable as the source for parallel
>> computations.

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Another question that crossed my mind.
Considering all you said below, I'm not quite sure when I would want to use a SQL machine at all as my data source.
Response-time perspective --> you said it will be much slower than reading from a file.
Memory perspective --> in the end you need to move the data from the DB into memory anyway.

So what are the pros of doing so? When should I consider it?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, March 26, 2012 15:52
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

No.  I meant that I used the same sort of combined offline and online processes that I have recommended to you.  The cluster did the offline part and a web tier did the online part. 

Sent from my iPhone

On Mar 26, 2012, at 1:27 AM, "Razon, Oren" <or...@intel.com> wrote:

> By saying: "At Veoh, we built our models from several billion interactions on a tiny cluster " you meant that you used the distributed code on your cluster as an online recommender?
> From what I've understood so far, I can't rely only on the Hadoop part if I want a truly real time recommender that will modify his recommendations and models per click of the user (because you need to rebuild the data in the HDFS run you batch job, and return an answer)
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com] 
> Sent: Monday, March 26, 2012 00:56
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:
> 
>> ...
>> The system I need should of course give the recommendation itself in no
>> time.
>> ...
> 
> But because I'm talking about very large scales, I guess that I want to
>> push much of my model computation to offline mode (which will be refreshed
>> every X minutes).
>> 
> 
> Actually, you aren't talking about all that large a scale.  At Veoh, we
> built our models from several billion interactions on a tiny cluster.
> 
> 
>> So my options are like that (considering I want to build a real scalable
>> solution):
>> Use the non-distributed \ distributed code to compute some of my model in
>> advance (for example similarity between items \ KNN for each users) --> I
>> guess that for that part, considering I'm offline, the mapreduce code is
>> idle, because of his scalability.
>> 
> 
> Repeating what I said earlier, the offline part produces item-item
> information only.  It does not produce KNN data for any users.  There is no
> reference to a user in the result.
> 
> 
>> Than use a non-distributed online code to calculate the final
>> recommendations based on the pre computed part and do some final
>> computation (weighting the KNN ratings for items my user didn't experienced
>> yet)
>> 
> 
> All that happens here is that item => item* lists are combined.
> 
> 
>> In order to be able to do so, I will probably need a machine that have
>> high memory capacity to contain all the calculations inside the memory.
>> 
> 
> Not really.
> 
> 
>> I can even go further and prepare a cached recommender that will be
>> refreshed whenever I really want my recommendations to be updated.
>> 
> 
> This is correct.
> 
> 
>> ...
>> I know the "glue" between the 2 parts is not quite there (as Sean said),
>> but my question is, how much does the current framework support this kind
>> of architecture?
> 
> 
> Yes.
> 
> 
>> Meaning what kind of actions can I really prepare in advance before
>> continuing to the final computation? If so, beside of co-occurrence matrix
>> and matrix factorization what other computations are available to me to do
>> in a mapreduce manner? Does it mean I will have 2 separate machines for
>> that case, one as an Hadoop cluster for the offline computation and an
>> online one that will use the distributed output to do final recommendations
>> (but then it mean I need to move data between machines, which is not so
>> idle...)?
>> 
> 
> Yes.  You will need off-line and on-line machines if you want to have
> serious guarantees about response times.  And yes, you will need to do some
> copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
> you can serve data directly out of the cluster with no copying because you
> can access files via NFS.
> 
> 
>> 
>> Also, as I mentioned earlier I might need to store my data in a SQL
>> machine. If so, what drivers are currently supported? I saw only JDBC &
>> PostgreSQL, is there anyone else?
>> 
> 
> You don't need to store your data ONLY on an SQL machine and storing logs
> in SQL is generally a bad mistake.
> 
> 
>> As you said in the book, using a SQL machine will probably slow things
>> down because of the data movement using the drivers... Could you estimate
>> how much slower is it comparing to using a file?
> 
> 
> 100x, roughly.  SQL is generally not usable as the source for parallel
> computations.


Re: Mahout beginner questions...

Posted by Ted Dunning <te...@gmail.com>.
No.  I meant that I used the same sort of combined offline and online processes that I have recommended to you.  The cluster did the offline part and a web tier did the online part. 

Sent from my iPhone

On Mar 26, 2012, at 1:27 AM, "Razon, Oren" <or...@intel.com> wrote:

> By saying: "At Veoh, we built our models from several billion interactions on a tiny cluster " you meant that you used the distributed code on your cluster as an online recommender?
> From what I've understood so far, I can't rely only on the Hadoop part if I want a truly real time recommender that will modify his recommendations and models per click of the user (because you need to rebuild the data in the HDFS run you batch job, and return an answer)
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com] 
> Sent: Monday, March 26, 2012 00:56
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:
> 
>> ...
>> The system I need should of course give the recommendation itself in no
>> time.
>> ...
> 
> But because I'm talking about very large scales, I guess that I want to
>> push much of my model computation to offline mode (which will be refreshed
>> every X minutes).
>> 
> 
> Actually, you aren't talking about all that large a scale.  At Veoh, we
> built our models from several billion interactions on a tiny cluster.
> 
> 
>> So my options are like that (considering I want to build a real scalable
>> solution):
>> Use the non-distributed \ distributed code to compute some of my model in
>> advance (for example similarity between items \ KNN for each users) --> I
>> guess that for that part, considering I'm offline, the mapreduce code is
>> idle, because of his scalability.
>> 
> 
> Repeating what I said earlier, the offline part produces item-item
> information only.  It does not produce KNN data for any users.  There is no
> reference to a user in the result.
> 
> 
>> Than use a non-distributed online code to calculate the final
>> recommendations based on the pre computed part and do some final
>> computation (weighting the KNN ratings for items my user didn't experienced
>> yet)
>> 
> 
> All that happens here is that item => item* lists are combined.
> 
> 
>> In order to be able to do so, I will probably need a machine that have
>> high memory capacity to contain all the calculations inside the memory.
>> 
> 
> Not really.
> 
> 
>> I can even go further and prepare a cached recommender that will be
>> refreshed whenever I really want my recommendations to be updated.
>> 
> 
> This is correct.
> 
> 
>> ...
>> I know the "glue" between the 2 parts is not quite there (as Sean said),
>> but my question is, how much does the current framework support this kind
>> of architecture?
> 
> 
> Yes.
> 
> 
>> Meaning what kind of actions can I really prepare in advance before
>> continuing to the final computation? If so, beside of co-occurrence matrix
>> and matrix factorization what other computations are available to me to do
>> in a mapreduce manner? Does it mean I will have 2 separate machines for
>> that case, one as an Hadoop cluster for the offline computation and an
>> online one that will use the distributed output to do final recommendations
>> (but then it mean I need to move data between machines, which is not so
>> idle...)?
>> 
> 
> Yes.  You will need off-line and on-line machines if you want to have
> serious guarantees about response times.  And yes, you will need to do some
> copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
> you can serve data directly out of the cluster with no copying because you
> can access files via NFS.
> 
> 
>> 
>> Also, as I mentioned earlier I might need to store my data in a SQL
>> machine. If so, what drivers are currently supported? I saw only JDBC &
>> PostgreSQL, is there anyone else?
>> 
> 
> You don't need to store your data ONLY on an SQL machine and storing logs
> in SQL is generally a bad mistake.
> 
> 
>> As you said in the book, using a SQL machine will probably slow things
>> down because of the data movement using the drivers... Could you estimate
>> how much slower is it comparing to using a file?
> 
> 
> 100x, roughly.  SQL is generally not usable as the source for parallel
> computations.

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
By saying: "At Veoh, we built our models from several billion interactions on a tiny cluster " you meant that you used the distributed code on your cluster as an online recommender?
From what I've understood so far, I can't rely only on the Hadoop part if I want a truly real time recommender that will modify his recommendations and models per click of the user (because you need to rebuild the data in the HDFS run you batch job, and return an answer)

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, March 26, 2012 00:56
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:

> ...
> The system I need should of course give the recommendation itself in no
> time.
> ...

But because I'm talking about very large scales, I guess that I want to
> push much of my model computation to offline mode (which will be refreshed
> every X minutes).
>

Actually, you aren't talking about all that large a scale.  At Veoh, we
built our models from several billion interactions on a tiny cluster.


> So my options are like that (considering I want to build a real scalable
> solution):
> Use the non-distributed \ distributed code to compute some of my model in
> advance (for example similarity between items \ KNN for each users) --> I
> guess that for that part, considering I'm offline, the mapreduce code is
> idle, because of his scalability.
>

Repeating what I said earlier, the offline part produces item-item
information only.  It does not produce KNN data for any users.  There is no
reference to a user in the result.


> Than use a non-distributed online code to calculate the final
> recommendations based on the pre computed part and do some final
> computation (weighting the KNN ratings for items my user didn't experienced
> yet)
>

All that happens here is that item => item* lists are combined.


> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.
>

Not really.


> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.
>

This is correct.


> ...
> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture?


Yes.


> Meaning what kind of actions can I really prepare in advance before
> continuing to the final computation? If so, beside of co-occurrence matrix
> and matrix factorization what other computations are available to me to do
> in a mapreduce manner? Does it mean I will have 2 separate machines for
> that case, one as an Hadoop cluster for the offline computation and an
> online one that will use the distributed output to do final recommendations
> (but then it mean I need to move data between machines, which is not so
> idle...)?
>

Yes.  You will need off-line and on-line machines if you want to have
serious guarantees about response times.  And yes, you will need to do some
copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
you can serve data directly out of the cluster with no copying because you
can access files via NFS.


>
> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?
>

You don't need to store your data ONLY on an SQL machine and storing logs
in SQL is generally a bad mistake.


> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file?


100x, roughly.  SQL is generally not usable as the source for parallel
computations.

Re: Mahout beginner questions...

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Mar 25, 2012 at 4:02 PM, Razon, Oren <or...@intel.com> wrote:

>
> So let's continue with your example... I will do I 2 I similarity matrix
> on Hadoop and then will do online recommendation based on it and the user
> ranked items.
>

Yes.


> So where does the online part will sit at? Is it a good design to
> implement it on the same machine that Hadoop run on (name node for
> example)? Or you suggest to build 2 different applications on 2 different
> machines (one of them is the cluster) and transfer the data between them?
>

I recommend that you separate the off-line computation away from the
on-line component.  The reason is that the off-line computation can put a
severe strain on the resources of the machines it runs on.  You can isolate
this load somewhat, but it is better to simply use different machines
unless you are really absolutely desperate for hardware.  Even then, it is
probably more cost effective to drive your off-line resources as hard as
possible and simply use a relatively small machine for the on-line
component.

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Thanks Ted,
So let's continue with your example... I will build an item-to-item similarity matrix on Hadoop and then do online recommendation based on it and the items the user has rated.
So where would the online part sit? Is it a good design to implement it on the same machine that Hadoop runs on (the name node, for example)? Or do you suggest building 2 different applications on 2 different machines (one of them being the cluster) and transferring the data between them?


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, March 26, 2012 00:56
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:

> ...
> The system I need should of course give the recommendation itself in no
> time.
> ...

But because I'm talking about very large scales, I guess that I want to
> push much of my model computation to offline mode (which will be refreshed
> every X minutes).
>

Actually, you aren't talking about all that large a scale.  At Veoh, we
built our models from several billion interactions on a tiny cluster.


> So my options are like that (considering I want to build a real scalable
> solution):
> Use the non-distributed \ distributed code to compute some of my model in
> advance (for example similarity between items \ KNN for each users) --> I
> guess that for that part, considering I'm offline, the mapreduce code is
> idle, because of his scalability.
>

Repeating what I said earlier, the offline part produces item-item
information only.  It does not produce KNN data for any users.  There is no
reference to a user in the result.


> Than use a non-distributed online code to calculate the final
> recommendations based on the pre computed part and do some final
> computation (weighting the KNN ratings for items my user didn't experienced
> yet)
>

All that happens here is that item => item* lists are combined.


> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.
>

Not really.


> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.
>

This is correct.


> ...
> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture?


Yes.


> Meaning what kind of actions can I really prepare in advance before
> continuing to the final computation? If so, beside of co-occurrence matrix
> and matrix factorization what other computations are available to me to do
> in a mapreduce manner? Does it mean I will have 2 separate machines for
> that case, one as an Hadoop cluster for the offline computation and an
> online one that will use the distributed output to do final recommendations
> (but then it mean I need to move data between machines, which is not so
> idle...)?
>

Yes.  You will need off-line and on-line machines if you want to have
serious guarantees about response times.  And yes, you will need to do some
copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
you can serve data directly out of the cluster with no copying because you
can access files via NFS.


>
> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?
>

You don't need to store your data ONLY on an SQL machine and storing logs
in SQL is generally a bad mistake.


> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file?


100x, roughly.  SQL is generally not usable as the source for parallel
computations.

Re: Mahout beginner questions...

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <or...@intel.com> wrote:

> ...
> The system I need should of course give the recommendation itself in no
> time.
> ...

But because I'm talking about very large scales, I guess that I want to
> push much of my model computation to offline mode (which will be refreshed
> every X minutes).
>

Actually, you aren't talking about all that large a scale.  At Veoh, we
built our models from several billion interactions on a tiny cluster.


> So my options are like that (considering I want to build a real scalable
> solution):
> Use the non-distributed \ distributed code to compute some of my model in
> advance (for example similarity between items \ KNN for each users) --> I
> guess that for that part, considering I'm offline, the mapreduce code is
> idle, because of his scalability.
>

Repeating what I said earlier, the offline part produces item-item
information only.  It does not produce KNN data for any users.  There is no
reference to a user in the result.


> Than use a non-distributed online code to calculate the final
> recommendations based on the pre computed part and do some final
> computation (weighting the KNN ratings for items my user didn't experienced
> yet)
>

All that happens here is that item => item* lists are combined.


> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.
>

Not really.


> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.
>

This is correct.
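
A minimal sketch of that cached-recommender idea, using CachingRecommender from Mahout's non-distributed Taste API (the 30-minute refresh interval and the delegate recommender built elsewhere are assumptions, not anything prescribed by Mahout):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class CachedRecommenderSketch {

  public static CachingRecommender wrap(Recommender delegate) throws TasteException {
    // CachingRecommender memoizes recommend() results per user until refreshed.
    final CachingRecommender cached = new CachingRecommender(delegate);

    // Clear the caches (and refresh the underlying model/data) every 30 minutes.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        cached.refresh(null);
      }
    }, 30, 30, TimeUnit.MINUTES);

    return cached;
  }
}

Requests then go through the returned recommender's recommend(userID, howMany) as usual; between refreshes, repeat requests for the same user are answered from the cache.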


> ...
> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture?


Yes.


> Meaning what kind of actions can I really prepare in advance before
> continuing to the final computation? If so, beside of co-occurrence matrix
> and matrix factorization what other computations are available to me to do
> in a mapreduce manner? Does it mean I will have 2 separate machines for
> that case, one as an Hadoop cluster for the offline computation and an
> online one that will use the distributed output to do final recommendations
> (but then it mean I need to move data between machines, which is not so
> idle...)?
>

Yes.  You will need off-line and on-line machines if you want to have
serious guarantees about response times.  And yes, you will need to do some
copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
you can serve data directly out of the cluster with no copying because you
can access files via NFS.
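
As a concrete illustration of the "copying" step on stock Hadoop, here is a small sketch that pulls the offline job's output from HDFS onto the serving machine with Hadoop's FileSystem API (the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PullModelFromHdfs {
  public static void main(String[] args) throws Exception {
    // Assumes core-site.xml on the classpath points at the cluster's namenode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path modelOnHdfs = new Path("/user/recsys/item-item-similarities");
    Path localCopy = new Path("/var/recsys/model/item-item-similarities");

    // Copy the whole output directory; the online component then reads the
    // local files without touching the cluster at request time.
    fs.copyToLocalFile(modelOnHdfs, localCopy);
    fs.close();
  }
}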


>
> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?
>

You don't need to store your data ONLY on an SQL machine and storing logs
in SQL is generally a bad mistake.


> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file?


100x, roughly.  SQL is generally not usable as the source for parallel
computations.

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
On Sun, Mar 25, 2012 at 11:36 PM, Razon, Oren <or...@intel.com> wrote:

> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.
> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.
> Am I right here?
>

Maybe -- the memory requirements are lower than if one machine is doing
everything but yes I generally agree that the front-end often has to keep a
load of stuff in memory to do what it does quickly.


>
> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture? Meaning what kind of actions can I really prepare in
> advance before continuing to the final computation? If so, beside of
> co-occurrence matrix and matrix factorization what other computations are
> available to me to do in a mapreduce manner? Does it mean I will have 2
> separate machines for that case, one as an Hadoop cluster for the offline
> computation and an online one that will use the distributed output to do
> final recommendations (but then it mean I need to move data between
> machines, which is not so idle...)?
>

Item similarity (based on co-occurrence or otherwise) and
matrix-factorization stuff is more or less exactly what's available. It's
easy to integrate the output of the distributed item-item similarity
computation. That plugs right in to the non-distributed item-based
recommender. Well, you have to write some code to read the result off HDFS
and construct some objects. And you probably have to do some pruning. Etc.
It's the last 20%, the wiring and mortar that isn't necessarily handed to
you. That's kind of open-ended since how it's glued together is something
you may need or want to control.
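
A hedged sketch of that glue code follows. The file locations, the tab-separated itemA/itemB/score layout and the pruning threshold are assumptions about how the Hadoop output has been exported, not something Mahout hands you; the Taste classes themselves (GenericItemSimilarity, GenericItemBasedRecommender, FileDataModel) are the real non-distributed API:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class PrecomputedItemItemRecommender {

  public static void main(String[] args) throws Exception {
    // 1. Load the precomputed item-item similarities (one "itemA<TAB>itemB<TAB>score"
    //    line per pair), pruning weak pairs so memory stays manageable.
    List<GenericItemSimilarity.ItemItemSimilarity> pairs =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    BufferedReader in = new BufferedReader(new FileReader("/var/recsys/model/item-item.tsv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      double score = Double.parseDouble(parts[2]);
      if (score > 0.1) { // arbitrary pruning threshold
        pairs.add(new GenericItemSimilarity.ItemItemSimilarity(
            Long.parseLong(parts[0]), Long.parseLong(parts[1]), score));
      }
    }
    in.close();

    // 2. Plug the similarities into a plain, non-distributed item-based recommender.
    DataModel model = new FileDataModel(new File("/var/recsys/data/ratings.csv"));
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new GenericItemSimilarity(pairs));

    // 3. Recommend online, per request.
    for (RecommendedItem item : recommender.recommend(123L, 10)) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}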

Your Hadoop cluster is definitely not the same sort of beast as a front-end
server. Logically, quite different, and in practice almost surely separate
machines. I suppose you could run both on one machine for testing or
experiments.


> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?
> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file? Again I might do the
> reading from the DB offline so I'm not too afraid from losing some of my
> speed...
>

For you, your question is what can be used as an input to Hadoop. I think
there are InputFormats for generic SQL databases, yes, but that's a
question for Hadoop not Mahout. A SQL database is not the best place to
store and read your input for Hadoop. It's overkill. HDFS is the right sort
of place to have this data.

There is no question of reading from a DB "online" -- it's way too slow.
The 'drivers' you see are for reading info from a DB into memory mostly.
And they are for non-distributed stuff. It's such simple SQL that I think
it will work on just about any DB, with perhaps a tiny tweak here or there.
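
For completeness, here is a sketch of using a database purely as the source the model is loaded from (connection details are placeholders; the default taste_preferences table layout is what MySQLJDBCDataModel expects by default, but check the javadoc for your Mahout version):

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class JdbcBackedModelSketch {

  public static DataModel load() throws Exception {
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("db-host");
    dataSource.setDatabaseName("recsys");
    dataSource.setUser("mahout");
    dataSource.setPassword("secret");

    // Expects the default taste_preferences(user_id, item_id, preference) table.
    MySQLJDBCDataModel jdbcModel = new MySQLJDBCDataModel(dataSource);

    // Reads the whole table into memory once; refresh() re-reads it later.
    return new ReloadFromJDBCDataModel(jdbcModel);
  }
}

ReloadFromJDBCDataModel pulls everything into memory up front, so after start-up no per-request queries hit the database.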



>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Sunday, March 25, 2012 21:35
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> Not really.  See my previous posting.
>
> The best way to get fast recommendations is to use an item-based
> recommender.  Pre-computing recommendations for all users is not usually a
> win because you wind up doing a lot of wasted work and you still don't have
> anything for new users who appear between refreshes.  If you build up a
> service to handle the new users, you might as well just serve all users
> from that service so that you get up to date recommendations for everyone.
>
> There IS a large off-line computation.  But that doesn't produce
> recommendations for USER's.  It typically produces recommendations for
> ITEM's.  Then those item-item recommendations are combined to produce
> recommendations for users.
>
> On Sun, Mar 25, 2012 at 12:28 PM, Razon, Oren <or...@intel.com>
> wrote:
>
> > Correct me if I'm wrong but a good way to boost up speed could be to use
> > caching recommender, meaning computing the recommendations in advanced
> > (refresh it every X min\hours) and always recommend using the most
> updated
> > recommendations, right?!
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Sunday, March 25, 2012 21:25
> > To: user@mahout.apache.org
> > Subject: Re: Mahout beginner questions...
> >
> > It is memory. You will need a pretty large heap to put 100M data in
> memory
> > -- probably 4GB, if not a little more (so the machine would need 8GB+
> RAM).
> > You can go bigger if you have more memory but that size seems about the
> > biggest to reasonably assume people have.
> >
> > Of course more data slows things down and past about 10M data points you
> > need to tune things to sample data rather than try every possibility.
> This
> > is most of what CandidateItemStrategy has to do with. It is relatively
> easy
> > to tune this though so speed doesn't have to ben an issue.
> >
> > Again you can go bigger and tune it to down-sample more; somehow I stil
> > believe that 100M is a crude but useful rule of thumb, as to the point
> > beyond which it's just hard to get good speed and quality.
> >
> > Sean
> >
> > On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com>
> wrote:
> >
> > > Thanks for the detailed answer Sean.
> > > I want to understand more clearly the non-distributed code limitations.
> > > I saw that you advise that for more than 100,000,000 ratings the
> > > non-distributed engine won't do the job.
> > > The question is why? Is it memory issue (and then if I will have a
> bigger
> > > machine, meaning I could scale up), or is it because of the
> > recommendation
> > > time it takes?
> > >
> > >
>

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Ok, so that was a good clarification, which leads me to new questions :)

The system I need should of course return the recommendation itself in no time.
And as Sean said, it needs to have some real-time component so that recommendations can change after the user interacts with the application.
But because I'm talking about very large scales, I guess that I want to push much of my model computation to offline mode (to be refreshed every X minutes).

So my options look like this (considering I want to build a truly scalable solution):
Use the non-distributed \ distributed code to compute some of my model in advance (for example, similarity between items \ KNN for each user) --> I guess that for this part, since I'm offline, the mapreduce code is ideal because of its scalability.
Then use non-distributed online code to calculate the final recommendations based on the pre-computed part and do some final computation (weighting the KNN ratings for items my user hasn't experienced yet).
In order to do so, I will probably need a machine with high memory capacity to hold all the calculations in memory.
I can even go further and prepare a cached recommender that is refreshed whenever I really want my recommendations to be updated.
Am I right here?

I know the "glue" between the 2 parts is not quite there (as Sean said), but my question is, how much does the current framework support this kind of architecture? Meaning what kind of actions can I really prepare in advance before continuing to the final computation? If so, beside of co-occurrence matrix and matrix factorization what other computations are available to me to do in a mapreduce manner? Does it mean I will have 2 separate machines for that case, one as an Hadoop cluster for the offline computation and an online one that will use the distributed output to do final recommendations (but then it mean I need to move data between machines, which is not so idle...)?

Also, as I mentioned earlier, I might need to store my data in a SQL machine. If so, what drivers are currently supported? I saw only JDBC & PostgreSQL; are there any others?
As you said in the book, using a SQL machine will probably slow things down because of the data movement through the drivers... Could you estimate how much slower it is compared to using a file? Again, I might do the reading from the DB offline, so I'm not too afraid of losing some speed...


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Sunday, March 25, 2012 21:35
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

Not really.  See my previous posting.

The best way to get fast recommendations is to use an item-based
recommender.  Pre-computing recommendations for all users is not usually a
win because you wind up doing a lot of wasted work and you still don't have
anything for new users who appear between refreshes.  If you build up a
service to handle the new users, you might as well just serve all users
from that service so that you get up to date recommendations for everyone.

There IS a large off-line computation.  But that doesn't produce
recommendations for USER's.  It typically produces recommendations for
ITEM's.  Then those item-item recommendations are combined to produce
recommendations for users.

On Sun, Mar 25, 2012 at 12:28 PM, Razon, Oren <or...@intel.com> wrote:

> Correct me if I'm wrong but a good way to boost up speed could be to use
> caching recommender, meaning computing the recommendations in advanced
> (refresh it every X min\hours) and always recommend using the most updated
> recommendations, right?!
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Sunday, March 25, 2012 21:25
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to ben an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I stil
> believe that 100M is a crude but useful rule of thumb, as to the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it memory issue (and then if I will have a bigger
> > machine, meaning I could scale up), or is it because of the
> recommendation
> > time it takes?
> >
> >

Re: Mahout beginner questions...

Posted by Ted Dunning <te...@gmail.com>.
Not really.  See my previous posting.

The best way to get fast recommendations is to use an item-based
recommender.  Pre-computing recommendations for all users is not usually a
win because you wind up doing a lot of wasted work and you still don't have
anything for new users who appear between refreshes.  If you build up a
service to handle the new users, you might as well just serve all users
from that service so that you get up to date recommendations for everyone.

There IS a large off-line computation.  But that doesn't produce
recommendations for USERS.  It typically produces recommendations for
ITEMS.  Then those item-item recommendations are combined to produce
recommendations for users.

On Sun, Mar 25, 2012 at 12:28 PM, Razon, Oren <or...@intel.com> wrote:

> Correct me if I'm wrong but a good way to boost up speed could be to use
> caching recommender, meaning computing the recommendations in advanced
> (refresh it every X min\hours) and always recommend using the most updated
> recommendations, right?!
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Sunday, March 25, 2012 21:25
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to ben an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I stil
> believe that 100M is a crude but useful rule of thumb, as to the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it memory issue (and then if I will have a bigger
> > machine, meaning I could scale up), or is it because of the
> recommendation
> > time it takes?
> >
> >
>

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Correct me if I'm wrong, but a good way to boost speed could be to use a caching recommender, meaning computing the recommendations in advance (refreshing them every X min\hours) and always recommending from the most recently updated results, right?!

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Sunday, March 25, 2012 21:25
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

It is memory. You will need a pretty large heap to put 100M data in memory
-- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
You can go bigger if you have more memory but that size seems about the
biggest to reasonably assume people have.

Of course more data slows things down and past about 10M data points you
need to tune things to sample data rather than try every possibility. This
is most of what CandidateItemStrategy has to do with. It is relatively easy
to tune this though so speed doesn't have to ben an issue.

Again you can go bigger and tune it to down-sample more; somehow I stil
believe that 100M is a crude but useful rule of thumb, as to the point
beyond which it's just hard to get good speed and quality.

Sean

On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:

> Thanks for the detailed answer Sean.
> I want to understand more clearly the non-distributed code limitations.
> I saw that you advise that for more than 100,000,000 ratings the
> non-distributed engine won't do the job.
> The question is why? Is it memory issue (and then if I will have a bigger
> machine, meaning I could scale up), or is it because of the recommendation
> time it takes?
>
>

Re: Mahout beginner questions...

Posted by Ted Dunning <te...@gmail.com>.
It sounds like the original poster isn't clear about the division between
off-line and on-line work.

Almost all production recommendation systems have a large off-line
component which analyzes logs of behavior and produces a recommendation
model.  This model typically consists of item-item relationships stored in
a form that is usable by the on-line component of the system.  This part is
preparation for recommendation, but is not itself recommendation.  This
off-line component can run sequentially or in parallel using map-reduce.
 In my experience, with decent down-sampling of excessively active users
and excessively popular items, it isn't unreasonable to reach 100M
non-zeros in the user x item history in the off-line component.
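
As one concrete example of such an off-line component, the sketch below launches Mahout's ItemSimilarityJob, which turns a userID,itemID,preference file on HDFS into an item-item model. The paths are placeholders and the exact option names and similarity values vary a little between Mahout versions (these are roughly the 0.6-era ones), so treat it as a sketch rather than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class OfflineSimilarityRun {
  public static void main(String[] args) throws Exception {
    String[] jobArgs = {
        "--input", "/user/recsys/ratings",                   // userID,itemID,pref lines on HDFS
        "--output", "/user/recsys/item-item-similarities",   // item-item pairs with scores
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD"
    };
    // ItemSimilarityJob is a Hadoop Tool; ToolRunner submits it to the cluster.
    // Pruning options (max similarities per item, etc.) can be added as needed.
    int exitCode = ToolRunner.run(new Configuration(), new ItemSimilarityJob(), jobArgs);
    System.exit(exitCode);
  }
}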

The actual recommendations are produced using the on-line component.  This
component reads in the recommendation model, possibly all at once, possibly
on demand and possibly as the model is changed.  The model may be read from
a database or from flat files or many other sources.  To make a
recommendation, a user history or user id is presented to the
recommendation system.  If an id is presented, it is presumed that the
history is available somewhere or that the recommendations have been
pre-computed for that user.  In any case, the history is combined with the
recommendation model to produce a recommendation list for the user of the
moment.
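
The on-line combination step can be illustrated with plain Java (deliberately not Mahout API): the user's history is matched against the precomputed item => item* lists, and candidate items are scored by summing their similarities to the items already seen:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OnlineCombiner {

  /** item -> (similar item -> similarity), produced by the off-line job. */
  private final Map<Long, Map<Long, Double>> itemModel;

  public OnlineCombiner(Map<Long, Map<Long, Double>> itemModel) {
    this.itemModel = itemModel;
  }

  /** Score candidates by summing their similarity to items in the user's history. */
  public List<Long> recommend(Set<Long> userHistory, int howMany) {
    Map<Long, Double> scores = new HashMap<Long, Double>();
    for (Long seen : userHistory) {
      Map<Long, Double> neighbors = itemModel.get(seen);
      if (neighbors == null) {
        continue;
      }
      for (Map.Entry<Long, Double> e : neighbors.entrySet()) {
        if (!userHistory.contains(e.getKey())) {   // don't re-recommend seen items
          Double current = scores.get(e.getKey());
          scores.put(e.getKey(), (current == null ? 0.0 : current) + e.getValue());
        }
      }
    }
    List<Map.Entry<Long, Double>> ranked =
        new ArrayList<Map.Entry<Long, Double>>(scores.entrySet());
    Collections.sort(ranked, new Comparator<Map.Entry<Long, Double>>() {
      @Override
      public int compare(Map.Entry<Long, Double> a, Map.Entry<Long, Double> b) {
        return Double.compare(b.getValue(), a.getValue());
      }
    });
    List<Long> result = new ArrayList<Long>();
    for (int i = 0; i < Math.min(howMany, ranked.size()); i++) {
      result.add(ranked.get(i).getKey());
    }
    return result;
  }
}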


On Sun, Mar 25, 2012 at 12:25 PM, Sean Owen <sr...@gmail.com> wrote:

> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to ben an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I stil
> believe that 100M is a crude but useful rule of thumb, as to the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it memory issue (and then if I will have a bigger
> > machine, meaning I could scale up), or is it because of the
> recommendation
> > time it takes?
> >
> >
>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
It is memory. You will need a pretty large heap to put 100M data in memory
-- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
You can go bigger if you have more memory but that size seems about the
biggest to reasonably assume people have.

Of course more data slows things down and past about 10M data points you
need to tune things to sample data rather than try every possibility. This
is most of what CandidateItemsStrategy has to do with. It is relatively easy
to tune this though, so speed doesn't have to be an issue.

Again you can go bigger and tune it to down-sample more; somehow I still
believe that 100M is a crude but useful rule of thumb, as to the point
beyond which it's just hard to get good speed and quality.

Sean
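
For a rough sense of where those figures come from: 4 GB of heap for about 100M preferences works out to roughly 40 bytes per (userID, itemID, value) triple once the IDs, the float value and collection overhead are counted. That per-preference figure is an inference from the numbers above, not a measured Mahout constant:

public class HeapEstimate {
  public static void main(String[] args) {
    long numPreferences = 100000000L;  // ratings held in memory
    long bytesPerPreference = 40L;     // assumed: two IDs + a float + overhead
    double gigabytes = numPreferences * bytesPerPreference / (1024.0 * 1024.0 * 1024.0);
    System.out.printf("~%.1f GB of heap for %d preferences%n", gigabytes, numPreferences);
  }
}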

On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <or...@intel.com> wrote:

> Thanks for the detailed answer Sean.
> I want to understand more clearly the non-distributed code limitations.
> I saw that you advise that for more than 100,000,000 ratings the
> non-distributed engine won't do the job.
> The question is why? Is it memory issue (and then if I will have a bigger
> machine, meaning I could scale up), or is it because of the recommendation
> time it takes?
>
>

RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Thanks for the detailed answer Sean.
I want to understand more clearly the non-distributed code limitations.
I saw that you advise that for more than 100,000,000 ratings the non-distributed engine won't do the job.
The question is why? Is it a memory issue (in which case, with a bigger machine, I could scale up), or is it because of the recommendation time it takes?

Thanks,
Oren

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Thursday, March 22, 2012 17:57
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

A distributed and non-distributed recommender are really quite
separate. They perform the same task in quite different ways. I don't
think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call
the matrix-factorization-based and clustering-based approaches
"model-based" in the sense that they assume the existence of some
underlying structure and discover it. There's no Bayesian-style
approaches in the code.

They scale in different ways; I am not sure they are unilaterally a
solution to scale, no. I do agree in general that these have good
scaling properties for real-world use cases, like the
matrix-factorization approaches.


A "real" scalable architecture would have a real-time component and a
big distributed computation component. Mahout has elements of both and
can be the basis for piecing that together, but it's not a question of
strapping together the distributed and non-distributed implementation.
It's a bit harder than that.


I am actually quite close to being ready to show off something in this
area -- I have been working separately on a more complete rec system
that has both the real-time element but integrated directly with a
distributed element to handle the large-scale computation. I think
this is typical of big data architectures. You have (at least) a
real-time distributed "Serving Layer" and a big distributed batch
"Computation Layer". More on this in about... 2 weeks.


On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <or...@intel.com> wrote:
> Hi Sean,
> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
> Just to clear my second question...
> I want to build a recommender framework that will support different use cases.  So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>
> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Thursday, March 22, 2012 13:51
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> 1. These are the JDBC-related classes. For example see
> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>
> 2. The distributed and non-distributed code are quite separate. At
> this scale I don't think you can use the non-distributed code to a
> meaningful degree. For example you could pre-compute item-item
> similarities over this data and use a non-distributed item-based
> recommender but you probably have enough items that this will strain
> memory. You would probably be looking at pre-computing recommendations
> in batch.
>
> 3. I don't think Netezza will help much here. It's still not fast
> enough at this scale to use with a real-time recommender (nothing is).
> If it's just a place you store data to feed into Hadoop it's not
> adding value. All the JDBC-related integrations ultimately load data
> into memory and that's out of the question with 500M data points.
>
> I'd also suggest you have a think about whether you "really" have 500M
> data points. Often you can know that most of the data is noise or not
> useful, and can get useful recommendations on a fraction of the data
> (maybe 5M). That makes a lot of things easier.
>
> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <or...@intel.com> wrote:
>> Hi,
>> As a data mining developer who need to build a recommender engine POC (Proof Of Concept) to support several future use cases, I've found Mahout framework as an appealing place to start with. But as I'm new to Mahout and Hadoop in general I've a couple of questions...
>>
>> 1.      In "Mahout in action" under section 3.2.5 (Database-based data) it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documents and inside the code itself, but didn't found anywhere a reference to what are those calculations that are pushed into the DB. Could you please explain what could be done inside the DB?
>> 2.      My future use will include use cases with small-medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases that include huge amounts of data (over 500,000,000 ratings). From my understanding this is where the distributed code should be come handy. My question here is, because I will need to use both distributed & non-distributed how could I build a good design here?
>>      Should I build two different solutions on different machines? Could I do part of the job distributed (for example similarity calculation) and the output will be used for the non-distributed code? Is it a BKM? Also if I deploy entire mahout code on an Hadoop environment, what does it mean for the non-distributed code, will it all run as a different java process on the name node?
>> 3.      As for now, beside of the Hadoop cluster we are building we have some strong SQL machines (Netezza appliance) that can handle big (structure) data and include good integration with 3'rd party analytics providers or developing on java platform but don't include such reach recommender framework like Mahout. I'm trying to understand how could I utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases. Thought maybe to move data into Netezza, do there all data manipulation and transformation, and in the end to prepare a file that contain the classic data model structure needed for Mahout. But could you think on better solution \ architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout using JDBC when needed? I will be glad to hear your ideas :)
>>
>> Thanks,
>> Oren
>>
>>
>>
>>
>>
>>
>>
>>

Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
A distributed and non-distributed recommender are really quite
separate. They perform the same task in quite different ways. I don't
think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call
the matrix-factorization-based and clustering-based approaches
"model-based" in the sense that they assume the existence of some
underlying structure and discover it. There's no Bayesian-style
approaches in the code.

They scale in different ways; I am not sure they are unilaterally a
solution to scale, no. I do agree in general that these have good
scaling properties for real-world use cases, like the
matrix-factorization approaches.


A "real" scalable architecture would have a real-time component and a
big distributed computation component. Mahout has elements of both and
can be the basis for piecing that together, but it's not a question of
strapping together the distributed and non-distributed implementation.
It's a bit harder than that.


I am actually quite close to being ready to show off something in this
area -- I have been working separately on a more complete rec system
that has both the real-time element but integrated directly with a
distributed element to handle the large-scale computation. I think
this is typical of big data architectures. You have (at least) a
real-time distributed "Serving Layer" and a big distributed batch
"Computation Layer". More on this in about... 2 weeks.



RE: Mahout beginner questions...

Posted by "Razon, Oren" <or...@intel.com>.
Hi Sean,
Thanks for your fast response; I really appreciate the quality of your book ("Mahout in Action") and the support you give in forums like this one.
Just to clarify my second question...
I want to build a recommender framework that will support different use cases, so my intention is to have both the distributed and the non-distributed solutions in one framework. The question is: is it good design to put them both on the same machine (one of the machines in the Hadoop cluster)?

BTW, another question: it seems that a good solution to recommender scalability would be to use model-based recommenders.
That said, I wonder why there are so few model-based recommenders, especially considering that Mahout already contains several data mining models.



Re: Mahout beginner questions...

Posted by Sean Owen <sr...@gmail.com>.
1. These are the JDBC-related classes. For example see
MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
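
For illustration, wiring up MySQLJDBCDataModel might look roughly like
this, assuming the mahout-integration module and MySQL Connector/J are
on the classpath; the connection details, table name and column names
below are placeholders:

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class JdbcDataModelSketch {

  public static void main(String[] args) throws Exception {
    // Connection details are placeholders.
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");
    dataSource.setDatabaseName("recommender");
    dataSource.setUser("mahout");
    dataSource.setPassword("secret");

    // Preference table with user, item, preference and timestamp columns.
    // Simple aggregate queries (e.g. counts) can then run in the database
    // instead of in the JVM. In practice you would wrap the DataSource in a
    // connection pool.
    DataModel model = new MySQLJDBCDataModel(
        dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");

    System.out.println("users: " + model.getNumUsers() + ", items: " + model.getNumItems());
  }
}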

2. The distributed and non-distributed code are quite separate. At
this scale I don't think you can use the non-distributed code to a
meaningful degree. For example you could pre-compute item-item
similarities over this data and use a non-distributed item-based
recommender but you probably have enough items that this will strain
memory. You would probably be looking at pre-computing recommendations
in batch.
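
A sketch of that hybrid, assuming the item-item similarities have
already been computed offline (for example by the distributed
ItemSimilarityJob) and merged into a plain itemID1,itemID2,similarity
text file; the file names are placeholders:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PrecomputedSimilaritySketch {

  public static void main(String[] args) throws Exception {
    // Ratings for the (much smaller) slice of data you actually serve from.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // item-similarities.csv: itemID1,itemID2,similarity -- assumed to be the
    // merged text output of a distributed similarity computation.
    ItemSimilarity similarity = new FileItemSimilarity(new File("item-similarities.csv"));

    // Non-distributed item-based recommender running on the precomputed similarities.
    GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);

    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    System.out.println(recs);
  }
}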

3. I don't think Netezza will help much here. It's still not fast
enough at this scale to use with a real-time recommender (nothing is).
If it's just a place you store data to feed into Hadoop it's not
adding value. All the JDBC-related integrations ultimately load data
into memory and that's out of the question with 500M data points.

I'd also suggest you have a think about whether you "really" have 500M
data points. Often you can know that most of the data is noise or not
useful, and can get useful recommendations on a fraction of the data
(maybe 5M). That makes a lot of things easier.
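
As a rough illustration of that kind of pruning, the following
plain-Java sketch drops users with fewer than a minimum number of
ratings from a userID,itemID,rating CSV; the threshold and file names
are arbitrary, and the right filter depends entirely on your data:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Two passes over a userID,itemID,rating CSV: count ratings per user, then
// keep only lines belonging to users with at least MIN_RATINGS ratings.
public class PruneSparseUsers {

  private static final int MIN_RATINGS = 20; // arbitrary -- tune on your data

  public static void main(String[] args) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    // Pass 1: count ratings per user.
    BufferedReader in = new BufferedReader(new FileReader("ratings.csv"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        int comma = line.indexOf(',');
        if (comma <= 0) {
          continue; // skip blank or malformed lines
        }
        String user = line.substring(0, comma);
        Integer c = counts.get(user);
        counts.put(user, c == null ? 1 : c + 1);
      }
    } finally {
      in.close();
    }

    // Pass 2: write only the lines from users above the threshold.
    BufferedReader in2 = new BufferedReader(new FileReader("ratings.csv"));
    PrintWriter out = new PrintWriter(new FileWriter("ratings-pruned.csv"));
    try {
      String line;
      while ((line = in2.readLine()) != null) {
        int comma = line.indexOf(',');
        if (comma <= 0) {
          continue;
        }
        String user = line.substring(0, comma);
        if (counts.get(user) >= MIN_RATINGS) {
          out.println(line);
        }
      }
    } finally {
      out.close();
      in2.close();
    }
  }
}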
