Posted to user@mahout.apache.org by Vinod <pi...@gmail.com> on 2011/12/08 13:07:28 UTC

Persisting trained models in Mahout

Hi,

This is my first day of experimentation with Mahout. I am following the
"Mahout in Action" book and, looking at the sample code provided, it seems
that models (for example, a recommender) need to be trained at the start of
the program (start/restart). The Recommender interface extends Refreshable,
which doesn't extend Serializable. So I am wondering if Mahout provides an
alternate mechanism to persist trained models (a recommender instance in
this case).

Apologies if this is a very silly question.

Thanks & regards,
Vinod

Re: Persisting trained models in Mahout

Posted by Vinod <pi...@gmail.com>.
Hi Ted,

Sure. I'll continue reading and try examples in later chapters. Thanks.

regards,
Vinod


Re: Persisting trained models in Mahout

Posted by Ted Dunning <te...@gmail.com>.
This is a fair statement of the traditional way of doing business for
*small* models of the sort used in classification.  The insistence on using
serialization is kind of silly since there are many down-sides to Java
serialization and it is becoming rare for systems that need to serialize
large amounts of data to use Java serialization.

The fact is, however, that this is not general practice with
recommendations.  It is common to do lots of off-line computation that you
could characterize as "learning", and it is common to save the results of
this off-line computation for later deployment, but it is also common to do
the learning on the fly since it is generally pretty trivial stuff.

The earliest examples highlight the simpler approach.  Keep going to see
more interesting examples.
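
To illustrate why the "learning" in a user-based recommender can be done on the fly: most of the work is computing a similarity, such as a Pearson correlation, between two users' ratings at the moment it is needed. The following is a minimal plain-Java sketch of that computation. It does not use Mahout's PearsonCorrelationSimilarity; the class name OnTheFlyPearson, the prefs helper and the sample ratings are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not a Mahout class.
class OnTheFlyPearson {

    // Builds an itemID -> rating map from two parallel arrays.
    static Map<Long, Double> prefs(long[] itemIds, double[] ratings) {
        Map<Long, Double> m = new HashMap<>();
        for (int i = 0; i < itemIds.length; i++) {
            m.put(itemIds[i], ratings[i]);
        }
        return m;
    }

    // Pearson correlation over the items both users have rated.
    // Returns NaN when there is no overlap or no variance.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // only co-rated items count
            }
            double x = e.getValue();
            double y = other;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
            n++;
        }
        if (n == 0) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        Map<Long, Double> u1 = prefs(new long[] {101, 102, 103}, new double[] {1, 2, 3});
        Map<Long, Double> u2 = prefs(new long[] {101, 102, 103}, new double[] {2, 4, 6});
        System.out.println("similarity(u1, u2) = " + pearson(u1, u2));
    }
}
```

Nothing here is stored between calls, which is why there is no separate "trained" object to persist in this simple case; the cost is paid at query time instead.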


Re: Persisting trained models in Mahout

Posted by Suneel Marthi <su...@yahoo.com>.
That's correct. 

Thanks for pointing this out, Lance.




Re: Persisting trained models in Mahout

Posted by Lance Norskog <go...@gmail.com>.
It would also be useful to load and cache often-used items and compute
rarely-used items online. The Caching classes are the natural fit for this.
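
A rough sketch of that idea in plain Java: keep the most recently used results in a bounded cache and compute everything else on demand. This is a generic LRU cache built on java.util.LinkedHashMap, not one of Mahout's caching classes; the name LruComputeCache and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; not a Mahout class.
// Keeps up to `capacity` results and computes missing ones on demand.
class LruComputeCache<K, V> {
    private final int capacity;
    private final Function<K, V> compute;
    private final LinkedHashMap<K, V> cache;
    private int misses;

    LruComputeCache(int capacity, Function<K, V> compute) {
        this.capacity = capacity;
        this.compute = compute;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LruComputeCache.this.capacity;
            }
        };
    }

    // Returns the cached value, or computes and caches it on a miss.
    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = compute.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the value had to be computed.
    synchronized int misses() {
        return misses;
    }

    public static void main(String[] args) {
        LruComputeCache<Integer, Integer> squares = new LruComputeCache<>(2, k -> k * k);
        squares.get(3);
        squares.get(3); // served from the cache
        System.out.println("misses = " + squares.misses());
    }
}
```

In a recommender the compute function would be the expensive call (for example, producing recommendations for one user), so popular keys stay cached while rare ones are computed online.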




-- 
Lance Norskog
goksron@gmail.com

Re: Persisting trained models in Mahout

Posted by Vinod <pi...@gmail.com>.
Sure Suneel. Thanks.


Re: Persisting trained models in Mahout

Posted by Suneel Marthi <su...@yahoo.com>.
Would the ModelSerializer class in Mahout be what you are looking for? I have used it to persist trained models for SGD classifiers; you may want to look into it.
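
For a linear classifier, the trained state is essentially a vector of weights, which is why it is cheap to persist. The sketch below shows the general idea using only the JDK; it does not call ModelSerializer, and the class WeightsFile with its save, load, roundTrip and score methods is hypothetical. The on-disk layout (a length followed by the doubles) is an assumption made for this example, not Mahout's format.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not Mahout's ModelSerializer.
class WeightsFile {

    // Writes the weight count followed by each weight as a binary double.
    static void save(Path file, double[] weights) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(weights.length);
            for (double w : weights) {
                out.writeDouble(w);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back what save() wrote.
    static double[] load(Path file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) {
                weights[i] = in.readDouble();
            }
            return weights;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Saves to a temporary file and loads the result, for demonstration.
    static double[] roundTrip(double[] weights) {
        try {
            Path file = Files.createTempFile("sgd-weights", ".bin");
            try {
                save(file, weights);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Logistic score for one feature vector using the loaded weights.
    static double score(double[] weights, double[] features) {
        double dot = 0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    public static void main(String[] args) {
        double[] restored = roundTrip(new double[] {0.5, -1.25, 3.0});
        System.out.println("restored " + restored.length + " weights");
    }
}
```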



________________________________
 From: Vinod <pi...@gmail.com>
To: user@mahout.apache.org 
Sent: Thursday, December 8, 2011 8:46 AM
Subject: Re: Persisting trained models in Mahout
 
I'll use the first example from Chapter 2 of your book to clarify what I
mean by training.

The following code trains the recommender:
    DataModel model = new FileDataModel(new File("intro.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
      new NearestNUserNeighborhood(2, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);

At this point, the recommender is trained on the preferences of users 1 to 5
in intro.csv.

We should now be able to serialize this recommender instance into a file,
say "Movie Recommender.model", using the steps mentioned here:
http://java.sun.com/developer/technicalArticles/Programming/serialization/

All we need to do now is deploy "Movie Recommender.model" to production.

If I understand the behavior correctly, this model should now be able to
predict recommendations for a new user.

As an example, let's assume that production has a different user base. If
the recommender instance is loaded from the "Movie Recommender.model" file
and asked to provide recommendations for user 7, who has rated 101 and 102 as
4 and 3 respectively, it should be able to predict recommendations for user 7,
right?

regards,
Vinod




On Thu, Dec 8, 2011 at 6:49 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes, I mean you need to write it and read it in your own code.
>
> What do you mean by training a model? computing similarities? I don't know
> if there's such a thing here as "training" on one data set and running on
> another. The implementations always use all currently available info. Is
> this a cold-start issue?
>
> OutOfMemoryError is nothing to do with this; on such a small data set it
> indicates you didn't set your JVM heap size above the default.
>
>
> On Thu, Dec 8, 2011 at 1:02 PM, Vinod <pi...@gmail.com> wrote:
>
> > Hi Sean,
> >
> > Neither Recommender nor any of its parent interfaces extends Serializable,
> > so there is no way that I'd be able to serialize it.
> >
> > I agree that the implementations may not have startup overhead. However,
> > training a model on millions of rows is a CPU-, memory- and time-consuming
> > activity. For example, when the data set is changed from 100K to 1M in
> > chapter 4, the program crashes with OutOfMemoryError after a significant
> > amount of time.
> >
> > I feel that training should be done in development only. Once a developer
> > is OK with the test results, he should be able to save an instance of the
> > trained and tested model (for example, a recommender or classifier).
> >
> > Only these saved instances of trained and tested models should be
> > deployed to production.
> >
> > Thoughts?
> >
> > regards,
> > Vinod
> >
> >
> >
> > On Thu, Dec 8, 2011 at 6:00 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > Ah right. No, there's still not a provision for this. You would just
> > > have to serialize it yourself if you like.
> > > Most of the implementations don't have a great deal of startup
> > > overhead, so don't really need this. The exception is perhaps
> > > slope-one, but there you can actually save and supply pre-computed
> > > diffs.
> > > Still it would be valid to store and re-supply user-user similarities
> > > or something. You can do this, manually, by querying for user-user
> > > similarities, saving them, then loading them and supplying them via
> > > GenericUserSimilarity for instance.
> > >
> > > On Thu, Dec 8, 2011 at 12:27 PM, Vinod <pi...@gmail.com> wrote:
> > >
> > > > Hi Sean,
> > > >
> > > > Thanks for the quick response.
> > > >
> > > > By model, I am not referring to the data model but to a "trained"
> > > > recommender instance.
> > > >
> > > > Weka, for example, has the ability to save and load models:
> > > > http://weka.wikispaces.com/Serialization
> > > > http://weka.wikispaces.com/Saving+and+loading+models
> > > >
> > > > This avoids the need to train the model (recommender) every time a
> > > > server is bounced or the program is restarted.
> > > >
> > > > regards,
> > > > Vinod
> > > >
> > > >
> > > > On Thu, Dec 8, 2011 at 5:43 PM, Sean Owen <sr...@gmail.com> wrote:
> > > >
> > > > > The classes aren't Serializable, no. In the case of DataModel, it's
> > > > > assumed that you already have some persisted model somewhere, in a
> > > > > DB or file or something, so this would be redundant.
> > > > >
> > > > > On Thu, Dec 8, 2011 at 12:07 PM, Vinod <pi...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is my first day of experimentation with Mahout. I am
> > > > > > following the "Mahout in Action" book and, looking at the sample
> > > > > > code provided, it seems that models (for example, a recommender)
> > > > > > need to be trained at the start of the program (start/restart).
> > > > > > The Recommender interface extends Refreshable, which doesn't
> > > > > > extend Serializable. So I am wondering if Mahout provides an
> > > > > > alternate mechanism to persist trained models (a recommender
> > > > > > instance in this case).
> > > > > >
> > > > > > Apologies if this is a very silly question.
> > > > > >
> > > > > > Thanks & regards,
> > > > > > Vinod
> > > > > >
> > > > >
> > > >
> > >
> >
>
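
Sean Owen's suggestion quoted above (query the user-user similarities, save them, then load and re-supply them) can be sketched as follows. This is plain Java with no Mahout dependency: the class SimilarityStore, its Sim triple and the "userA,userB,value" text format are assumptions made for this example. In Mahout the loaded triples would presumably be handed to GenericUserSimilarity, as Sean describes; that step is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not a Mahout class.
class SimilarityStore {

    // One precomputed similarity between two users.
    static final class Sim {
        final long userA;
        final long userB;
        final double value;

        Sim(long userA, long userB, double value) {
            this.userA = userA;
            this.userB = userB;
            this.value = value;
        }
    }

    // Writes one "userA,userB,value" line per similarity.
    static void save(Path file, List<Sim> sims) {
        List<String> lines = new ArrayList<>();
        for (Sim s : sims) {
            lines.add(s.userA + "," + s.userB + "," + s.value);
        }
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the lines written by save().
    static List<Sim> load(Path file) {
        List<Sim> sims = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String[] parts = line.split(",");
                sims.add(new Sim(Long.parseLong(parts[0]),
                                 Long.parseLong(parts[1]),
                                 Double.parseDouble(parts[2])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sims;
    }

    // Saves to a temporary file and loads the result, for demonstration.
    static List<Sim> roundTrip(List<Sim> sims) {
        try {
            Path file = Files.createTempFile("user-similarities", ".csv");
            try {
                save(file, sims);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Sim> sims = new ArrayList<>();
        sims.add(new Sim(1, 2, 0.75));
        sims.add(new Sim(1, 3, -0.5));
        System.out.println("reloaded " + roundTrip(sims).size() + " similarities");
    }
}
```

Note that stored similarities only cover users present when the file was written, which matches Sean's point that there is no real notion of training on one data set and running on another.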

Re: Persisting trained models in Mahout

Posted by Jens Grivolla <j+...@grivolla.net>.
I'm just getting started on Mahout for a new project. I used Taste a few 
years back, but things have changed a lot since then. So basically, I'll 
be working on getting all the basic functionality I need first, and am 
not really ready to take on such development right now.

I may look into persisting the transformation matrices if I need to, but 
that's at least a few months away still. So if it's ready to use by the 
time I need it, all the better ;-)

I'll be mostly working on integrating external user/content features 
(demographics, etc.) to deal with cold start, and will rely as heavily 
as possible on existing algorithms and implementations for the core CF 
stuff.

Bye,
Jens




Re: Persisting trained models in Mahout

Posted by Sebastian Schelter <ss...@apache.org>.
Yes, you describe it perfectly. I think the only reason this has not
been done yet is that the model computation is not very fast on Hadoop
because of its iterative nature.

Would you like to work on integrating the SVD recommenders?

--sebastian

On 09.12.2011 11:17, Jens Grivolla wrote:
> On 12/08/2011 03:19 PM, Sebastian Schelter wrote:
>> [...]
>>
>> A model for recommenders that use matrix factorization consists of the
>> user and item feature vectors. You can use a FilePersistenceStrategy
>> with any SVDRecommender to read and write these.
>>
>> In the future we could also support loading the results of
>> ParallelALSFactorizationJob into an SVDRecommender.
> 
> I was actually looking for this. I guess this is the one case where
> there actually is a "model", and calculating the factorization can be
> costly.
> 
> I would expect that doing the "SVD" offline (e.g. on Hadoop) and then
> providing online recommendations which only need a simple linear
> projection is a pretty common use case, isn't it?  You can even take new
> user preferences into account in realtime (when projecting the user
> vector into the feature space) with very little cost, and just update
> the transformation matrices (which should be quite static) periodically.
> 
> Bye,
> Jens
> 


Re: Persisting trained models in Mahout

Posted by Jens Grivolla <j+...@grivolla.net>.
On 12/08/2011 03:19 PM, Sebastian Schelter wrote:
> [...]
>
> A model for recommenders that use matrix factorization consists of the
> user and item feature vectors. You can use a FilePersistenceStrategy
> with any SVDRecommender to read and write these.
>
> In the future we could also support loading the results of
> ParallelALSFactorizationJob into an SVDRecommender.

I was actually looking for this. I guess this is the one case where 
there actually is a "model", and calculating the factorization can be 
costly.

I would expect that doing the "SVD" offline (e.g. on Hadoop) and then 
providing online recommendations which only need a simple linear 
projection is a pretty common use case, isn't it?  You can even take new 
user preferences into account in realtime (when projecting the user 
vector into the feature space) with very little cost, and just update 
the transformation matrices (which should be quite static) periodically.
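
The projection described above can be sketched in plain Java. This is
only an illustration, not Mahout code: the class FoldIn, its method names
and the toy numbers are hypothetical. It assumes the item-factor matrix has
orthonormal columns (as in a true SVD) and treats unrated items as 0.
Factors produced by ALS are generally not orthonormal and would need a
small least-squares solve instead.

```java
/** Hypothetical sketch: fold a new user's ratings into a precomputed item-factor space. */
final class FoldIn {

    /** Projects a ratings vector (one entry per item, 0 = unrated) onto the item factors. */
    static double[] project(double[] ratings, double[][] itemFactors) {
        int k = itemFactors[0].length;
        double[] latent = new double[k];
        for (int i = 0; i < ratings.length; i++) {
            for (int f = 0; f < k; f++) {
                latent[f] += ratings[i] * itemFactors[i][f];
            }
        }
        return latent;
    }

    /** Estimates the preference for one item as a dot product in the feature space. */
    static double estimate(double[] latent, double[] itemVector) {
        double sum = 0.0;
        for (int f = 0; f < latent.length; f++) {
            sum += latent[f] * itemVector[f];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy item-factor matrix: 3 items, 1 feature, unit-length column.
        double[][] itemFactors = { {0.6}, {0.8}, {0.0} };
        double[] newUser = { 3.0, 4.0, 0.0 };
        double[] latent = project(newUser, itemFactors);
        System.out.println("Estimate for item 1: " + estimate(latent, itemFactors[1]));
    }
}
```

Only project() has to run when a user adds a preference; the item factors
stay fixed until the next offline recomputation.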

Bye,
Jens


Re: Persisting trained models in Mahout

Posted by Sebastian Schelter <ss...@apache.org>.
A model for item-based collaborative filtering simply consists of the
precomputed item similarities.

We currently support such a precomputation only as a Hadoop job, but it
should take about an hour to write a class that precalculates the
item similarities sequentially using an ItemBasedRecommender.

You can either store these similarities in the database and load them
via MySQLJDBCInMemoryItemSimilarity/SQL92JDBCInMemoryItemSimilarity or
you can write them to a .csv file and load them via FileItemSimilarity.
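
A rough idea of what such a sequential precomputation could look like,
written against plain Java collections rather than the Mahout API. The
class ItemSimilarityExport and its methods are hypothetical, cosine
similarity is just one possible measure, and the "itemA,itemB,similarity"
line layout is an assumption about what FileItemSimilarity reads, so check
it against the Javadoc of your Mahout version.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Mahout: precomputes item-item similarities. */
final class ItemSimilarityExport {

    /** Cosine similarity between two sparse item vectors keyed by user ID. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns one "itemA,itemB,similarity" line per unordered item pair. */
    static String toCsv(Map<Long, Map<Long, Double>> prefsByItem) {
        // Sort by item ID so the output order is deterministic.
        TreeMap<Long, Map<Long, Double>> sorted = new TreeMap<>(prefsByItem);
        Long[] items = sorted.keySet().toArray(new Long[0]);
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                double sim = cosine(sorted.get(items[i]), sorted.get(items[j]));
                csv.append(items[i]).append(',').append(items[j])
                   .append(',').append(sim).append('\n');
            }
        }
        return csv.toString();
    }
}
```

The returned string can be written to a file once, offline, and reloaded at
startup instead of recomputing every pair.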

A model for recommenders that use matrix factorization consists of the
user and item feature vectors. You can use a FilePersistenceStrategy
with any SVDRecommender to read and write these.
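
To make "the model is the user and item feature vectors" concrete, here is
a minimal sketch of writing two factor matrices to bytes and reading them
back. It does not use FilePersistenceStrategy or reproduce its file format;
FactorizationCodec and its layout are made up for illustration, it assumes
at least one user row and the same number of features in both matrices, and
a real model would also need the mapping from user and item IDs to row
indexes, which is left out here.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec, not the Mahout format: stores user and item feature matrices. */
final class FactorizationCodec {

    /** Layout: userCount, itemCount, featureCount, then all rows as doubles. */
    static byte[] encode(double[][] userFeatures, double[][] itemFeatures) {
        int k = userFeatures[0].length;
        int cells = (userFeatures.length + itemFeatures.length) * k;
        ByteBuffer buf = ByteBuffer.allocate(3 * Integer.BYTES + cells * Double.BYTES);
        buf.putInt(userFeatures.length).putInt(itemFeatures.length).putInt(k);
        putRows(buf, userFeatures);
        putRows(buf, itemFeatures);
        return buf.array();
    }

    /** Returns { userFeatures, itemFeatures } decoded from encode()'s output. */
    static double[][][] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int users = buf.getInt();
        int items = buf.getInt();
        int k = buf.getInt();
        return new double[][][] { getRows(buf, users, k), getRows(buf, items, k) };
    }

    private static void putRows(ByteBuffer buf, double[][] rows) {
        for (double[] row : rows) {
            for (double v : row) {
                buf.putDouble(v);
            }
        }
    }

    private static double[][] getRows(ByteBuffer buf, int n, int k) {
        double[][] rows = new double[n][k];
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < k; c++) {
                rows[r][c] = buf.getDouble();
            }
        }
        return rows;
    }
}
```

The byte array can be saved with java.nio.file.Files.write and loaded with
Files.readAllBytes.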

In the future we could also support loading the results of
ParallelALSFactorizationJob into an SVDRecommender.

--sebastian



On 08.12.2011 14:49, Sean Owen wrote:
> That's right, you could get this effect by computing and saving off all the
> user-user similarities, then reading them back in, putting them in a
> GenericUserSimilarity, and proceeding as below. Those similarities are the
> closest thing to a model here.
> 
> It's going to take a while to compute all those pairs, and most will be
> unused, and so reloading them is going to take a lot of time and memory.
> You could prune the small ones I suppose. It might be faster to recompute!
> 
> On Thu, Dec 8, 2011 at 1:46 PM, Vinod <pi...@gmail.com> wrote:
> 
>> I'll use the first example from Chapter 2 of your book to clarify what I
>> mean by training:-
>>
>> Following code trains the recommender:-
>>    DataModel model = new FileDataModel(new File("intro.csv"));
>>
>>    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>>    UserNeighborhood neighborhood =
>>      new NearestNUserNeighborhood(2, similarity, model);
>>
>>    Recommender recommender = new GenericUserBasedRecommender(
>>        model, neighborhood, similarity);
>>
>> At this point, recommender is trained on preferences of users 1 to 5 in
>> intro.csv.
>>
>> We should now be able to serialize() this recommender instance into a file,
>> say "Movie Recommender.model" using steps mentioned here (
>> http://java.sun.com/developer/technicalArticles/Programming/serialization/
>> )
>>
>> All we need to do now is deploy "Movie Recommender.model" to production.
>>
>> If I understand the behavior correctly, this model should now be able to
>> predict recommendation for a new user.
>>
>> As an example, lets assume that production has a different user base. If
>> recommender instance is loaded from "Movie Recommender.model" file and
>> asked to provide recommendations for user '7' who has rated 101 and 102 as
>> 4 and 3 respectively, it should be able to predict recommendations for 7.
>> right?
>>
>> regards,
>> Vinod
>>
>>
>>
>>
> 


Re: Persisting trained models in Mahout

Posted by Sean Owen <sr...@gmail.com>.
That's right, you could get this effect by computing and saving off all the
user-user similarities, then reading them back in, putting them in a
GenericUserSimilarity, and proceeding as below. Those similarities are the
closest thing to a model here.

It's going to take a while to compute all those pairs, and most will be
unused, and so reloading them is going to take a lot of time and memory.
You could prune the small ones I suppose. It might be faster to recompute!
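
For illustration, the all-pairs computation with pruning might look roughly
like this in plain Java. Nothing here is Mahout API: UserSimilarityPruning,
the "userA,userB" key format and the threshold parameter are hypothetical,
and the Pearson correlation is computed only over items both users have
rated.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: compute all user-user similarities, keep only the strong ones. */
final class UserSimilarityPruning {

    /** Pearson correlation over co-rated items; NaN when it is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Returns "userA,userB" -> similarity for pairs with |similarity| >= minAbs. */
    static Map<String, Double> prunedPairs(Map<Long, Map<Long, Double>> prefsByUser,
                                           double minAbs) {
        TreeMap<Long, Map<Long, Double>> sorted = new TreeMap<>(prefsByUser);
        Long[] users = sorted.keySet().toArray(new Long[0]);
        Map<String, Double> kept = new TreeMap<>();
        for (int i = 0; i < users.length; i++) {
            for (int j = i + 1; j < users.length; j++) {
                double s = pearson(sorted.get(users[i]), sorted.get(users[j]));
                if (!Double.isNaN(s) && Math.abs(s) >= minAbs) {
                    kept.put(users[i] + "," + users[j], s);
                }
            }
        }
        return kept;
    }
}
```

The nested loop visits every pair of users, which is exactly why this gets
slow on large data sets and why recomputing on demand can be cheaper.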

On Thu, Dec 8, 2011 at 1:46 PM, Vinod <pi...@gmail.com> wrote:

> I'll use the first example from Chapter 2 of your book to clarify what I
> mean by training:-
>
> Following code trains the recommender:-
>    DataModel model = new FileDataModel(new File("intro.csv"));
>
>    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>    UserNeighborhood neighborhood =
>      new NearestNUserNeighborhood(2, similarity, model);
>
>    Recommender recommender = new GenericUserBasedRecommender(
>        model, neighborhood, similarity);
>
> At this point, recommender is trained on preferences of users 1 to 5 in
> intro.csv.
>
> We should now be able to serialize() this recommender instance into a file,
> say "Movie Recommender.model" using steps mentioned here (
> http://java.sun.com/developer/technicalArticles/Programming/serialization/
> )
>
> All we need to do now is deploy "Movie Recommender.model" to production.
>
> If I understand the behavior correctly, this model should now be able to
> predict recommendation for a new user.
>
> As an example, lets assume that production has a different user base. If
> recommender instance is loaded from "Movie Recommender.model" file and
> asked to provide recommendations for user '7' who has rated 101 and 102 as
> 4 and 3 respectively, it should be able to predict recommendations for 7.
> right?
>
> regards,
> Vinod
>
>
>
>

Re: Persisting trained models in Mahout

Posted by Vinod <pi...@gmail.com>.
I'll use the first example from Chapter 2 of your book to clarify what I
mean by training.

The following code trains the recommender:
    DataModel model = new FileDataModel(new File("intro.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
      new NearestNUserNeighborhood(2, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);

At this point, the recommender is trained on the preferences of users 1 to
5 in intro.csv.

We should now be able to serialize this recommender instance into a file,
say "Movie Recommender.model", using the steps mentioned here (
http://java.sun.com/developer/technicalArticles/Programming/serialization/)

All we need to do now is deploy "Movie Recommender.model" to production.

If I understand the behavior correctly, this model should now be able to
predict recommendations for a new user.

As an example, let's assume that production has a different user base. If
the recommender instance is loaded from the "Movie Recommender.model" file
and asked to provide recommendations for user 7, who has rated items 101
and 102 as 4 and 3 respectively, it should be able to predict
recommendations for user 7, right?

regards,
Vinod




On Thu, Dec 8, 2011 at 6:49 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes, I mean you need to write it and read it in your own code.
>
> What do you mean by training a model? computing similarities? I don't know
> if there's such a thing here as "training" on one data set and running on
> another. The implementations always use all currently available info. Is
> this a cold-start issue?
>
> OutOfMemoryError is nothing to do with this; on such a small data set it
> indicates you didn't set your JVM heap size above the default.
>
>

Re: Persisting trained models in Mahout

Posted by Sean Owen <sr...@gmail.com>.
Yes, I mean you need to write it and read it in your own code.

What do you mean by training a model? Computing similarities? I don't know
if there's such a thing here as "training" on one data set and running on
another. The implementations always use all currently available info. Is
this a cold-start issue?

OutOfMemoryError has nothing to do with this; on such a small data set it
indicates you didn't set your JVM heap size above the default.
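
A quick way to see which heap limit a program is actually running with is
to ask the JVM. The class name HeapCheck is just for illustration; the
limit itself is raised with the standard -Xmx option on the java command
line (for example -Xmx1g).

```java
/** Illustrative helper: reports the maximum heap the running JVM may use. */
final class HeapCheck {

    /** Maximum heap size in mebibytes, as reported by the runtime. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```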


On Thu, Dec 8, 2011 at 1:02 PM, Vinod <pi...@gmail.com> wrote:

> Hi Sean,
>
> Neither Recommender nor any of its parent interface extends serializable so
> there is no way that I'd be able to serialize it.
>
> I agree that the implementations may not have startup overhead. However,
> training a model on millions of row is a cpu, memory & time consuming
> activity. For example, when data set is changed from 100K to 1M in chapter
> 4, program crashes with OutOfMemory after significant amount of time.
>
> I feel that training should be done in development only. Once a developer
> is ok with test results, he should be able to save instance of the trained
> and tested model  (for ex:- recommender or classifier).
>
> These saved instances of trained and tested models only should be deployed
> to production.
>
> Thought?
>
> regards,
> Vinod
>
>
>

Re: Persisting trained models in Mahout

Posted by Ted Dunning <te...@gmail.com>.
There are other ways to store data structures than implementing Serializable.

The classifiers, for instance, can be saved and loaded at will.  See
Chapter 16.

Recommenders allow off-line computation of item-item similarities which is
the major cost for a recommender.  The on-line component starts quickly and
provides fast access given this data.
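
As a sketch of that split, the on-line side can be as small as the class
below: it loads precomputed "itemA,itemB,similarity" lines and scores an
item with a similarity-weighted average of the user's own ratings. This is
a generic item-based estimate in plain Java, not the Mahout implementation;
ItemBasedScorer and the line format are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical on-line component: scores items from precomputed similarities. */
final class ItemBasedScorer {

    private final Map<Long, Map<Long, Double>> sims = new HashMap<>();

    /** Loads "itemA,itemB,similarity" lines, storing each pair in both directions. */
    ItemBasedScorer(List<String> csvLines) {
        for (String line : csvLines) {
            String[] parts = line.split(",");
            long a = Long.parseLong(parts[0].trim());
            long b = Long.parseLong(parts[1].trim());
            double s = Double.parseDouble(parts[2].trim());
            sims.computeIfAbsent(a, key -> new HashMap<>()).put(b, s);
            sims.computeIfAbsent(b, key -> new HashMap<>()).put(a, s);
        }
    }

    /** Similarity-weighted average of the user's ratings; NaN if nothing is comparable. */
    double estimate(Map<Long, Double> userRatings, long targetItem) {
        Map<Long, Double> row = sims.getOrDefault(targetItem, Collections.emptyMap());
        double weighted = 0.0;
        double total = 0.0;
        for (Map.Entry<Long, Double> rating : userRatings.entrySet()) {
            Double s = row.get(rating.getKey());
            if (s == null) {
                continue; // no precomputed similarity for this rated item
            }
            weighted += s * rating.getValue();
            total += Math.abs(s);
        }
        return total == 0.0 ? Double.NaN : weighted / total;
    }
}
```

Because only the similarities are loaded, a user who was not part of the
off-line computation can still be scored as soon as their ratings are known.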

Your problems with memory usage were probably due to using the paradigm in
which the entire computation is done on-line.  That is fine for small
problems, but not for large.  Keep in mind also that recommendation models
are not small.

On Thu, Dec 8, 2011 at 6:02 AM, Vinod <pi...@gmail.com> wrote:

> Neither Recommender nor any of its parent interface extends serializable so
> there is no way that I'd be able to serialize it.
>

Re: Persisting trained models in Mahout

Posted by Vinod <pi...@gmail.com>.
Hi Sean,

Neither Recommender nor any of its parent interfaces extends Serializable,
so there is no way that I'd be able to serialize it.

I agree that the implementations may not have startup overhead. However,
training a model on millions of rows is a CPU-, memory- and time-consuming
activity. For example, when the data set is changed from 100K to 1M in
chapter 4, the program crashes with an OutOfMemoryError after a significant
amount of time.

I feel that training should be done in development only. Once a developer
is OK with the test results, he should be able to save an instance of the
trained and tested model (e.g. a recommender or classifier).

Only these saved instances of trained and tested models should be deployed
to production.

Thoughts?

regards,
Vinod



On Thu, Dec 8, 2011 at 6:00 PM, Sean Owen <sr...@gmail.com> wrote:

> Ah right. No, there's still not a provision for this. You would just have
> to serialize it yourself if you like.
> Most of the implementations don't have a great deal of startup overhead, so
> don't really need this. The exception is perhaps slope-one, but there you
> can actually save and supply pre-computed diffs.
> Still it would be valid to store and re-supply user-user similarities or
> something. You can do this, manually, by querying for user-user
> similarities, saving them, then loading them and supplying them via
> GenericUserSimilarity for instance.
>

Re: Persisting trained models in Mahout

Posted by Sean Owen <sr...@gmail.com>.
Ah right. No, there's still not a provision for this. You would just have
to serialize it yourself if you like.
Most of the implementations don't have a great deal of startup overhead, so
don't really need this. The exception is perhaps slope-one, but there you
can actually save and supply pre-computed diffs.
Still it would be valid to store and re-supply user-user similarities or
something. You can do this, manually, by querying for user-user
similarities, saving them, then loading them and supplying them via
GenericUserSimilarity for instance.
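
A minimal sketch of the "save them, then load them" step, using ordinary
Java serialization of a map you build yourself. SimilaritySnapshot and the
"userA,userB" key format are hypothetical, and feeding the loaded values
into GenericUserSimilarity is not shown; check its constructors in the
Mahout Javadoc for that part.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: saves and reloads precomputed similarities by hand. */
final class SimilaritySnapshot {

    /** Writes the similarities ("userA,userB" -> value) to a file. */
    static void save(Map<String, Double> similarities, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            // Copy into a HashMap so the written object is known to be Serializable.
            out.writeObject(new HashMap<>(similarities));
        }
    }

    /** Reads back a map previously written by save(). */
    @SuppressWarnings("unchecked")
    static Map<String, Double> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, Double>) in.readObject();
        }
    }
}
```

A plain text or CSV file works just as well and is easier to inspect; the
point is only that the similarities, not the recommender object, are what
gets persisted.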

On Thu, Dec 8, 2011 at 12:27 PM, Vinod <pi...@gmail.com> wrote:

> Hi Sean,
>
> Thanks for the quick response.
>
> By model, I am not referring to data model but, a "trained" recommender
> instance.
>
> Weka, for examples, has ability to save and load models:-
> http://weka.wikispaces.com/Serialization
> http://weka.wikispaces.com/Saving+and+loading+models
>
> This avoids the need to train model (recommender) every time a server is
> bounced or program is restarted.
>
> regards,
> Vinod
>
>

Re: Persisting trained models in Mahout

Posted by Vinod <pi...@gmail.com>.
Hi Sean,

Thanks for the quick response.

By model, I am not referring to the data model but to a "trained"
recommender instance.

Weka, for example, has the ability to save and load models:
http://weka.wikispaces.com/Serialization
http://weka.wikispaces.com/Saving+and+loading+models

This avoids the need to train the model (recommender) every time a server
is bounced or the program is restarted.

regards,
Vinod


On Thu, Dec 8, 2011 at 5:43 PM, Sean Owen <sr...@gmail.com> wrote:

> The classes aren't Serializable, no. In the case of DataModel, it's assumed
> that you already have some persisted model somewhere, in a DB or file or
> something, so this would be redundant.
>

Re: Persisting trained models in Mahout

Posted by Sean Owen <sr...@gmail.com>.
The classes aren't Serializable, no. In the case of DataModel, it's assumed
that you already have some persisted model somewhere, in a DB or file or
something, so this would be redundant.

On Thu, Dec 8, 2011 at 12:07 PM, Vinod <pi...@gmail.com> wrote:

> Hi,
>
> This is my first day of experimentation with Mahout. I am following "Mahout
> in Action" book and looking at the sample code provided, it seems that
> models for ex:- recommender, needs to be trained at the start of the
> program (start/restart). Recommender interface extends Refreshable which
> doesn't extend serializable. So, I am wondering if Mahout provides an
> alternate mechanism to to persist trained models (recommender instance in
> this case).
>
> Apologies if this is a very silly question.
>
> Thanks & regards,
> Vinod
>