Posted to user@mahout.apache.org by PierLorenzo Bianchini <pi...@yahoo.com.INVALID> on 2015/04/03 12:26:47 UTC

fast performance way of writing preferences to file?

Hello everyone,
I'm new to mahout, to recommender systems and to the mailing list.

I'm trying to find a (fast) way to write preferences back to a file. I've tried a few methods, but I'm sure there must be a better approach.
Here's the deal (you can find the same post in stackoverflow[1]).
I have a training dataset of 800.000 records from 6000 users rating 3900 movies. These are stored in a comma separated file like: userId,movieId,preference. I have another dataset (200.000 records) in the format: userId,movieId. My goal is to use the first dataset as a training-set, in order to determine the missing preferences of the second set.

So far, I managed to load the training dataset and I generated user-based recommendations. This is pretty smooth and doesn't take too much time. But I'm struggling when it comes to writing back the recommendations.

The first method I tried is:

 * read a line from the file and get the userId,movieId tuple.
 * retrieve the calculated preference with estimatePreference(userId, movieId)
 * append the preference to the line and save it in a new file
This works, but it's incredibly slow (I added a counter that prints every 10,000th iteration: after a couple of minutes it had printed only once. I have 8 GB of RAM and an i7 CPU... how long can it take to process 200,000 lines?!)
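
For reference, a condensed sketch of that first method against Mahout's Taste API (the file names and the neighborhood size of 50 are illustrative, not from the original setup):

    import java.io.*;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class EstimateToFile {
      public static void main(String[] args) throws Exception {
        // Training data: userId,movieId,preference
        DataModel model = new FileDataModel(new File("train.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Test data: userId,movieId -> append the estimate, write a new file.
        try (BufferedReader in = new BufferedReader(new FileReader("test.csv"));
             BufferedWriter out = new BufferedWriter(new FileWriter("predictions.csv"))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split(",");
            float estimate = recommender.estimatePreference(
                Long.parseLong(parts[0]), Long.parseLong(parts[1]));
            out.write(line + ',' + estimate);
            out.newLine();
          }
        }
      }
    }

With buffered I/O the file side is cheap; if this loop is slow, the time is going into estimatePreference, not into the writing.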

My second choice was:

 * create a new FileDataModel with the second dataset
 * do something like this: newDataModel.setPreference(userId, movieId, recommender.estimatePreference(userId, movieId));

Here I get several problems:
 * at runtime: java.lang.UnsupportedOperationException (as I found out in [2], FileDataModel can't actually be updated; I don't understand why the method setPreference exists in the first place...)
 * The API of FileDataModel#setPreference states "This method should also be considered relatively slow."

I've read that a solution would be to use delta files, but I couldn't find out what that actually means. Any suggestions on how I could speed up my preference-writing process?
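
One thing in the Taste API that might help: each of the 200,000 pairs is estimated only once, so caching whole estimates buys nothing, but the same user-user similarities get recomputed over and over across estimates. A sketch of wrapping the similarity in Taste's CachingUserSimilarity, replacing the corresponding lines in the snippet above:

    import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;

    // Cache user-user similarity values so each pair is computed only once.
    UserSimilarity similarity =
        new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);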
Thank you!

Pier Lorenzo


[1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
[2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330

Re: fast performance way of writing preferences to file?

Posted by PierLorenzo Bianchini <pi...@yahoo.com.INVALID>.
OK, thank you very much; I'll have a look at all that tonight.
FYI, I ran the program a few times and checked my performance numbers: it takes between 48 minutes and one hour to write the results to file. I improved my writing procedure as much as I could, but I have the feeling that my bottleneck is the estimatePreference method. Too bad. Still, I've set up a benchmark to run several tests overnight, and I think I'll be able to get enough results before my submission deadline (maybe I can try to parallelise things a bit... I've never done that, so it should be interesting).
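
On the parallelising idea: each estimate is independent of the others, so a fixed thread pool can spread the 200,000 pairs across cores. A minimal sketch, assuming the in-memory Taste recommender tolerates concurrent estimatePreference calls (Taste's in-memory implementations are generally meant for concurrent reads, but that is an assumption to verify; all names here are illustrative):

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class ParallelEstimates {
      // Fan the "userId,movieId" lines out across a fixed thread pool.
      static void writeEstimates(Recommender recommender, List<String> pairs, File outFile)
          throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> results = new ArrayList<>();
        for (String line : pairs) {
          results.add(pool.submit(() -> {
            String[] p = line.split(",");
            float est = recommender.estimatePreference(
                Long.parseLong(p[0]), Long.parseLong(p[1]));
            return line + ',' + est;
          }));
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
          for (Future<String> f : results) {  // futures come back in submission order
            out.write(f.get());
            out.newLine();
          }
        }
        pool.shutdown();
      }
    }

Since the futures are collected in submission order, the output file keeps the input order.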
Thank you all for your input; I'll try to code the adjusted cosine similarity tonight.
Regards,

PL



Re: fast performance way of writing preferences to file?

Posted by Suneel Marthi <su...@gmail.com>.
FYI, adding to Pat's reply below: Slope-One has long been deprecated.


Re: fast performance way of writing preferences to file?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry, we are trying to get a release out.

You could look at a custom similarity measure. Look at where SIMILARITY_COSINE leads you and customize that, maybe? There are in-memory and MapReduce versions, and I'm not sure which you are using. That is code I haven't looked at for a long time, so I can't get you much closer.
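
For the in-memory path, UncenteredCosineSimilarity is Taste's plain cosine implementation, so that is probably the class to copy and adapt. A rough sketch of plugging it into an item-based recommender as a starting point (file name and ids are illustrative):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class CosineStartingPoint {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("train.csv"));
        // Plain cosine over raw ratings; copying or subclassing this class
        // is the natural place to introduce an adjusted variant.
        ItemSimilarity cosine = new UncenteredCosineSimilarity(model);
        GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, cosine);
        System.out.println(recommender.estimatePreference(42L, 7L));
      }
    }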



Re: fast performance way of writing preferences to file?

Posted by PierLorenzo Bianchini <pi...@yahoo.com.INVALID>.
Hi again,
Seeing the answers to this question and to the other one I posted ("adjusted cosine similarity for item-based recommender?"), I think I should clarify what I'm trying to achieve and why I (believe I should) do things the way I'm doing.

I'm taking a class called "Learning from User-Generated data". Our first assignment deals with analysing the results of various types of recommenders. I'll go as far as saying "old-school" recommenders, given the content of your answers.
We have been introduced to:
 * Memory based:
     - user-based
     - item-based (*with* adjusted cosine similarity!)
     - slope-one
     - graph-based transitivity
 * Model based:
     - preprocessed item/user-based (this is unclear to me, but I haven't reached this part of the assignment yet, so I'll search for information before I ask questions; I also found an article that listed slope-one among the model-based methods, so I guess I'll need to do more research on this)
     - matrix factorization-based (I saw that SVD is available in Mahout; my project partner is looking into that right now; a sketch follows this list)
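
On the matrix-factorization bullet above, a minimal sketch of Mahout's Taste SVD recommender; the hyperparameters (20 features, lambda 0.065, 10 iterations) are placeholder values, not tuned ones:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class SvdSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("train.csv"));
        // ALS-WR factorization: numFeatures, lambda, numIterations.
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 10);
        SVDRecommender recommender = new SVDRecommender(model, factorizer);
        System.out.println(recommender.estimatePreference(42L, 7L));
      }
    }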

We have a *static* training dataset (800,000 <user,movie,preference> triples) and another static dataset for which we have to extract the predicted preferences (200,000 <user,movie> tuples) and write them back to a file (i.e. recompose the <user,movie,preference> triples). Note that this will never go into a production environment, as it is merely a university requirement. For the same reason, I would prefer not to mix things up too much, and I'd rather learn step by step (i.e. focus on Mahout for now, before I dig deeper and check the search-based approach, which uses a DB, Mahout, Solr and Spark... maybe a bit too much to handle at once with the deadline we were given).

So if I might get back to my original questions (again, I'm sorry for being stubborn but I'm under specific constraints - I'll really try to understand the search-based approach when I have more time) ;)
1. I'm guessing that to implement an adjusted cosine similarity I should extend AbstractSimilarity (or maybe even AbstractRecommender?). Is this right?
2. I still can't believe that it takes more than, at most, a few minutes to go through my 200,000 lines and look up the already-calculated preferences. What am I doing wrong? :/ Should I store my whole DataModel in a file (how?) and then read through the file? I don't see how that could be faster than just reading the exact value I'm searching for...
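
On question 1, adjusted cosine is the usual formula that subtracts each user's mean rating before taking the cosine over co-rated items; in LaTeX:

    sim(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
                     {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}

where U is the set of users who rated both items i and j, and \bar{R}_u is user u's mean rating. A minimal sketch of the core computation against a Taste DataModel follows; it is not a full ItemSimilarity implementation (itemSimilarities, allSimilarItemIDs and refresh are left out), just the math:

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.Preference;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    public final class AdjustedCosine {

      // Precompute each user's mean rating once.
      public static FastByIDMap<Double> userMeans(DataModel model) throws TasteException {
        FastByIDMap<Double> means = new FastByIDMap<>();
        LongPrimitiveIterator users = model.getUserIDs();
        while (users.hasNext()) {
          long u = users.nextLong();
          PreferenceArray prefs = model.getPreferencesFromUser(u);
          double sum = 0;
          for (Preference p : prefs) {
            sum += p.getValue();
          }
          means.put(u, sum / prefs.length());
        }
        return means;
      }

      // Adjusted cosine between two items, over users who rated both.
      public static double similarity(DataModel model, FastByIDMap<Double> means,
                                      long item1, long item2) throws TasteException {
        double dot = 0, norm1 = 0, norm2 = 0;
        for (Preference p1 : model.getPreferencesForItem(item1)) {
          Float r2 = model.getPreferenceValue(p1.getUserID(), item2);
          if (r2 == null) {
            continue;  // this user didn't rate item2
          }
          double mean = means.get(p1.getUserID());
          double d1 = p1.getValue() - mean;
          double d2 = r2 - mean;
          dot += d1 * d2;
          norm1 += d1 * d1;
          norm2 += d2 * d2;
        }
        return (norm1 == 0 || norm2 == 0) ? Double.NaN : dot / Math.sqrt(norm1 * norm2);
      }
    }

Wrapping this in a class that implements org.apache.mahout.cf.taste.similarity.ItemSimilarity would then let it drop into GenericItemBasedRecommender.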

Thanks again for your answers! Regards,

Pier Lorenzo



Re: fast performance way of writing preferences to file?

Posted by Ted Dunning <te...@gmail.com>.
Are you sure that the problem is writing the results?  It seems to me that
the real problem is the use of a user-based recommender.

For such a small data set, for instance, a search-based recommender will be
able to make recommendations in less than a millisecond with multiple
recommendations possible in parallel.  This should allow you to do 200,000
recommendations in a few minutes on a single machine.
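
To make "search-based" concrete: the idea is to precompute item-item indicators offline (e.g. with an LLR cooccurrence analysis such as Mahout's item-similarity jobs), index them in a search engine, and answer each request with a single query over the user's history. A minimal SolrJ sketch, where the core name, the "indicators" field, the movie ids, and a recent SolrJ client builder are all assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchBasedReco {
      public static void main(String[] args) throws Exception {
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
        // The user's watched movies become the query against the indicators field.
        SolrQuery query = new SolrQuery("indicators:(1210 2571 4993)");
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
          System.out.println(doc.getFieldValue("id"));  // recommended item ids
        }
        solr.close();
      }
    }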

With such a small dataset, indicator-based methods may not be the best
option.  To improve that, try using something larger such as the million
song dataset.  See http://labrosa.ee.columbia.edu/millionsong/

Also, using and estimating ratings is not a particularly good thing to be
doing if you want to build a real recommender.

