Posted to user@mahout.apache.org by PierLorenzo Bianchini <pi...@yahoo.com.INVALID> on 2015/04/03 12:26:47 UTC
fast performance way of writing preferences to file?
Hello everyone,
I'm new to Mahout, to recommender systems and to the mailing list.
I'm trying to find a (fast) way to write back preferences to a file. I tried a few methods but I'm sure there must be a better approach.
Here's the deal (you can find the same post in stackoverflow[1]).
I have a training dataset of 800.000 records from 6000 users rating 3900 movies. These are stored in a comma-separated file like: userId,movieId,preference. I have another dataset (200.000 records) in the format: userId,movieId. My goal is to use the first dataset as a training set, in order to determine the missing preferences of the second set.
So far, I managed to load the training dataset and I generated user-based recommendations. This is pretty smooth and doesn't take too much time. But I'm struggling when it comes to writing back the recommendations.
The first method I tried is:
* read a line from the file and get the userId,movieId tuple.
* retrieve the calculated preference with estimatePreference(userId, movieId)
* append the preference to the line and save it in a new file
This works, but it's incredibly slow (I added a counter to print every 10.000th iteration: after a couple of minutes it had only printed once. I have 8 GB of RAM and an i7 CPU... how long can it take to process 200.000 lines?!)
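For what it's worth, the file writing itself should not be the slow part of that loop. A plain-Java sketch (not the poster's code; it uses a synthetic payload instead of real estimates) shows that pushing 200.000 lines through a single BufferedWriter finishes in well under a second, which points the suspicion at the per-line estimatePreference call instead:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteBenchmark {

    // Write n "userId,movieId,preference" lines through one BufferedWriter;
    // returns the elapsed time in milliseconds. The payload is synthetic --
    // a real run would write recommender.estimatePreference(...) results.
    public static long writeLines(Path out, int n) throws IOException {
        long start = System.nanoTime();
        try (BufferedWriter w = new BufferedWriter(new FileWriter(out.toFile()))) {
            for (int i = 0; i < n; i++) {
                w.write(i + "," + (i % 3900) + "," + 3.5f);
                w.newLine();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("prefs", ".csv");
        long ms = writeLines(out, 200_000);
        System.out.println("200000 lines in " + ms + " ms");
        Files.delete(out);
    }
}
```

If a loop like the one described takes minutes, the time is almost certainly going into the estimate (or into unbuffered, per-line file opens), not into the writes themselves.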
My second choice was:
* create a new FileDataModel with the second dataset
* do something like this: newDataModel.setPreference(userId, movieId, recommender.estimatePreference(userId, movieId));
Here I get several problems:
* at runtime: java.lang.UnsupportedOperationException (as I found out in [2], FileDataModel actually can't be updated. I don't understand why the function setPreference exists in the first place...)
* The API of FileDataModel#setPreference states "This method should also be considered relatively slow."
I read around that a solution would be to use delta files, but I couldn't find out what that actually means. Any suggestion on how I could speed up my writing-the-preferences process?
Thank you!
Pier Lorenzo
[1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
[2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330
Re: fast performance way of writing preferences to file?
Posted by PierLorenzo Bianchini <pi...@yahoo.com.INVALID>.
Ok, thank you very much, I'll have a look at all that tonight.
FYI, I ran the program a few times and checked my performance results. It takes between 48 minutes and one hour to write the results to file. I improved my writing procedure as much as I could, but I have the feeling that my bottleneck is the "estimatePreference" method. Too bad. Still, I set up a benchmark to run several tests overnight, and I think I'll be able to get enough results before my submission deadline (maybe I can try to parallelise things a bit... I never did that, so it should be interesting).
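The parallelisation idea can be sketched with a parallel stream. In the sketch below, `estimator` is a stand-in for `recommender::estimatePreference`; it is assumed (not verified here) that the recommender is safe for concurrent read-only calls once it has been built:

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelEstimate {

    // Estimate one preference per (userId, movieId) pair in parallel and
    // return the finished "userId,movieId,preference" lines in input order
    // (an ordered stream collected with toList() keeps encounter order).
    public static List<String> estimateAll(long[] users, long[] movies,
            ToDoubleBiFunction<Long, Long> estimator) {
        return IntStream.range(0, users.length)
                .parallel()
                .mapToObj(i -> users[i] + "," + movies[i] + ","
                        + estimator.applyAsDouble(users[i], movies[i]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long[] u = {1, 2, 3};
        long[] m = {10, 20, 30};
        // Dummy constant estimator; a real run would pass
        // recommender::estimatePreference instead.
        List<String> lines = estimateAll(u, m, (user, movie) -> 4.0);
        lines.forEach(System.out::println);
    }
}
```

Collecting the results and writing them out in one buffered pass at the end also avoids interleaving slow estimates with file I/O.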
Thank you all for your inputs; I'll try to code the adjusted cos sim tonight.
Regards,
PL
Re: fast performance way of writing preferences to file?
Posted by Suneel Marthi <su...@gmail.com>.
FYI, adding to Pat's reply below: Slope-One has long been deprecated.
Re: fast performance way of writing preferences to file?
Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry, we are trying to get a release out.
You can look at a custom similarity measure. Look at where SIMILARITY_COSINE leads you and customize that, maybe? There are in-memory and MapReduce versions, and I'm not sure which you are using. That is code I haven't looked at for a long time, so I can't get you much closer.
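The customization Pat suggests can be sketched in plain Java, independent of Mahout's own similarity classes (this is an illustration of the adjusted-cosine formula, not Mahout's SIMILARITY_COSINE code): each rating is centered on its user's mean rating before the usual cosine is taken over co-rated items.

```java
import java.util.HashMap;
import java.util.Map;

public class AdjustedCosine {

    // ratings: userId -> (itemId -> rating)
    // Adjusted cosine similarity between two items: cosine over users who
    // rated both, after subtracting each user's mean rating over ALL their
    // items -- the "adjusted" part that plain cosine omits.
    public static double similarity(Map<Long, Map<Long, Double>> ratings,
                                    long itemA, long itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (Map<Long, Double> prefs : ratings.values()) {
            Double a = prefs.get(itemA), b = prefs.get(itemB);
            if (a == null || b == null) continue;   // only co-rated users count
            double mean = prefs.values().stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0);
            dot   += (a - mean) * (b - mean);
            normA += (a - mean) * (a - mean);
            normB += (b - mean) * (b - mean);
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> r = new HashMap<>();
        r.put(1L, Map.of(10L, 5.0, 20L, 4.0, 30L, 1.0));
        r.put(2L, Map.of(10L, 4.0, 20L, 5.0, 30L, 2.0));
        System.out.println(similarity(r, 10L, 20L));
    }
}
```

To plug logic like this into Mahout's in-memory recommenders, the same computation would go behind the ItemSimilarity interface rather than a standalone class.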
Re: fast performance way of writing preferences to file?
Posted by PierLorenzo Bianchini <pi...@yahoo.com.INVALID>.
Hi again,
seeing the answers to this question and the other I had posted ("adjusted cosine similarity for item-based recommender?"), I think I should clarify a bit what I'm trying to achieve and why I (believe I should) do things the way I'm doing.
I'm doing a class called "Learning from User-Generated data". Our first assignment deals with analysing the results of various types of recommenders. I'll go as far as saying "old-school" recommenders, given the content of your answers.
We have been introduced to:
* Memory based:
- user-based
- item-based (*with* adjusted cosine similarity!)
- slope-one
- graph-based transitivity
* Model based:
- preprocessed item/user based (? this is unclear to me but I didn't reach this part of the assignment so I'll search for information before I ask questions; I also found an article where they mentioned slope-one amongst the model based; I guess I'll need to do more research on this)
- matrix factorization-based (I saw that SVD is available in Mahout; my project partner is looking into that right now)
We have a *static* training dataset (800.000 <user,movie,preference> triples) and another static dataset for which we have to extract the predicted preferences (200.000 <user,movie> tuples) and write them back to a file (i.e. recompose the <user,movie,preference> triples). Note that this will never go in a production environment, as it is merely a university requirement. For the same reason, I would prefer not to mix up things too much and I'd rather learn step by step (i.e. focus on Mahout for now, before I dig deeper and check the search-based approach, which uses DB-mahout-solr-spark... maybe a bit too much to handle at once with the deadline we were given).
So if I might get back to my original questions (again, I'm sorry for being stubborn but I'm under specific constraints - I'll really try to understand the search-based approach when I have more time) ;)
1. I'm guessing that to implement an adjusted cosine similarity I should extend AbstractSimilarity (or maybe even AbstractRecommender?). Is this right?
2. I still can't believe that it takes more than a few minutes at most to go through my 200.000 lines and find the already calculated preference. What am I doing wrong? :/ Should I store my whole datamodel in a file (how?) and then read through the file? I don't see how this could be faster than just reading the exact value I'm searching for...
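On question 2: if the preferences have already been calculated and sit in a file, an in-memory index makes each lookup O(1), with no repeated file scans. The sketch below (plain Java, hypothetical helper names, not Mahout code) packs each (user, movie) pair into one long key:

```java
import java.util.HashMap;
import java.util.Map;

public class PreferenceLookup {

    // Pack (userId, movieId) into one long key; assumes both ids fit in
    // 32 bits, which holds comfortably for 6000 users and 3900 movies.
    public static long key(long userId, long movieId) {
        return (userId << 32) | (movieId & 0xffffffffL);
    }

    // Index "userId,movieId,preference" lines for O(1) retrieval. 800.000
    // entries fit easily in memory on an 8 GB machine.
    public static Map<Long, Float> index(Iterable<String> lines) {
        Map<Long, Float> prefs = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            prefs.put(key(Long.parseLong(f[0]), Long.parseLong(f[1])),
                      Float.parseFloat(f[2]));
        }
        return prefs;
    }

    public static void main(String[] args) {
        Map<Long, Float> prefs = index(java.util.List.of("1,10,4.5", "2,20,3.0"));
        System.out.println(prefs.get(key(1, 10)));  // 4.5
    }
}
```

Reading the whole file once into a map like this, then answering the 200.000 queries from memory, is the usual alternative to seeking for "the exact value I'm searching for" line by line.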
Thanks again for your answers! Regards,
Pier Lorenzo
Re: fast performance way of writing preferences to file?
Posted by Ted Dunning <te...@gmail.com>.
Are you sure that the problem is writing the results? It seems to me that
the real problem is the use of a user-based recommender.
For such a small data set, for instance, a search-based recommender will be
able to make recommendations in less than a millisecond with multiple
recommendations possible in parallel. This should allow you to do 200,000
recommendations in a few minutes on a single machine.
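The arithmetic behind "a few minutes" is easy to check: 200,000 requests at roughly 1 ms each come to about 3.3 minutes serially, and proportionally less with parallel requests. A one-line sketch (illustrative, not from the thread):

```java
public class Throughput {

    // Total wall-clock minutes for `requests` calls at `msEach` milliseconds
    // apiece, with `parallelism` calls in flight at once.
    public static double minutes(int requests, double msEach, int parallelism) {
        return requests * msEach / parallelism / 1000 / 60;
    }

    public static void main(String[] args) {
        System.out.println(minutes(200_000, 1.0, 1)); // ~3.3 minutes serially
        System.out.println(minutes(200_000, 1.0, 8)); // ~25 seconds with 8-way parallelism
    }
}
```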
With such a small dataset, indicator-based methods may not be the best
option. To improve that, try using something larger such as the million
song dataset. See http://labrosa.ee.columbia.edu/millionsong/
Also, using and estimating ratings is not a particularly good thing to be
doing if you want to build a real recommender.