Posted to user@mahout.apache.org by Stefano Bellasio <st...@gmail.com> on 2011/01/02 10:36:21 UTC

recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Hi guys, happy new year :) Well, after several weeks of testing I finally have a complete Amazon EC2/Hadoop environment working, thanks to the Cloudera EC2 script. Right now I'm doing some tests with MovieLens (the 10M version) and I just need to compute recommendations with different similarities via RecommenderJob; that all works. I ran an Amazon EC2 cluster with 3 instances, 1 master node and 2 worker nodes (large instances), and even though I know the recommender is not fast, I thought 3 instances would be quite fast... my job took about 3 hours to complete for 1 user (I specified the user that needs recommendations in a user.txt file) and just 10 recommendations. So my question is: what is the correct setup for my cluster? How many nodes? How many data nodes, and so on? Is there something I can do to speed up this process? My goal is to compute recommendations on a dataset of about 20-30 GB and 200 million items, so I'm worried about that.

Thanks :) Stefano

Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Thanks Sebastian! I'm working on it, so this evening I will come back with some data and tests based on your hints :) Thanks again
On 2 Jan 2011, at 11:08, Sebastian Schelter wrote:

> Hi Stefano, happy new year too!
> 
> The running time of RecommenderJob is neither proportional to the number
> of users you wanna compute recommendations for nor to the number of
> recommendations per single user. Those parameters just influence the
> last step of the job, but most time will be spent before when computing
> item-item-similarities, which is done independently of the number of
> users you wanna have recommendations for or the number of
> recommendations per user.
> 
> We have some parameters to control the amount of data considered in the
> recommendation process, have you tried adjusting them to your needs? If
> you haven't I think playing with those should be the best place to start
> for you:
> 
>  --maxPrefsPerUser maxPrefsPerUser
> 	Maximum number of preferences considered per user in final
> 	recommendation phase
> 
>  --maxSimilaritiesPerItem maxSimilaritiesPerItem
> 	Maximum number of similarities considered per item
> 
>  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
> 	try to cap the number of cooccurrences per item to this number
> 
> 
> It would be very cool if you could keep us up to date with your progress
> and maybe provide some numbers. I think there are a lot of things in the
> RecommenderJob that could be optimized by us to increase its performance
> and scalability, I think we'd be happy to patch it for you if you
> encounter a problem.
> 
> --sebastian
> 
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Hi Ted, how many nodes would you suggest? Any hints for my other questions? Thanks :)

On 6 Jan 2011, at 17:27, Ted Dunning wrote:

> Hadoop does not usually work so well with 3 nodes.  Many assumptions about
> having a large cluster are baked in and can cause serious problems with
> small clusters.
> 
> On Thu, Jan 6, 2011 at 5:31 AM, Stefano Bellasio
> <st...@gmail.com>wrote:
> 
>> - I need to understand what kind of scalability i obtain with many nodes (3
>> for now, i can arrive to 5), i think that similarities calculation took most
>> of the time, am i right?
>> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Ted Dunning <te...@gmail.com>.
Hadoop does not usually work so well with 3 nodes.  Many assumptions about
having a large cluster are baked in and can cause serious problems with
small clusters.

On Thu, Jan 6, 2011 at 5:31 AM, Stefano Bellasio
<st...@gmail.com>wrote:

> - I need to understand what kind of scalability i obtain with many nodes (3
> for now, i can arrive to 5), i think that similarities calculation took most
> of the time, am i right?
>

Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Hi Ken, thanks. Right now I'm not saving my output data to S3 or EBS, so when my cluster finishes I download the output file and switch off the machines. I was assuming the JobTracker showed jobs in real time, doesn't it?
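
In case it helps, a sketch of copying the job output from HDFS to S3 before shutting the cluster down (the bucket name and credentials are placeholders; this assumes the stock Hadoop s3n filesystem):

hadoop distcp -Dfs.s3n.awsAccessKeyId=YOUR_KEY -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET data/movielens_2gennaio s3n://your-bucket/movielens_2gennaio
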
On 6 Jan 2011, at 18:48, Ken Krugler wrote:

> 
> On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:
> 
>> Ok, so can i continue with just 3 nodes? Im a bit confused right now. With computation time i mean that i need to know how much time takes every test...as i said i can see nothing from my JobTracker, it says the number of nodes but no job active or map/reduce operations, and i dont know why :/
> 
> If it's been more than a day, your job history will disappear (at least with default settings).
> 
> Some other things than can cause jobs on EC2 m1.large instances to run slower than expected:
> 
> * If you're using too much memory on a slave, based on TaskTracker + DataNode + simultaneous active child JVMs for map & reduce tasks, then you can get into swap hell. But that's an easy one to check, just log onto the slave while a job is active, run the top command, and check swap space usage.
> 
> * An m1.large has two drives. With a typical default configuration, you're only using one of these, and thus cutting your I/O performance in half.
> 
> * Depending on where your servers get allocated, network traffic can be going through multiple routers to get between systems. EMR is better at (or will be) getting all of the servers provisioned close to each other.
> 
> * Depending on who you're sharing your virtualized server with, you can get a node that runs much slower than expected. Usually this happens when somebody else is hammering the same disk, from my experience. But it's also usually short-lived, and over time the effects disappear. But for this reason we try to run jobs with the number of reduce tasks set to 1.75 * available slots, to help avoid the lags of waiting for one slow server to complete.
> 
> BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.
> 
> -- Ken
> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 
> 
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Any suggestions, also about the JobTracker issue? Thanks :)

Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
About the performance issues of EC2: yes, I know it is not a perfect environment, but right now I'm looking for some hints about tuning the machines to do a better job with Hadoop and Mahout. My final goal, as I said, is to plot some results for my thesis, so I need to tune everything I can. I'm also reading about JVM tuning in Mahout in Action... I hope it's worth the read.
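
For reference, one common knob of that kind is the heap size of the child JVMs that run the map and reduce tasks; a sketch (the value is just a placeholder, set cluster-wide in mapred-site.xml or per job on the command line):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>

or, per job: hadoop jar ... -Dmapred.child.java.opts=-Xmx1024m
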
On 6 Jan 2011, at 18:48, Ken Krugler wrote:

> 
> On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:
> 
>> Ok, so can i continue with just 3 nodes? Im a bit confused right now. With computation time i mean that i need to know how much time takes every test...as i said i can see nothing from my JobTracker, it says the number of nodes but no job active or map/reduce operations, and i dont know why :/
> 
> If it's been more than a day, your job history will disappear (at least with default settings).
> 
> Some other things than can cause jobs on EC2 m1.large instances to run slower than expected:
> 
> * If you're using too much memory on a slave, based on TaskTracker + DataNode + simultaneous active child JVMs for map & reduce tasks, then you can get into swap hell. But that's an easy one to check, just log onto the slave while a job is active, run the top command, and check swap space usage.
> 
> * An m1.large has two drives. With a typical default configuration, you're only using one of these, and thus cutting your I/O performance in half.
> 
> * Depending on where your servers get allocated, network traffic can be going through multiple routers to get between systems. EMR is better at (or will be) getting all of the servers provisioned close to each other.
> 
> * Depending on who you're sharing your virtualized server with, you can get a node that runs much slower than expected. Usually this happens when somebody else is hammering the same disk, from my experience. But it's also usually short-lived, and over time the effects disappear. But for this reason we try to run jobs with the number of reduce tasks set to 1.75 * available slots, to help avoid the lags of waiting for one slow server to complete.
> 
> BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.
> 
> -- Ken
> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 
> 
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:

> Ok, so can i continue with just 3 nodes? Im a bit confused right  
> now. With computation time i mean that i need to know how much time  
> takes every test...as i said i can see nothing from my JobTracker,  
> it says the number of nodes but no job active or map/reduce  
> operations, and i dont know why :/

If it's been more than a day, your job history will disappear (at  
least with default settings).

Some other things that can cause jobs on EC2 m1.large instances to run slower than expected:

* If you're using too much memory on a slave, based on TaskTracker +  
DataNode + simultaneous active child JVMs for map & reduce tasks, then  
you can get into swap hell. But that's an easy one to check, just log  
onto the slave while a job is active, run the top command, and check  
swap space usage.

* An m1.large has two drives. With a typical default configuration, you're only using one of these, and thus cutting your I/O performance in half (a config sketch follows after this list).

* Depending on where your servers get allocated, network traffic can  
be going through multiple routers to get between systems. EMR is  
better at (or will be) getting all of the servers provisioned close to  
each other.

* Depending on who you're sharing your virtualized server with, you  
can get a node that runs much slower than expected. Usually this  
happens when somebody else is hammering the same disk, from my  
experience. But it's also usually short-lived, and over time the  
effects disappear. But for this reason we try to run jobs with the  
number of reduce tasks set to 1.75 * available slots, to help avoid  
the lags of waiting for one slow server to complete.
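
For the swap and two-drives points above, a rough sketch of what checking and configuring this can look like (mount points and slot counts are placeholders for a typical Hadoop 0.20 setup). On a worker node, while a job is running, check whether tasks are swapping:

top
free -m

In mapred-site.xml / hdfs-site.xml, spread task-local and HDFS data over both ephemeral drives:

<property>
  <name>mapred.local.dir</name>
  <value>/mnt/hadoop/mapred/local,/mnt2/hadoop/mapred/local</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop/dfs/data,/mnt2/hadoop/dfs/data</value>
</property>

And for the reduce-task rule of thumb (1.75 * total reduce slots, e.g. 2 slots per node * 3 workers, so about 10):

hadoop jar mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.reduce.tasks=10 ...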

BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Ted Dunning <te...@gmail.com>.
My experience is that until you get to 5-10 nodes, using Hadoop will be slower than a sequential implementation.

You can definitely continue with 3 nodes as Sean suggests for testing, but I
would not expect this to be a performant solution.

On Thu, Jan 6, 2011 at 9:00 AM, Stefano Bellasio
<st...@gmail.com>wrote:

> Ok, so can i continue with just 3 nodes? Im a bit confused right now. With
> computation time i mean that i need to know how much time takes every
> test...as i said i can see nothing from my JobTracker, it says the number of
> nodes but no job active or map/reduce operations, and i dont know why :/

Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By computation time I mean that I need to know how much time every test takes... as I said, I can see nothing from my JobTracker: it shows the number of nodes but no active jobs or map/reduce operations, and I don't know why :/

On 6 Jan 2011, at 17:52, Sean Owen wrote:

> Those numbers seem "reasonable" to a first approximation, maybe a
> little higher than I would have expected given past experience.
> 
> You should be able to increase speed with more nodes, sure, but I use
> 3 for testing too.
> 
> The jobs are I/O bound for sure. I don't think you will see
> appreciable difference with different algorithms.
> 
> Yes the amount of data used in the similarity computation is the big
> factor for time. You probably need to tell it to keep fewer item-item
> pairs with the "max" parameters you  mentioned earlier.
> 
> mapred.num.tasks controls the number of mappers -- or at leasts
> suggests it to Hadoop.
> 
> What do you mean about the time of computation? The job tracker shows
> you when the individual tasks start and finish.
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Sean Owen <sr...@gmail.com>.
Those numbers seem "reasonable" to a first approximation, maybe a
little higher than I would have expected given past experience.

You should be able to increase speed with more nodes, sure, but I use
3 for testing too.

The jobs are I/O bound for sure. I don't think you will see
appreciable difference with different algorithms.

Yes the amount of data used in the similarity computation is the big
factor for time. You probably need to tell it to keep fewer item-item
pairs with the "max" parameters you  mentioned earlier.

mapred.num.tasks controls the number of mappers -- or at least suggests it to Hadoop.
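
For illustration, assuming the standard Hadoop 0.20 names for these hints (mapred.map.tasks and mapred.reduce.tasks) are what is meant here, they can be passed per job like this (the numbers are just placeholders):

hadoop jar mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.map.tasks=12 -Dmapred.reduce.tasks=6 ...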

What do you mean about the time of computation? The job tracker shows
you when the individual tasks start and finish.

On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
<st...@gmail.com> wrote:
> Hi guys, well i'm doing some tests in those days and i have some questions. Here there is my environment and basic configuration:
>
> 1) Amazon EC2 Cluster powered by Cloudera script with Apache Whirr, i'm using a 3 node with large instances + one master node to control the cluster.
> 2) Movielens data set, based on 100k, 1 mln and 10mln ... my tests right now are on 10 mln versions.
>
> This is the command that i'm using to start my cluster:
>
> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
>
> I'm trying different values for :
>
> maxSimilaritiesPerItem
> maxPrefsPerUser
> maxCooccurrencesPerItem
>
> and using about 10 users per time. With this command, 10 mln user data set, my cluster took more than 4 hours (with 3 nodes) to give recommendations. Is a good time?
>
>
> Well, right now i have 2 goals, and im posting here to request your help to figure out some problems :) My primary goal is to run item-based recommendations and see what happens when i change the parameters in time and performance of my cluster. Also, i need to look at the similarities, i will be test three of them: cousine, pearson, and co-occurence. Good choices? I noted also that all the similarities computation is in RAM (right?) so my matrix is built and stored in RAM, is there an other way to do that?
>
> - I need to understand what kind of scalability i obtain with many nodes (3 for now, i can arrive to 5), i think that similarities calculation took most of the time, am i right?
>
> - I know there is something like mapred.task to define how many instances some task can use...do i need that? How can i specify this?
>
> - I need to see the exact time of each computation, i'm looking to jobtracker but seems that never happens in my cluster even if job (with mapping and reducing) is running. Is there another way to know the perfect time of any computation?
>
> - Finally, i will take all the data and try to plot them to figure out some good trends based on number of nodes, time and data set dimension.
>
> Well, any suggestion you want to give me is accepted :) Thank you guys
>
>

Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
Hi guys, I'm doing some tests these days and I have some questions. Here is my environment and basic configuration:

1) Amazon EC2 cluster powered by the Cloudera script with Apache Whirr; I'm using 3 nodes with large instances + one master node to control the cluster.
2) MovieLens data set, in the 100k, 1M and 10M versions... my tests right now are on the 10M version.

This is the command that I'm using to run the job:

hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt

I'm trying different values for:

maxSimilaritiesPerItem
maxPrefsPerUser
maxCooccurrencesPerItem

and using about 10 users per run. With this command and the 10M data set, my cluster took more than 4 hours (with 3 nodes) to give recommendations. Is that a good time?


Right now I have 2 goals, and I'm posting here to ask for your help with some problems :) My primary goal is to run item-based recommendations and see what happens to the running time and performance of my cluster when I change the parameters. Also, I need to look at the similarities; I will test three of them: cosine, Pearson and co-occurrence. Good choices? I also noticed that the similarity computation seems to happen in RAM (right?), so my matrix is built and stored in RAM; is there another way to do that?

- I need to understand what kind of scalability I get with more nodes (3 for now, I can go up to 5); I think the similarity calculation takes most of the time, am I right?

- I know there is something like mapred.task to define how many instances a task can use... do I need that? How can I specify it?

- I need to see the exact time of each computation. I'm looking at the JobTracker, but nothing ever seems to show up for my cluster even while a job (with mapping and reducing) is running. Is there another way to get the exact time of each computation? (See the note at the end of this message.)

- Finally, I will take all the data and try to plot it, to figure out some trends based on the number of nodes, time and data set size.

Well, any suggestion you want to give me is welcome :) Thank you guys
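
Regarding the exact-time question above, a small sketch of one alternative to the JobTracker web UI: with default settings the per-job and per-task timings should also be written to the job history under the job's output directory, and can be printed with (output path taken from the command above):

hadoop job -history data/movielens_2gennaio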


On 2 Jan 2011, at 11:24, Sebastian Schelter wrote:

> On 02.01.2011 11:21, Stefano Bellasio wrote:
>> One question related to users.txt where i specify the users number: how can i type more users? what format? right now i think is one number for each row, is right? Thanks
>> 
> 
> Exactly, it's one userID per line.
> 
> --sebastian
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Sebastian Schelter <ss...@apache.org>.
On 02.01.2011 11:21, Stefano Bellasio wrote:
> One question related to users.txt where i specify the users number: how can i type more users? what format? right now i think is one number for each row, is right? Thanks
> 

Exactly, it's one userID per line.
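
For illustration, a users.txt asking for recommendations for three users would simply contain (the IDs below are placeholders):

123
456
789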

--sebastian



Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Stefano Bellasio <st...@gmail.com>.
One question related to users.txt, where I specify the users: how do I list more than one user? What format? Right now I think it is one number per row, is that right? Thanks

On 2 Jan 2011, at 11:08, Sebastian Schelter wrote:

> Hi Stefano, happy new year too!
> 
> The running time of RecommenderJob is neither proportional to the number
> of users you wanna compute recommendations for nor to the number of
> recommendations per single user. Those parameters just influence the
> last step of the job, but most time will be spent before when computing
> item-item-similarities, which is done independently of the number of
> users you wanna have recommendations for or the number of
> recommendations per user.
> 
> We have some parameters to control the amount of data considered in the
> recommendation process, have you tried adjusting them to your needs? If
> you haven't I think playing with those should be the best place to start
> for you:
> 
>  --maxPrefsPerUser maxPrefsPerUser
> 	Maximum number of preferences considered per user in final
> 	recommendation phase
> 
>  --maxSimilaritiesPerItem maxSimilaritiesPerItem
> 	Maximum number of similarities considered per item
> 
>  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
> 	try to cap the number of cooccurrences per item to this number
> 
> 
> It would be very cool if you could keep us up to date with your progress
> and maybe provide some numbers. I think there are a lot of things in the
> RecommenderJob that could be optimized by us to increase its performance
> and scalability, I think we'd be happy to patch it for you if you
> encounter a problem.
> 
> --sebastian
> 
> 


Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Stefano, happy new year too!

The running time of RecommenderJob is proportional neither to the number of users you want to compute recommendations for nor to the number of recommendations per single user. Those parameters only influence the last step of the job; most of the time is spent earlier, computing the item-item similarities, which is done independently of the number of users you want recommendations for or the number of recommendations per user.

We have some parameters to control the amount of data considered in the recommendation process; have you tried adjusting them to your needs? If you haven't, I think playing with those is the best place to start:

  --maxPrefsPerUser maxPrefsPerUser
	Maximum number of preferences considered per user in final
	recommendation phase

  --maxSimilaritiesPerItem maxSimilaritiesPerItem
	Maximum number of similarities considered per item

  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
	try to cap the number of cooccurrences per item to this number
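
For illustration, a sketch of how these options can be combined in a single run (the values and paths below are just placeholders; the jar and class are the same ones used elsewhere in this thread):

hadoop jar mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=output --maxPrefsPerUser 20 --maxSimilaritiesPerItem 100 --maxCooccurrencesPerItem 50 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt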


It would be very cool if you could keep us up to date with your progress and maybe provide some numbers. I think there are a lot of things in RecommenderJob that we could optimize to improve its performance and scalability, and we'd be happy to patch it for you if you encounter a problem.

--sebastian


On 02.01.2011 10:36, Stefano Bellasio wrote:
> Hi guys, happy new year :) well, after several weeks of testing finally i had a complete amazon ec2-hadoop working environment thanks to Cloudera ec2 script. Well, right now i'm doing some test with movielens (10 mln version) and i need just to compute recommendations with different similirity by RecommenderJob, all is ok. I ran Amazon EC2 cluster with 3 instances, 1 master node and 2 worker node (large instance) but even if i know that recommender is not fast, i was thinking that 3 instances are very fast...my process took about 3 hours to complete for 1 users (i specified the user that needs recommendation with a user.txt file)....and just 10 recommendations. So, my question is, what is the correct setup for my cluster? How many nodes? How many data nodes and so on? Is there something that i can do to speed up this process...my goal is to recommend with a dataset of about 20/30 GB and 200 milions of items...so i'm worried about that. 
> 
> Thanks :) Stefano