Posted to user@mahout.apache.org by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com> on 2012/05/10 08:33:12 UTC

How to run a mahout clustering job through a web service

Hi, 

 

I would like to run KMeans clustering job from a web application. So I
want the Mahout jobs to be exposed as a web service or at least HTTP
servlet. Is it possible? Any suggestions?

 

Regards,

Anand.C

 


Re: 40 hours to run 1/2 Netflix Data?

Posted by 许春玲 <xu...@sari.ac.cn>.
Ted,
Yes, memory per node is only 16G. Cache memory usage is at 100%, as the attached file shows, and CPU is at 100% too.
The maximum local disk space for Hadoop temp data is 160G, and it gets used up completely.
It seems the key point is the sixth step of the recommender, since the job fails at that step every time.

I ran several tests; the logs are attached. From the first to the fifth run I cut the data size in half each time (see below), and every time the job failed at the sixth step. Even when I cut the data down to about 100M, roughly the size of the
GroupLens movie-rating file, it still failed (by the way, running the 100M GroupLens movie ratings takes about 16 minutes).
-rw-r--r--   3 hdfs supergroup 1505255088 2012-04-20 16:43 /user/hdfs/NetFlix_data
-rw-r--r--   3 hdfs supergroup 1058793314 2012-04-24 10:45 /user/hdfs/netFlixData2
-rw-r--r--   3 hdfs supergroup  793294103 2012-04-26 08:59 /user/hdfs/netFlixData3
-rw-r--r--   3 hdfs supergroup  476054038 2012-04-27 09:51 /user/hdfs/netFlixData4
-rw-r--r--   3 hdfs supergroup  135210043 2012-04-28 13:53 /user/hdfs/netFlixData6

So I cut the user IDs in half to reduce the size of the matrix. When I do this, the recommender finishes, but it takes about 40 hours.
The mapred configuration of my cluster is:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>

<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512M</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx512M</value>
</property>
<property>
  <name>mapred.child.ulimit</name>
  <value>-Xmx600M</value>
</property> 
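
A hedged aside on this configuration: 7 map slots plus 7 reduce slots on an 8-core, 16 GB node allows up to 14 concurrent child JVMs, which is consistent with the swapping Ted suspects below. Note also that mapred.child.ulimit expects a virtual-memory limit in kilobytes, not a JVM flag, so a value of -Xmx600M is likely rejected or ignored. A possible adjustment, with illustrative values only (not tuned for this cluster):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1536m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1536m</value>
</property>
<!-- virtual-memory cap in KB, roughly 2-3x the heap -->
<property>
  <name>mapred.child.ulimit</name>
  <value>4194304</value>
</property>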


> -----Original Message-----
> From: "Ted Dunning" <te...@gmail.com>
> Sent: Monday, May 14, 2012
> To: user@mahout.apache.org, ssc@apache.org
> Cc: 
> Subject: Re: 40 hours to run 1/2 Netflix Data?
> 
> 许春玲,
> 
> The nodes here are relatively under-provisioned with respect to memory.
>  Current standard practice is to provide 4-6 GB per core.  These
> machines have half to a third that much memory.  As a result, it is pretty
> easy to cause swapping if you have too many map or reduce slots configured
> on these machines.  That would be my first suspicion.
> 
> A second worry is that you apparently only have a single disk per node.
>  This will substantially slow down your processing.  Even normal Hadoop can
> move 300 MB/s/node with more drives and optimized systems like MapR can
> move more than 1GB/s/node.  With a single drive, you are going to be
> severely limited in terms of I/O bandwidth.
> 
> Additionally, any swapping that you are doing is going to eat away even
> further.
> 
> Have you looked at your swap rates, I/O rates, network rates and CPU usage
> during the execution of this program?
> 
> On Sun, May 13, 2012 at 10:44 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
> > Hi,
> >
> > something must be going completely wrong in this experiment. Please use
> > the latest version of Mahout (Mahout 0.6) and tell us exactly at which
> > point the job fails.
> >
> > I have been able to process datasets seven times as large as Netflix
> > (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r) in a few
> > hours on a 6 machine cluster.
> >
> > --sebastian
> >
> > On 14.05.2012 03:44, 许春玲 wrote:
> > > Hi,
> > >
> > >    I ran the item recommender on the Netflix data, but it always failed for
> > > lack of local disk space. So I cut the user IDs in half (by user ID, not by
> > > user count) to reduce the temp data. Now it finishes, but it
> > > takes 40 hours. The command is as follows:
> > >
> > > hadoop jar
> > > /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar
> > > org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> > > -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196
> > > -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8
> > >
> > > my hadoop cluster:
> > >
> > > 28 nodes
> > > 16G memory per node
> > > 8 core per node
> > > 250G local disk per node
> > >
> > >
> > >
> > >
> >
> >





Re: 40 hours to run 1/2 Netflix Data?

Posted by Ted Dunning <te...@gmail.com>.
许春玲,

The nodes here are relatively under-provisioned with respect to memory.
 Current standard practice is to provide 4-6 GB per core.  These
machines have half to a third that much memory.  As a result, it is pretty
easy to cause swapping if you have too many map or reduce slots configured
on these machines.  That would be my first suspicion.
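(Concretely: 8 cores at 4-6 GB per core would call for 32-48 GB per node, versus the 16 GB these nodes have; and 14 task slots with 512 MB heaps, plus per-JVM overhead, already approach half of physical memory before the OS, DataNode, and TaskTracker take their share.)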

A second worry is that you apparently only have a single disk per node.
 This will substantially slow down your processing.  Even normal Hadoop can
move 300 MB/s/node with more drives and optimized systems like MapR can
move more than 1GB/s/node.  With a single drive, you are going to be
severely limited in terms of I/O bandwidth.

Additionally, any swapping that you are doing is going to eat away even
further.

Have you looked at your swap rates, I/O rates, network rates and CPU usage
during the execution of this program?

On Sun, May 13, 2012 at 10:44 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> something must be going completely wrong in this experiment. Please use
> the latest version of Mahout (Mahout 0.6) and tell us exactly at which
> point the job fails.
>
> I have been able to process datasets seven times as large as Netflix
> (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r) in a few
> hours on a 6 machine cluster.
>
> --sebastian
>
> On 14.05.2012 03:44, 许春玲 wrote:
> > Hi,
> >
> >    I ran the item recommender on the Netflix data, but it always failed for
> > lack of local disk space. So I cut the user IDs in half (by user ID, not by
> > user count) to reduce the temp data. Now it finishes, but it
> > takes 40 hours. The command is as follows:
> >
> > hadoop jar
> > /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar
> > org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> > -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196
> > -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8
> >
> > my hadoop cluster:
> >
> > 28 nodes
> > 16G memory per node
> > 8 core per node
> > 250G local disk per node
> >
> >
> >
> >
>
>

Re: 40 hours to run 1/2 Netflix Data?

Posted by Sebastian Schelter <ss...@apache.org>.
Hi,

something must be going completely wrong in this experiment. Please use
the latest version of Mahout (Mahout 0.6) and tell us exactly at which
point the job fails.

I have been able to process datasets seven times as large as Netflix
(http://webscope.sandbox.yahoo.com/catalog.php?datatype=r) in a few
hours on a 6 machine cluster.

--sebastian

On 14.05.2012 03:44, 许春玲 wrote:
> Hi,
> 
>    I ran the item recommender on the Netflix data, but it always failed for
> lack of local disk space. So I cut the user IDs in half (by user ID, not by user count) to reduce the temp data. Now it finishes, but it
> takes 40 hours. The command is as follows:
> 
> hadoop jar /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196 -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8
> 
> my hadoop cluster:
> 
> 28 nodes
> 16G memory per node
> 8 core per node
> 250G local disk per node
> 
> 
> 
> 


40 hours to run 1/2 Netflix Data?

Posted by 许春玲 <xu...@sari.ac.cn>.
Hi,

   I ran the item recommender on the Netflix data, but it always failed for
lack of local disk space. So I cut the user IDs in half (by user ID, not by user count) to reduce the temp data. Now it finishes, but it
takes 40 hours. The command is as follows:

hadoop jar /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196 -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8

my hadoop cluster:

28 nodes
16G memory per node
8 core per node
250G local disk per node
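
Since RecommenderJob implements Hadoop's Tool interface, the same run can also be launched programmatically instead of via hadoop jar. A minimal sketch against the Mahout 0.5-era API; --input/--output are the standard AbstractJob options, but flag handling may differ slightly between versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommender {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mirror the -D options from the command line above.
    conf.setInt("mapred.map.tasks", 196);
    conf.setInt("mapred.reduce.tasks", 196);
    int exitCode = ToolRunner.run(conf, new RecommenderJob(), new String[] {
        "--input", "NetFlix_data_new",   // HDFS input path, as above
        "--output", "output_netflix8"    // HDFS output directory
    });
    System.exit(exitCode);
  }
}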





Re: How to run a mahout clustering job through a web service

Posted by Suneel Marthi <su...@yahoo.com>.
Why is the Hadoop JobTracker not useful to you?



________________________________
 From: "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>
To: user@mahout.apache.org 
Sent: Sunday, May 13, 2012 2:14 AM
Subject: RE: How to run a mahout clustering job through a web service
 
Hi, 

I found this toolkit http://code.google.com/p/hadoop-toolkit/

I think it mostly allows monitoring. I want some kind of framework that allows submitting jobs, tracking jobs, and browsing tasks. Please let me know if there is any such framework.

Meanwhile, I am exploring Hue to see whether it fits my needs.

Regards,
Anand.C

-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com] 
Sent: Friday, May 11, 2012 2:36 PM
To: user@mahout.apache.org
Subject: Re: How to run a mahout clustering job through a web service

The recommender servlet in the examples ran (is it still there?) an
online recommender, not the Hadoop recommender.

"Mahout as a web service" is not a Mahout problem. You want a toolkit
for scheduling and monitoring Hadoop jobs on clusters. There are a few
of these. Once you have this, you can run Mahout on the cluster.

On Thu, May 10, 2012 at 8:22 PM, Chandra Mohan, Ananda Vel Murugan
<An...@honeywell.com> wrote:
> Hi,
>
> My problem statement is more or less similar. I assume your case is a
> scheduled job.  Though I am interested in that, I want to be able to
> execute clustering on demand too.  I saw an example where a
> recommender servlet was used to trigger a recommender job, but I could not
> find more information on it. I don't know whether Hadoop's parallel
> computing capabilities can be harnessed this way. If you have details on your
> implementation, please share them. Thanks!
>
> Regards,
> Anand.C
>
> -----Original Message-----
> From: Saikat Kanjilal [mailto:sxk1969@hotmail.com]
> Sent: Thursday, May 10, 2012 8:30 PM
> To: user@mahout.apache.org
> Subject: RE: How to run a mahout clustering job through a web service
>
>
> Hi Anand, we're doing something similar. K-means should in general run
> asynchronously and dump its results into a low-latency database (something
> like Cassandra) that your web application can then query to
> return results. So in a nutshell you will have a real-time component
> that serves up the clustering results and an offline component that
> computes your k-means clusters.  Let me know if you want deeper details.
> Regards
>
>> Subject: How to run a mahout clustering job through a web service
>> Date: Thu, 10 May 2012 12:03:12 +0530
>> From: Ananda.Murugan@honeywell.com
>> To: user@mahout.apache.org
>>
>> Hi,
>>
>>
>>
>> I would like to run KMeans clustering job from a web application. So I
>> want the Mahout jobs to be exposed as a web service or at least HTTP
>> servlet. Is it possible? Any suggestions?
>>
>>
>>
>> Regards,
>>
>> Anand.C
>>
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

RE: How to run a mahout clustering job through a web service

Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Hi, 

I found this toolkit http://code.google.com/p/hadoop-toolkit/

I think it mostly allows monitoring. I want some kind of framework that allows submitting jobs, tracking jobs, and browsing tasks. Please let me know if there is any such framework.

Meanwhile, I am exploring Hue to see whether it fits my needs.

Regards,
Anand.C
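
For reference, the generic pattern such a framework builds on is Hadoop's own client API: submit a job asynchronously and poll its progress. A minimal sketch using the old mapred API (Hadoop 0.20/1.x era); note that Mahout's drivers submit their own sequence of MapReduce jobs internally, so this shows the underlying mechanism rather than a drop-in Mahout wrapper:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndTrack {
  public static void main(String[] args) throws Exception {
    JobConf jobConf = new JobConf();
    // ... set mapper, reducer, and input/output paths here ...
    JobClient client = new JobClient(jobConf);
    RunningJob job = client.submitJob(jobConf);  // returns immediately
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%, reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);                        // poll every five seconds
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}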

-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com] 
Sent: Friday, May 11, 2012 2:36 PM
To: user@mahout.apache.org
Subject: Re: How to run a mahout clustering job through a web service

The recommender servlet in the examples ran (is it still there?) an
online recommender, not the Hadoop recommender.

"Mahout as a web service" is not a Mahout problem. You want a toolkit
for scheduling and monitoring Hadoop jobs on clusters. There are a few
of these. Once you have this, you can run Mahout on the cluster.

On Thu, May 10, 2012 at 8:22 PM, Chandra Mohan, Ananda Vel Murugan
<An...@honeywell.com> wrote:
> Hi,
>
> My problem statement is more or less similar. I assume your case is a
> scheduled job.  Though I am interested in that, I want to be able to
> execute clustering on demand too.  I saw an example where a
> recommender servlet was used to trigger a recommender job, but I could not
> find more information on it. I don't know whether Hadoop's parallel
> computing capabilities can be harnessed this way. If you have details on your
> implementation, please share them. Thanks!
>
> Regards,
> Anand.C
>
> -----Original Message-----
> From: Saikat Kanjilal [mailto:sxk1969@hotmail.com]
> Sent: Thursday, May 10, 2012 8:30 PM
> To: user@mahout.apache.org
> Subject: RE: How to run a mahout clustering job through a web service
>
>
> Hi Anand, we're doing something similar. K-means should in general run
> asynchronously and dump its results into a low-latency database (something
> like Cassandra) that your web application can then query to
> return results. So in a nutshell you will have a real-time component
> that serves up the clustering results and an offline component that
> computes your k-means clusters.  Let me know if you want deeper details.
> Regards
>
>> Subject: How to run a mahout clustering job through a web service
>> Date: Thu, 10 May 2012 12:03:12 +0530
>> From: Ananda.Murugan@honeywell.com
>> To: user@mahout.apache.org
>>
>> Hi,
>>
>>
>>
>> I would like to run KMeans clustering job from a web application. So I
>> want the Mahout jobs to be exposed as a web service or at least HTTP
>> servlet. Is it possible? Any suggestions?
>>
>>
>>
>> Regards,
>>
>> Anand.C
>>
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: How to run a mahout clustering job through a web service

Posted by Lance Norskog <go...@gmail.com>.
The recommender servlet in the examples ran (is it still there?) an
online recommender, not the Hadoop recommender.

"Mahout as a web service" is not a Mahout problem. You want a toolkit
for scheduling and monitoring Hadoop jobs on clusters. There are a few
of these. Once you have this, you can run Mahout on the cluster.
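
For the online (non-Hadoop) route mentioned above, the Taste API can back a simple servlet directly. A minimal sketch, assuming Mahout's in-memory user-based recommender and a hypothetical ratings file at /data/ratings.csv; an in-memory model is fine for small data, not for Netflix-scale input:

import java.io.File;
import java.io.IOException;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendServlet extends HttpServlet {
  private Recommender recommender;

  @Override
  public void init() throws ServletException {
    try {
      DataModel model = new FileDataModel(new File("/data/ratings.csv"));
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      recommender = new GenericUserBasedRecommender(
          model, new NearestNUserNeighborhood(25, similarity, model), similarity);
    } catch (Exception e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    try {
      long userID = Long.parseLong(req.getParameter("userID"));
      // Top-10 recommendations for this user, one "itemID<TAB>score" per line.
      List<RecommendedItem> items = recommender.recommend(userID, 10);
      for (RecommendedItem item : items) {
        resp.getWriter().println(item.getItemID() + "\t" + item.getValue());
      }
    } catch (Exception e) {
      resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.toString());
    }
  }
}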

On Thu, May 10, 2012 at 8:22 PM, Chandra Mohan, Ananda Vel Murugan
<An...@honeywell.com> wrote:
> Hi,
>
> My problem statement is more or less similar. I assume your case is a
> scheduled job.  Though I am interested in that, I want to be able to
> execute clustering on demand too.  I saw an example where a
> recommender servlet was used to trigger a recommender job, but I could not
> find more information on it. I don't know whether Hadoop's parallel
> computing capabilities can be harnessed this way. If you have details on your
> implementation, please share them. Thanks!
>
> Regards,
> Anand.C
>
> -----Original Message-----
> From: Saikat Kanjilal [mailto:sxk1969@hotmail.com]
> Sent: Thursday, May 10, 2012 8:30 PM
> To: user@mahout.apache.org
> Subject: RE: How to run a mahout clustering job through a web service
>
>
> Hi Anand, we're doing something similar. K-means should in general run
> asynchronously and dump its results into a low-latency database (something
> like Cassandra) that your web application can then query to
> return results. So in a nutshell you will have a real-time component
> that serves up the clustering results and an offline component that
> computes your k-means clusters.  Let me know if you want deeper details.
> Regards
>
>> Subject: How to run a mahout clustering job through a web service
>> Date: Thu, 10 May 2012 12:03:12 +0530
>> From: Ananda.Murugan@honeywell.com
>> To: user@mahout.apache.org
>>
>> Hi,
>>
>>
>>
>> I would like to run KMeans clustering job from a web application. So I
>> want the Mahout jobs to be exposed as a web service or at least HTTP
>> servlet. Is it possible? Any suggestions?
>>
>>
>>
>> Regards,
>>
>> Anand.C
>>
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

RE: How to run a mahout clustering job through a web service

Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Hi, 

My problem statement is more or less similar. I assume your case is a
scheduled job.  Though I am interested in that, I want to be able to
execute clustering on demand too.  I saw an example where a
recommender servlet was used to trigger a recommender job, but I could not
find more information on it. I don't know whether Hadoop's parallel
computing capabilities can be harnessed this way. If you have details on your
implementation, please share them. Thanks!

Regards,
Anand.C

-----Original Message-----
From: Saikat Kanjilal [mailto:sxk1969@hotmail.com] 
Sent: Thursday, May 10, 2012 8:30 PM
To: user@mahout.apache.org
Subject: RE: How to run a mahout clustering job through a web service


Hi Anand, we're doing something similar. K-means should in general run
asynchronously and dump its results into a low-latency database (something
like Cassandra) that your web application can then query to
return results. So in a nutshell you will have a real-time component
that serves up the clustering results and an offline component that
computes your k-means clusters.  Let me know if you want deeper details.
Regards

> Subject: How to run a mahout clustering job through a web service
> Date: Thu, 10 May 2012 12:03:12 +0530
> From: Ananda.Murugan@honeywell.com
> To: user@mahout.apache.org
> 
> Hi, 
> 
>  
> 
> I would like to run KMeans clustering job from a web application. So I
> want the Mahout jobs to be exposed as a web service or at least HTTP
> servlet. Is it possible? Any suggestions?
> 
>  
> 
> Regards,
> 
> Anand.C
> 
>  
> 

RE: How to run a mahout clustering job through a web service

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Hi Anand, we're doing something similar. K-means should in general run asynchronously and dump its results into a low-latency database (something like Cassandra) that your web application can then query to return results. So in a nutshell you will have a real-time component that serves up the clustering results and an offline component that computes your k-means clusters. Let me know if you want deeper details.
Regards
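
A minimal sketch of the offline half of this architecture: kicking off the Mahout k-means driver from the web tier without blocking the request thread. It assumes the Mahout 0.5/0.6 KMeansDriver and its standard CLI flags (flag names from memory; check the version in use), and loading the results into the serving database is left out:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

public class AsyncKMeansLauncher {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  /** Submit a k-means run in the background; the caller returns immediately. */
  public void launch(final String input, final String output) {
    pool.submit(new Runnable() {
      public void run() {
        try {
          ToolRunner.run(new Configuration(), new KMeansDriver(), new String[] {
              "-i", input,                        // input vectors on HDFS
              "-o", output,                       // cluster output directory
              "-c", output + "/initial-clusters", // where sampled seeds go
              "-k", "20",                         // sample 20 initial centroids
              "-x", "10",                         // at most 10 iterations
              "-cl"                               // also assign points to clusters
          });
          // On success, load the clusters into the low-latency store
          // (e.g. Cassandra) that the real-time component queries.
        } catch (Exception e) {
          e.printStackTrace();  // report to a proper job-status store instead
        }
      }
    });
  }
}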

> Subject: How to run a mahout clustering job through a web service
> Date: Thu, 10 May 2012 12:03:12 +0530
> From: Ananda.Murugan@honeywell.com
> To: user@mahout.apache.org
> 
> Hi, 
> 
>  
> 
> I would like to run KMeans clustering job from a web application. So I
> want the Mahout jobs to be exposed as a web service or at least HTTP
> servlet. Is it possible? Any suggestions?
> 
>  
> 
> Regards,
> 
> Anand.C
> 
>  
>