Posted to user@hadoop.apache.org by Jason Yang <li...@gmail.com> on 2012/09/23 18:31:19 UTC

How to run multiple jobs at the same time?

Hi, all

I have implemented a K-Means algorithm in MapReduce. The program consists
of many iterations, and each iteration is a MapReduce job. Here is my
pseudo-code:

-----
int count = 0;
do
{
    ...
    SET input path = output path of last iteration;
    SET output path = new path(count);
    ...
    runJob;     // blocks until this iteration's job has finished
    count++;    // advance to the next iteration
}
while ((!converged) && (count < maxCount));
-----

My question is: what should I do if I want to apply this algorithm to
multiple data sets at the same time?

Because each iteration depends on the previous one, I have to use
JobClient.runJob(), which blocks until the iteration has finished.

Could I use threads?

BTW, I'm using hadoop-0.20.2
-- 
YANG, Lin

Re: How to run multiple jobs at the same time?

Posted by Marcos Ortiz <ml...@uci.cu>.
Apache Mahout was built for that.
Look here:
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

If you don't want to use Mahout's approach (highly recommended), you
can use the MultipleInputs class for that:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

An example from Tom White's book using MultipleInputs:
MultipleInputs.addInputPath(job, ncdcInputPath,
         TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
         TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
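
On the threading question itself: since JobClient.runJob() only blocks the
calling thread, one option is to keep each data set's iteration loop
sequential and run the loops side by side in a thread pool. Below is a
minimal sketch of that idea, not a tested Hadoop driver. IterationJob,
runChain, runAll and the "/iter-N" path naming are hypothetical names made up
for this illustration. In a real driver, run() would presumably build a fresh
JobConf for the given paths and call JobClient.runJob() on it. The sketch uses
Java 8 lambdas; on an older JVM an anonymous Callable would do the same job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelChains {

    /** Hypothetical stand-in for one blocking iteration (e.g. a wrapper around JobClient.runJob()). */
    interface IterationJob {
        /** Runs one iteration, blocks until it is done, and returns true once converged. */
        boolean run(String inputPath, String outputPath) throws Exception;
    }

    /** The do/while loop from the question, for a single data set. Returns the last output path. */
    static String runChain(String dataSet, IterationJob job, int maxCount) throws Exception {
        String input = dataSet;
        int count = 0;
        boolean converged;
        do {
            String output = dataSet + "/iter-" + count;  // assumed naming scheme
            converged = job.run(input, output);          // blocks this thread only
            input = output;                              // next iteration reads this output
            count++;
        } while (!converged && count < maxCount);
        return input;
    }

    /** Runs one chain per data set concurrently; each chain stays strictly sequential. */
    static Map<String, String> runAll(List<String> dataSets, IterationJob job, int maxCount) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, dataSets.size()));
        try {
            Map<String, Future<String>> futures = new LinkedHashMap<>();
            for (String dataSet : dataSets) {
                futures.put(dataSet, pool.submit(() -> runChain(dataSet, job, maxCount)));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : futures.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());  // wait for each chain
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake job for demonstration: reports convergence on the third iteration.
        IterationJob fake = (in, out) -> out.endsWith("iter-2");
        System.out.println(runAll(Arrays.asList("dataA", "dataB"), fake, 10));
    }
}
```

Each chain should use its own job configuration and its own output paths so
the threads do not interfere with each other. Whether the jobs really overlap
on the cluster depends on the scheduler and on free task slots. The old API
also has JobClient.submitJob(), which returns a RunningJob without blocking
and can be polled instead of using threads.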



-- 

Marcos Luis Ortíz Valmaseda
*Data Engineer && Sr. System Administrator at UCI*
about.me/marcosortiz <http://about.me/marcosortiz>
My Blog <http://marcosluis2186.posterous.com>
Tumblr's blog <http://marcosortiz.tumblr.com/>
@marcosluis2186 <http://twitter.com/marcosluis2186>



