Posted to common-user@hadoop.apache.org by Mori Bellamy <mb...@apple.com> on 2008/07/09 22:28:21 UTC
How to chain multiple hadoop jobs?
Hey all,
I'm trying to chain multiple mapreduce jobs together to accomplish a
complex task. I believe that the way to do it is as follows:
JobConf conf = new JobConf(getConf(), MyClass.class);
// configure job: set mappers, reducers, etc.
SequenceFileOutputFormat.setOutputPath(conf, myPath1);
JobClient.runJob(conf);
// second job, which reads the first job's output
JobConf conf2 = new JobConf(getConf(), MyClass.class);
SequenceFileInputFormat.setInputPath(conf2, myPath1);
// more configuration...
JobClient.runJob(conf2);
Is this the canonical way to chain jobs? I'm having some trouble with
this method -- for especially long jobs, the latter MR tasks sometimes
do not start up.
Re: How to chain multiple hadoop jobs?
Posted by tim robertson <ti...@gmail.com>.
Have you considered http://www.cascading.org?
On Thu, Jul 10, 2008 at 10:44 AM, Amar Kamat <am...@yahoo-inc.com> wrote:
> Deyaa Adranale wrote:
>
>> I have checked the code of JobControl; it submits a set of jobs asynchronously
>> and provides methods for checking their status, suspending them, and so on.
>>
> It also supports job dependencies. A particular job can depend on other
> jobs and hence it supports chaining. *JobControl* accepts *Job* which
> internally has a list of jobs it depends on.
> Amar
>
>
>> i think what Mori means by chaining jobs is to execute them one after
>> another, so this class might not help him.
>> i have run chained jobs like Mori's code (even with a for loop and a call
>> to runJob inside it). In my case, I can't use the JobControl, because every
>> job needs information from the output of the previous job, so they have to
>> be chained.
>> Till now, I have never encountered problems when running chained jobs,
>> although I have not tested it with datasets larger than a few hundred KB.
>>
>> hope this helps,
>> Deyaa
>>
>> Lukas Vlcek wrote:
>>
>>> Hi,
>>>
>>> Maybe you should take a look at JobControl (see TestJobControl.java for
>>> a concrete example).
>>>
>>> Regards,
>>> Lukas
>>>
>>> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com>
>>> wrote:
>>>
>>>
>>>
>>>> Hey all,
>>>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>>>> complex task. I believe that the way to do it is as follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class);
>>>> //configure job.... set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job
>>>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>>>> //more configuration...
>>>> JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes do not
>>>> start up.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Amar Kamat <am...@yahoo-inc.com>.
Deyaa Adranale wrote:
> I have checked the code of JobControl; it submits a set of jobs
> asynchronously and provides methods for checking their status,
> suspending them, and so on.
It also supports job dependencies. A particular job can depend on other
jobs and hence it supports chaining. *JobControl* accepts *Job* which
internally has a list of jobs it depends on.
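In code, that dependency wiring might look roughly like the following sketch against the classic org.apache.hadoop.mapred.jobcontrol API. Here conf1 and conf2 stand for two fully configured JobConf objects, with the second job's input path pointing at the first job's output path; the class and group names are illustrative.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainWithJobControl {
    public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
        Job first = new Job(conf1);
        Job second = new Job(conf2);
        second.addDependingJob(first); // second starts only after first succeeds

        JobControl control = new JobControl("chain");
        control.addJob(first);
        control.addJob(second);

        // JobControl is a Runnable: run it in its own thread and poll
        // until every job in the group has finished (or failed).
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}
```

This needs a running Hadoop cluster (or local job runner) to actually execute, so treat it as a shape to follow rather than a drop-in program.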
Amar
>
> i think what Mori means by chaining jobs is to execute them one after
> another, so this class might not help him.
> i have run chained jobs like Mori's code (even with a for loop and a
> call to runJob inside it). In my case, I can't use the JobControl,
> because every job needs information from the output of the previous
> job, so they have to be chained.
> Till now, I have never encountered problems when running chained jobs,
> although I have not tested it with datasets larger than a few hundred KB.
>
> hope this helps,
> Deyaa
>
> Lukas Vlcek wrote:
>> Hi,
>>
>> Maybe you should take a look at JobControl (see TestJobControl.java for
>> a concrete example).
>>
>> Regards,
>> Lukas
>>
>> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com>
>> wrote:
>>
>>
>>> Hey all,
>>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>>> complex task. I believe that the way to do it is as follows:
>>>
>>> JobConf conf = new JobConf(getConf(), MyClass.class);
>>> //configure job.... set mappers, reducers, etc
>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>> JobClient.runJob(conf);
>>>
>>> //new job
>>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>>> //more configuration...
>>> JobClient.runJob(conf2)
>>>
>>> Is this the canonical way to chain jobs? I'm having some trouble
>>> with this
>>> method -- for especially long jobs, the latter MR tasks sometimes do
>>> not
>>> start up.
>>>
>>>
>>
>>
>>
>>
Re: How to chain multiple hadoop jobs?
Posted by Deyaa Adranale <de...@iais.fraunhofer.de>.
I have checked the code of JobControl; it submits a set of jobs
asynchronously and provides methods for checking their status, suspending
them, and so on.
I think what Mori means by chaining jobs is to execute them one after
another, so this class might not help him.
I have run chained jobs like Mori's code (even with a for loop and a
call to runJob inside it). In my case, I can't use the JobControl,
because every job needs information from the output of the previous job,
so they have to be chained.
Till now, I have never encountered problems when running chained jobs,
although I have not tested it with datasets larger than a few hundred KB.
hope this helps,
Deyaa
Lukas Vlcek wrote:
> Hi,
>
> Maybe you should take a look at JobControl (see TestJobControl.java for
> a concrete example).
>
> Regards,
> Lukas
>
> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com> wrote:
>
>
>> Hey all,
>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>> complex task. I believe that the way to do it is as follows:
>>
>> JobConf conf = new JobConf(getConf(), MyClass.class);
>> //configure job.... set mappers, reducers, etc
>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>> JobClient.runJob(conf);
>>
>> //new job
>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>> //more configuration...
>> JobClient.runJob(conf2)
>>
>> Is this the canonical way to chain jobs? I'm having some trouble with this
>> method -- for especially long jobs, the latter MR tasks sometimes do not
>> start up.
>>
>>
>
>
>
>
Re: help with hadoop program
Posted by Mori Bellamy <mb...@apple.com>.
It seems like this problem could be done with one map-reduce job.
From your input, map out (ID, {type, timestamp}).
In your reduce, you can figure out how many A1s appear close to
each other. One naive approach is to iterate through all of the values
and collect them in some collection class. Then, if your custom record
class implements Comparable, you can just call
Collections.sort(myList). I'm sure there are faster solutions (perhaps
you could sort them as you iterate through by hashing based on
timestamp?).
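Once the records for one ID are sorted by timestamp, the 5-second matching is a linear scan. Here is a plain-Java sketch of that reduce-side pass; the Event class, method names, and the reading of the problem as "count each Y within 5 seconds after an X" are my assumptions, not something from the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WindowCounter {

    // One parsed record for a single ID: an event type ("X" or "Y")
    // and a Unix timestamp in seconds.
    static class Event {
        final String type;
        final long timestamp;
        Event(String type, long timestamp) {
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    // Counts Y events that occur within windowSeconds AFTER the most
    // recent X event, mirroring what a reducer would do with the
    // values it receives for one ID.
    static int countYNearX(List<Event> events, long windowSeconds) {
        List<Event> sorted = new ArrayList<Event>(events);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        int count = 0;
        long lastX = Long.MIN_VALUE;
        for (Event e : sorted) {
            if (e.type.equals("X")) {
                lastX = e.timestamp;
            } else if (e.type.equals("Y")
                    && lastX != Long.MIN_VALUE
                    && e.timestamp - lastX <= windowSeconds) {
                count++;
            }
        }
        return count;
    }
}
```

On the sample data in this thread, the A1 pair (X at 1215647404, Y at 1215647409) lands exactly on the 5-second boundary, so it would be counted.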
does this answer your question?
On Jul 9, 2008, at 4:59 PM, Elia Mazzawi wrote:
> can someone point me to an example i can learn from.
>
> I have a data set that looks like this:
>
> ID type Timestamp
>
> A1 X 1215647404
> A2 X 1215647405
> A3 X 1215647406
> A1 Y 1215647409
>
> I want to count how many A1 Y, show up within 5 seconds of an A1 X
>
> I've written a few hadoop programs already but they were based on the
> wordcount example. and so only work with 1 line at a time.
> This problem requires looking back or remembering state? or more than
> one pass?
> I was thinking that it is possible to sort the data by ID, timestamp.
> then in that case the program only needs to look back a few lines at
> a time?
>
> seems like a common problem so i thought I'd ask if there was an
> example
> that is close to that or if someone has written something already.
>
> P.S. Hadoop Rocks!
help with hadoop program
Posted by Elia Mazzawi <el...@casalemedia.com>.
Can someone point me to an example I can learn from?
I have a data set that looks like this:
ID type Timestamp
A1 X 1215647404
A2 X 1215647405
A3 X 1215647406
A1 Y 1215647409
I want to count how many A1 Y records show up within 5 seconds of an A1 X.
I've written a few hadoop programs already, but they were based on the
wordcount example and so only work with one line at a time.
This problem requires looking back or remembering state, or more than
one pass.
I was thinking that it is possible to sort the data by ID and timestamp;
in that case the program only needs to look back a few lines at a time.
This seems like a common problem, so I thought I'd ask if there is an
example close to it, or if someone has written something already.
P.S. Hadoop Rocks!
Re: How to chain multiple hadoop jobs?
Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,
Maybe you should take a look at JobControl (see TestJobControl.java for
a concrete example).
Regards,
Lukas
On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com> wrote:
> Hey all,
> I'm trying to chain multiple mapreduce jobs together to accomplish a
> complex task. I believe that the way to do it is as follows:
>
> JobConf conf = new JobConf(getConf(), MyClass.class);
> //configure job.... set mappers, reducers, etc
> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
> JobClient.runJob(conf);
>
> //new job
> JobConf conf2 = new JobConf(getConf(),MyClass.class)
> SequenceFileInputFormat.setInputPath(conf,myPath1);
> //more configuration...
> JobClient.runJob(conf2)
>
> Is this the canonical way to chain jobs? I'm having some trouble with this
> method -- for especially long jobs, the latter MR tasks sometimes do not
> start up.
>
--
http://blog.lukas-vlcek.com/
RE: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Thanks for all of the help... Here is what I am working with:
1. I do use Eclipse to run the jar... There is an option in the Hadoop
plugin for Eclipse to run applications, so maybe that is causing the problem
2. I am not really updating any Hadoop conf params... Here is what I am
doing:
class TestDriver extends Configured implements Tool {
    public static JobConf conf;

    public int run(String[] args) {
        JobClient client = new JobClient();
        client.setConf(conf);
        while (blah blah) {
            try {
                JobClient.runJob(conf);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return 1;
    }

    public static void main(String[] args) throws Exception {
        conf = new JobConf(myclass.class);
        // Set output formats
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(LongWritable.class);
        // Set input format
        conf.setInputFormat(org.superres.TrainingInputFormat.class);
        Path output_path = new Path("out");
        FileOutputFormat.setOutputPath(conf, output_path);
        // Set input path
        TrainingInputFormat.setInputPaths(conf, new Path("input"));
        // Setup Hadoop classes to be used
        conf.setMapperClass(org.superres.TestMap.class);
        conf.setCombinerClass(org.superres.TestReduce.class);
        conf.setReducerClass(org.superres.TestReduce.class);
        ToolRunner.run(conf, new TestDriver(), args);
    }
}
So yes the main method is in the class used to "drive" the Hadoop program,
but as far as modifying the configuration, I don't think I am doing that
because it is actually set up in the main method.
Cheers,
Sean
Goel, Ankur wrote:
>
> Hadoop typically complains if you try to re-use a JobConf object by
> modifying job parameters (Mapper, Reducer, output path, etc.) and
> re-submitting it to the job client. You should create a new JobConf
> object for every map-reduce job, and if there are parameters that
> should be copied from the previous job, then you should do:
>
> JobConf newJob = new JobConf(oldJob, MyClass.class);
> ... (your changes to newJob) ...
> JobClient.runJob(newJob);
>
> This works for me.
>
> -----Original Message-----
> From: Mori Bellamy [mailto:mbellamy@apple.com]
> Sent: Tuesday, July 15, 2008 4:27 AM
> To: core-user@hadoop.apache.org
> Subject: Re: How to chain multiple hadoop jobs?
>
> Weird. I use eclipse, but that's never happened to me. When you set up
> your JobConfs, for example:
> JobConf conf2 = new JobConf(getConf(),MyClass.class) is your "MyClass"
> in the same package as your driver program? also, do you run from
> eclipse or from the command line (i've never tried to launch a hadoop
> task from eclipse). if you run from the command line:
>
> hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
>
> and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
> see what the problem would be. if this is the way you run a hadoop task,
> are you sure that all of the resources are getting compiled into the
> same jar? when you export a jar from eclipse, it won't pack up external
> resources by default. (look into addons like FatJAR for that).
>
>
> On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
>>
>> Well that's what I need to do also... but Hadoop complains to me when
>> I attempt to do that. Are you using Eclipse by any chance to develop?
>> The
>> error I'm getting seems to be stemming from the fact that Hadoop
>> thinks I am uploading a new jar for EVERY execution of
>> JobClient.runJob() so it fails indicating the job jar file doesn't
>> exist. Did you have to turn something on/off to get it to ignore that
>> or are you using a different IDE?
>> Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Mori Bellamy wrote:
>>>
>>> hey sean,
>>>
>>> i later learned that the method i originally posted (configuring
>>> different JobConfs and then running them, blocking style, with
>>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>>> was failing before was somehow my fault and the bugs somehow got
>>> fixed x_X.
>>>
>>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>>> the hadoop source directory). it seems like this would be helpful if
>>> your job dependencies are complex. but for me, i just need to do one
>>> job after another (and every job only depends on the one right before
>
>>> it), so the code i originally posted works fine.
>>
>
>
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18466505.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
RE: How to chain multiple hadoop jobs?
Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hadoop typically complains if you try to re-use a JobConf object by
modifying job parameters (Mapper, Reducer, output path, etc.) and
re-submitting it to the job client. You should create a new JobConf
object for every map-reduce job, and if there are parameters that
should be copied from the previous job, then you should do:
JobConf newJob = new JobConf(oldJob, MyClass.class);
... (your changes to newJob) ...
JobClient.runJob(newJob);
This works for me.
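Put together, a minimal two-job chain along those lines might look like this sketch. The paths are illustrative, and the identity mapper/reducer from org.apache.hadoop.mapred.lib stand in for real job logic; the point is that each step gets its own freshly constructed JobConf.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStepChain {
    public static void main(String[] args) throws Exception {
        Path intermediate = new Path("step1-out");

        JobConf step1 = new JobConf(TwoStepChain.class);
        step1.setMapperClass(IdentityMapper.class);   // placeholder logic
        step1.setReducerClass(IdentityReducer.class); // placeholder logic
        FileInputFormat.setInputPaths(step1, new Path(args[0]));
        FileOutputFormat.setOutputPath(step1, intermediate);
        JobClient.runJob(step1); // blocks until step 1 completes

        // A brand-new JobConf for the second job: nothing is re-used,
        // it simply reads what the first job wrote.
        JobConf step2 = new JobConf(TwoStepChain.class);
        step2.setMapperClass(IdentityMapper.class);
        step2.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(step2, intermediate);
        FileOutputFormat.setOutputPath(step2, new Path(args[1]));
        JobClient.runJob(step2);
    }
}
```

Like any MapReduce driver, this only runs against a Hadoop installation, so read it as a template rather than a standalone program.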
-----Original Message-----
From: Mori Bellamy [mailto:mbellamy@apple.com]
Sent: Tuesday, July 15, 2008 4:27 AM
To: core-user@hadoop.apache.org
Subject: Re: How to chain multiple hadoop jobs?
Weird. I use eclipse, but that's never happened to me. When you set up
your JobConfs, for example:
JobConf conf2 = new JobConf(getConf(),MyClass.class) is your "MyClass"
in the same package as your driver program? also, do you run from
eclipse or from the command line (i've never tried to launch a hadoop
task from eclipse). if you run from the command line:
hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
see what the problem would be. if this is the way you run a hadoop task,
are you sure that all of the resources are getting compiled into the
same jar? when you export a jar from eclipse, it won't pack up external
resources by default. (look into addons like FatJAR for that).
On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
> Well that's what I need to do also... but Hadoop complains to me when
> I attempt to do that. Are you using Eclipse by any chance to develop?
> The
> error I'm getting seems to be stemming from the fact that Hadoop
> thinks I am uploading a new jar for EVERY execution of
> JobClient.runJob() so it fails indicating the job jar file doesn't
> exist. Did you have to turn something on/off to get it to ignore that
> or are you using a different IDE?
> Thanks!
>
> Cheers,
> Sean
>
>
> Mori Bellamy wrote:
>>
>> hey sean,
>>
>> i later learned that the method i originally posted (configuring
>> different JobConfs and then running them, blocking style, with
>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>> was failing before was somehow my fault and the bugs somehow got
>> fixed x_X.
>>
>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>> the hadoop source directory). it seems like this would be helpful if
>> your job dependencies are complex. but for me, i just need to do one
>> job after another (and every job only depends on the one right before
>> it), so the code i originally posted works fine.
>> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>>
>>>
>>> Could you please provide some small code snippets elaborating on how
>>> you implemented that? I have a similar need as the author of this
>>> thread and I would appreciate any help. Thanks!
>>>
>>> Cheers,
>>> Sean
>>>
>>>
>>> Joman Chu-2 wrote:
>>>>
>>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>>> work well. I've run sequences involving hundreds of MapReduce jobs
>>>> in a for loop and it hasn't died on me yet.
>>>>
>>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>>> accomplish a complex task. I believe that the way to do it is as
>>>>> follows:
>>>>>
>>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>>> job....
>>>>> set mappers, reducers, etc
>>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>>> JobClient.runJob(conf);
>>>>>
>>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>>> configuration... JobClient.runJob(conf2)
>>>>>
>>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>>> with this method -- for especially long jobs, the latter MR tasks
>>>>> sometimes do not start up.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Joman Chu
>>>> AIM: ARcanUSNUMquam
>>>> IRC: irc.liquid-silver.net
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Mori Bellamy <mb...@apple.com>.
Weird. I use eclipse, but that's never happened to me. When you set
up your JobConfs, for example:
JobConf conf2 = new JobConf(getConf(),MyClass.class)
is your "MyClass" in the same package as your driver program? also, do
you run from eclipse or from the command line (i've never tried to
launch a hadoop task from eclipse). if you run from the command line:
hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
see what the problem would be. if this is the way you run a hadoop
task, are you sure that all of the resources are getting compiled into
the same jar? when you export a jar from eclipse, it won't pack up
external resources by default. (look into addons like FatJAR for that).
On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
> Well that's what I need to do also... but Hadoop complains to me
> when I
> attempt to do that. Are you using Eclipse by any chance to develop?
> The
> error I'm getting seems to be stemming from the fact that Hadoop
> thinks I am
> uploading a new jar for EVERY execution of JobClient.runJob() so it
> fails
> indicating the job jar file doesn't exist. Did you have to turn
> something
> on/off to get it to ignore that or are you using a different IDE?
> Thanks!
>
> Cheers,
> Sean
>
>
> Mori Bellamy wrote:
>>
>> hey sean,
>>
>> i later learned that the method i originally posted (configuring
>> different JobConfs and then running them, blocking style, with
>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>> was
>> failing before was somehow my fault and the bugs somehow got fixed
>> x_X.
>>
>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>> the hadoop source directory). it seems like this would be helpful if
>> your job dependencies are complex. but for me, i just need to do one
>> job after another (and every job only depends on the one right before
>> it), so the code i originally posted works fine.
>> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>>
>>>
>>> Could you please provide some small code snippets elaborating on how
>>> you
>>> implemented that? I have a similar need as the author of this thread
>>> and I
>>> would appreciate any help. Thanks!
>>>
>>> Cheers,
>>> Sean
>>>
>>>
>>> Joman Chu-2 wrote:
>>>>
>>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>>> work
>>>> well. I've run sequences involving hundreds of MapReduce jobs in a
>>>> for
>>>> loop and it hasn't died on me yet.
>>>>
>>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>>> accomplish a complex task. I believe that the way to do it is as
>>>>> follows:
>>>>>
>>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>>> job....
>>>>> set mappers, reducers, etc
>>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>>> JobClient.runJob(conf);
>>>>>
>>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>>> configuration... JobClient.runJob(conf2)
>>>>>
>>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>>> with
>>>>> this
>>>>> method -- for especially long jobs, the latter MR tasks sometimes
>>>>> do not
>>>>> start up.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Joman Chu
>>>> AIM: ARcanUSNUMquam
>>>> IRC: irc.liquid-silver.net
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Well that's what I need to do also... but Hadoop complains to me when I
attempt to do that. Are you using Eclipse by any chance to develop? The
error I'm getting seems to be stemming from the fact that Hadoop thinks I am
uploading a new jar for EVERY execution of JobClient.runJob() so it fails
indicating the job jar file doesn't exist. Did you have to turn something
on/off to get it to ignore that or are you using a different IDE? Thanks!
Cheers,
Sean
Mori Bellamy wrote:
>
> hey sean,
>
> i later learned that the method i originally posted (configuring
> different JobConfs and then running them, blocking style, with
> JobClient.runJob(conf)) was sufficient for my needs. the reason it was
> failing before was somehow my fault and the bugs somehow got fixed x_X.
>
> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
> the hadoop source directory). it seems like this would be helpful if
> your job dependencies are complex. but for me, i just need to do one
> job after another (and every job only depends on the one right before
> it), so the code i originally posted works fine.
> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>
>>
>> Could you please provide some small code snippets elaborating on how
>> you
>> implemented that? I have a similar need as the author of this thread
>> and I
>> would appreciate any help. Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Joman Chu-2 wrote:
>>>
>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>> work
>>> well. I've run sequences involving hundreds of MapReduce jobs in a
>>> for
>>> loop and it hasn't died on me yet.
>>>
>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>> accomplish a complex task. I believe that the way to do it is as
>>>> follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>> job....
>>>> set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>> configuration... JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>> with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes
>>>> do not
>>>> start up.
>>>>
>>>>
>>>
>>>
>>> --
>>> Joman Chu
>>> AIM: ARcanUSNUMquam
>>> IRC: irc.liquid-silver.net
>>>
>>>
>>>
>>
>
>
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18453200.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: How to chain multiple hadoop jobs?
Posted by Mori Bellamy <mb...@apple.com>.
hey sean,
i later learned that the method i originally posted (configuring
different JobConfs and then running them, blocking style, with
JobClient.runJob(conf)) was sufficient for my needs. the reason it was
failing before was somehow my fault and the bugs somehow got fixed x_X.
Lukas gave me a helpful reply pointing me to TestJobControl.java (in
the hadoop source directory). it seems like this would be helpful if
your job dependencies are complex. but for me, i just need to do one
job after another (and every job only depends on the one right before
it), so the code i originally posted works fine.
On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>
> Could you please provide some small code snippets elaborating on how
> you
> implemented that? I have a similar need as the author of this thread
> and I
> would appreciate any help. Thanks!
>
> Cheers,
> Sean
>
>
> Joman Chu-2 wrote:
>>
>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>> work
>> well. I've run sequences involving hundreds of MapReduce jobs in a
>> for
>> loop and it hasn't died on me yet.
>>
>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>> accomplish a complex task. I believe that the way to do it is as
>>> follows:
>>>
>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>> job....
>>> set mappers, reducers, etc
>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>> JobClient.runJob(conf);
>>>
>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>> configuration... JobClient.runJob(conf2)
>>>
>>> Is this the canonical way to chain jobs? I'm having some trouble
>>> with
>>> this
>>> method -- for especially long jobs, the latter MR tasks sometimes
>>> do not
>>> start up.
>>>
>>>
>>
>>
>> --
>> Joman Chu
>> AIM: ARcanUSNUMquam
>> IRC: irc.liquid-silver.net
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Here is some more complete sample code that is based on my own MapReduce jobs.
//import lots of things
public class MyMapReduceTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyMapReduceTool.class);
        conf.setJobName("SomeName");
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
        // basically i use only sequence files for i/o in most of my jobs
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressionType(CompressionType.BLOCK);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            CompressionType.BLOCK);
        // args parsing
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        conf.setInputPath(in);
        conf.setOutputPath(out);
        // any other config things you might want to do
        JobClient.runJob(conf);
        return 0;
    }

    public static class MapClass extends MapReduceBase implements
            Mapper<Text, Text, Text, Text> {
        public void configure(JobConf job) { // optional method
            // stuff goes here
        }
        public void map(Text key, Text value, OutputCollector<Text, Text>
                output, Reporter reporter) throws IOException {
            // some stuff here
        }
        public void close() { // optional method
            // some stuff here
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {
        public void configure(JobConf job) { // optional method
            // stuff goes here
        }
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // stuff goes here
        }
        public void close() { // this method is optional
            // stuff goes here
        }
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyMapReduceTool(),
            args);
        System.exit(res);
    }
}
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Mon, Jul 14, 2008 at 5:46 PM, Joman Chu <jo...@andrew.cmu.edu> wrote:
> Hi, I don't have the code sitting in front of me at the moment, but
> I'll do some of it from memory and I'll post a real snippet tomorrow
> night. Hopefully, this can get you started
>
> public class MyMainClass {
> public static void main(String[] args) {
> ToolRunner.run(new Configuration(), new ClassThatImplementsTool(), args);
> //make sure you see the API for other trickiness you can do.
> }
> }
>
> public class ClassThatImplementsTool implements Tool {
> public int run(String[] args) {
> //this method gets called by ToolRunner.run
> //do all sorts of configuration here
> //ie, set your Map, Combine, Reduce class
> //look at the Configuration class API
> }
> }
>
> The main thing to know is that ToolRunner.run() will call your
> class's run() method.
>
> Joman Chu
> AIM: ARcanUSNUMquam
> IRC: irc.liquid-silver.net
>
>
> On Mon, Jul 14, 2008 at 4:38 PM, Sean Arietta <sa...@virginia.edu> wrote:
>>
>> Could you please provide some small code snippets elaborating on how you
>> implemented that? I have a similar need as the author of this thread and I
>> would appreciate any help. Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Joman Chu-2 wrote:
>>>
>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work
>>> well. I've run sequences involving hundreds of MapReduce jobs in a for
>>> loop and it hasn't died on me yet.
>>>
>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>> accomplish a complex task. I believe that the way to do it is as follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job....
>>>> set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>> configuration... JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes do not
>>>> start up.
>>>>
>>>>
>>>
>>>
>>> --
>>> Joman Chu
>>> AIM: ARcanUSNUMquam
>>> IRC: irc.liquid-silver.net
>>>
>>>
>>>
>>
>> --
>> View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Hi, I don't have the code sitting in front of me at the moment, but
I'll do some of it from memory and I'll post a real snippet tomorrow
night. Hopefully, this can get you started.
public class MyMainClass {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new ClassThatImplementsTool(), args);
    // make sure you see the API for other trickiness you can do
  }
}

public class ClassThatImplementsTool implements Tool {
  public int run(String[] args) {
    // this method gets called by ToolRunner.run
    // do all sorts of configuration here
    // i.e., set your Map, Combine, Reduce classes
    // look at the Configuration class API
    return 0;
  }
}
The main thing to know is that ToolRunner.run() will call your
class's run() method.
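That callback shape is easy to see without Hadoop at all. The stand-alone sketch below mimics it with made-up SimpleTool/SimpleRunner names (they are not Hadoop classes), just to show how a runner hands control back to your run() method:

```java
// Illustrative, Hadoop-free sketch of the delegation pattern:
// a runner invokes your class's run(), the way ToolRunner.run()
// invokes Tool.run(). All names here are invented for the demo.
interface SimpleTool {
    int run(String[] args);
}

class SimpleRunner {
    static int run(SimpleTool tool, String[] args) {
        // a real ToolRunner would also parse generic options here
        return tool.run(args);
    }
}

public class DelegationDemo implements SimpleTool {
    public int run(String[] args) {
        // job configuration would happen here in the Hadoop case
        System.out.println("run() called with " + args.length + " args");
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = SimpleRunner.run(new DelegationDemo(), new String[]{"a", "b"});
        System.exit(res);
    }
}
```

Chaining jobs then just means calling the blocking job-submission API several times inside run(), in order.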
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Mon, Jul 14, 2008 at 4:38 PM, Sean Arietta <sa...@virginia.edu> wrote:
>
> Could you please provide some small code snippets elaborating on how you
> implemented that? I have a similar need as the author of this thread and I
> would appreciate any help. Thanks!
>
> Cheers,
> Sean
>
Re: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Could you please provide some small code snippets elaborating on how you
implemented that? I have a similar need as the author of this thread and I
would appreciate any help. Thanks!
Cheers,
Sean
Joman Chu-2 wrote:
>
> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work
> well. I've run sequences involving hundreds of MapReduce jobs in a for
> loop and it hasn't died on me yet.
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work well. I've run sequences involving hundreds of MapReduce jobs in a for loop and it hasn't died on me yet.
On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
> Hey all, I'm trying to chain multiple mapreduce jobs together to
> accomplish a complex task. I believe that the way to do it is as follows:
>
> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job....
> set mappers, reducers, etc
> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
> JobClient.runJob(conf);
>
> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
> configuration... JobClient.runJob(conf2)
>
> Is this the canonical way to chain jobs? I'm having some trouble with this
> method -- for especially long jobs, the latter MR tasks sometimes do not
> start up.
>
>
--
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
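One detail worth flagging in the snippet quoted above: the second call passes conf rather than conf2 to setInputPath, so the second job never receives its input path. A corrected sketch of the sequential pattern (old mapred API; MyClass and myPath1 are from the quote, everything else is a placeholder):

```java
// Inside a Tool's run() method, sketching the corrected sequence.
JobConf conf = new JobConf(getConf(), MyClass.class);
// ... configure job: set mappers, reducers, key/value classes ...
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(conf, myPath1);
JobClient.runJob(conf); // runJob() blocks, so the chain is naturally sequential

JobConf conf2 = new JobConf(getConf(), MyClass.class);
conf2.setInputFormat(SequenceFileInputFormat.class);
SequenceFileInputFormat.setInputPath(conf2, myPath1); // conf2, not conf
// ... more configuration ...
JobClient.runJob(conf2);
```

A misconfiguration like this would also explain later jobs in a long chain appearing never to start, since the second job is being pointed at the wrong (or no) input.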