Posted to common-user@hadoop.apache.org by Mori Bellamy <mb...@apple.com> on 2008/07/09 22:28:21 UTC
How to chain multiple hadoop jobs?
Hey all,
I'm trying to chain multiple mapreduce jobs together to accomplish a
complex task. I believe that the way to do it is as follows:
JobConf conf = new JobConf(getConf(), MyClass.class);
// configure job: set mappers, reducers, etc.
SequenceFileOutputFormat.setOutputPath(conf, myPath1);
JobClient.runJob(conf);
// second job, which reads the first job's output
JobConf conf2 = new JobConf(getConf(), MyClass.class);
SequenceFileInputFormat.setInputPath(conf2, myPath1);
// more configuration...
JobClient.runJob(conf2);
Is this the canonical way to chain jobs? I'm having some trouble with
this method -- for especially long jobs, the latter MR tasks sometimes
do not start up.
Re: How to chain multiple hadoop jobs?
Posted by tim robertson <ti...@gmail.com>.
Have you considered http://www.cascading.org?
On Thu, Jul 10, 2008 at 10:44 AM, Amar Kamat <am...@yahoo-inc.com> wrote:
> Deyaa Adranale wrote:
>
>> I have checked the code of JobControl; it submits a set of jobs asynchronously
>> and provides methods for checking their status, suspending them, and so on.
>>
> It also supports job dependencies. A particular job can depend on other
> jobs and hence it supports chaining. *JobControl* accepts *Job* which
> internally has a list of jobs it depends on.
> Amar
>
>
>> i think what Mori means by chaining jobs is to execute them one after
>> another, so this class might not help him.
>> i have run chained jobs like Mori's code (even with a for loop and a call
>> to runJob inside it). In my case, I can't use the JobControl, because every
>> job needs information from the output of the previous job, so they have to
>> be chained.
>> Till now, I have never encountered problems when running chained jobs,
>> although I have not tested it with datasets larger than a few hundred KB.
>>
>> hope this helps,
>> Deyaa
>>
>> Lukas Vlcek wrote:
>>
>>> Hi,
>>>
>>> Maybe you should take a look at JobControl (see TestJobControl.java for
>>> a concrete example).
>>>
>>> Regards,
>>> Lukas
>>>
>>> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com>
>>> wrote:
>>>
>>>
>>>
>>>> Hey all,
>>>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>>>> complex task. I believe that the way to do it is as follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class);
>>>> //configure job.... set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job
>>>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>>>> //more configuration...
>>>> JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes do not
>>>> start up.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Amar Kamat <am...@yahoo-inc.com>.
Deyaa Adranale wrote:
> I have checked the code of JobControl; it submits a set of jobs
> asynchronously and provides methods for checking their status,
> suspending them, and so on.
It also supports job dependencies. A particular job can depend on other
jobs and hence it supports chaining. *JobControl* accepts *Job* which
internally has a list of jobs it depends on.
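In code, that dependency wiring might look roughly like the following sketch against the classic org.apache.hadoop.mapred.jobcontrol API. Here conf1 and conf2 stand for two fully configured JobConf objects, with the second job's input path pointing at the first job's output path; the class and group names are illustrative.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainWithJobControl {
    public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
        Job first = new Job(conf1);
        Job second = new Job(conf2);
        second.addDependingJob(first); // second starts only after first succeeds

        JobControl control = new JobControl("chain");
        control.addJob(first);
        control.addJob(second);

        // JobControl is a Runnable: run it in its own thread and poll
        // until every job in the group has finished (or failed).
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}
```

This needs a running Hadoop cluster (or local job runner) to actually execute, so treat it as a shape to follow rather than a drop-in program.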
Amar
>
> i think what Mori means by chaining jobs is to execute them one after
> another, so this class might not help him.
> i have run chained jobs like Mori's code (even with a for loop and a
> call to runJob inside it). In my case, I can't use the JobControl,
> because every job needs information from the output of the previous
> job, so they have to be chained.
> Till now, I have never encountered problems when running chained jobs,
> although I have not tested it with datasets larger than a few hundred KB.
>
> hope this helps,
> Deyaa
>
> Lukas Vlcek wrote:
>> Hi,
>>
>> Maybe you should take a look at JobControl (see TestJobControl.java for
>> a concrete example).
>>
>> Regards,
>> Lukas
>>
>> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com>
>> wrote:
>>
>>
>>> Hey all,
>>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>>> complex task. I believe that the way to do it is as follows:
>>>
>>> JobConf conf = new JobConf(getConf(), MyClass.class);
>>> //configure job.... set mappers, reducers, etc
>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>> JobClient.runJob(conf);
>>>
>>> //new job
>>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>>> //more configuration...
>>> JobClient.runJob(conf2)
>>>
>>> Is this the canonical way to chain jobs? I'm having some trouble
>>> with this
>>> method -- for especially long jobs, the latter MR tasks sometimes do
>>> not
>>> start up.
>>>
>>>
>>
>>
>>
>>
Re: How to chain multiple hadoop jobs?
Posted by Deyaa Adranale <de...@iais.fraunhofer.de>.
I have checked the code of JobControl; it submits a set of jobs
asynchronously and provides methods for checking their status, suspending
them, and so on.
I think what Mori means by chaining jobs is to execute them one after
another, so this class might not help him.
I have run chained jobs like Mori's code (even with a for loop and a
call to runJob inside it). In my case, I can't use the JobControl,
because every job needs information from the output of the previous job,
so they have to be chained.
Till now, I have never encountered problems when running chained jobs,
although I have not tested it with datasets larger than a few hundred KB.
hope this helps,
Deyaa
Lukas Vlcek wrote:
> Hi,
>
> Maybe you should take a look at JobControl (see TestJobControl.java for
> a concrete example).
>
> Regards,
> Lukas
>
> On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com> wrote:
>
>
>> Hey all,
>> I'm trying to chain multiple mapreduce jobs together to accomplish a
>> complex task. I believe that the way to do it is as follows:
>>
>> JobConf conf = new JobConf(getConf(), MyClass.class);
>> //configure job.... set mappers, reducers, etc
>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>> JobClient.runJob(conf);
>>
>> //new job
>> JobConf conf2 = new JobConf(getConf(),MyClass.class)
>> SequenceFileInputFormat.setInputPath(conf,myPath1);
>> //more configuration...
>> JobClient.runJob(conf2)
>>
>> Is this the canonical way to chain jobs? I'm having some trouble with this
>> method -- for especially long jobs, the latter MR tasks sometimes do not
>> start up.
>>
>>
>
>
>
>
Re: help with hadoop program
Posted by Mori Bellamy <mb...@apple.com>.
It seems like this problem could be done with one map-reduce job.
From your input, map out (ID, {type, timestamp}).
In your reduce, you can figure out how many A1s appear close to
each other. One naive approach is to iterate through all of the values
and collect them in some collection class. Then, if your custom record
class implements Comparable, you can just call
Collections.sort(myList). I'm sure there are faster solutions (perhaps
you could sort them as you iterate through by hashing based on
timestamp?).
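Once the records for one ID are sorted by timestamp, the 5-second matching is a linear scan. Here is a plain-Java sketch of that reduce-side pass; the Event class, method names, and the reading of the problem as "count each Y within 5 seconds after an X" are my assumptions, not something from the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WindowCounter {

    // One parsed record for a single ID: an event type ("X" or "Y")
    // and a Unix timestamp in seconds.
    static class Event {
        final String type;
        final long timestamp;
        Event(String type, long timestamp) {
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    // Counts Y events that occur within windowSeconds AFTER the most
    // recent X event, mirroring what a reducer would do with the
    // values it receives for one ID.
    static int countYNearX(List<Event> events, long windowSeconds) {
        List<Event> sorted = new ArrayList<Event>(events);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        int count = 0;
        long lastX = Long.MIN_VALUE;
        for (Event e : sorted) {
            if (e.type.equals("X")) {
                lastX = e.timestamp;
            } else if (e.type.equals("Y")
                    && lastX != Long.MIN_VALUE
                    && e.timestamp - lastX <= windowSeconds) {
                count++;
            }
        }
        return count;
    }
}
```

On the sample data in this thread, the A1 pair (X at 1215647404, Y at 1215647409) lands exactly on the 5-second boundary, so it would be counted.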
does this answer your question?
On Jul 9, 2008, at 4:59 PM, Elia Mazzawi wrote:
> can someone point me to an example i can learn from.
>
> I have a data set that looks like this:
>
> ID type Timestamp
>
> A1 X 1215647404
> A2 X 1215647405
> A3 X 1215647406
> A1 Y 1215647409
>
> I want to count how many A1 Y, show up within 5 seconds of an A1 X
>
> I've written a few hadoop programs already but they were based on the
> wordcount example. and so only work with 1 line at a time.
> This problem requires looking back or remembering state? or more than
> one pass?
> I was thinking that it is possible to sort the data by ID, timestamp.
> then in that case the program only needs to look back a few lines at
> a time?
>
> seems like a common problem so i thought I'd ask if there was an
> example
> that is close to that or if someone has written something already.
>
> P.S. Hadoop Rocks!
help with hadoop program
Posted by Elia Mazzawi <el...@casalemedia.com>.
Can someone point me to an example I can learn from?
I have a data set that looks like this:
ID type Timestamp
A1 X 1215647404
A2 X 1215647405
A3 X 1215647406
A1 Y 1215647409
I want to count how many A1 Y records show up within 5 seconds of an A1 X.
I've written a few hadoop programs already, but they were based on the
wordcount example and so only work with one line at a time.
This problem requires looking back or remembering state, or more than
one pass.
I was thinking that it is possible to sort the data by ID and timestamp;
in that case the program only needs to look back a few lines at a time.
This seems like a common problem, so I thought I'd ask if there is an
example close to it, or if someone has written something already.
P.S. Hadoop Rocks!
Re: How to chain multiple hadoop jobs?
Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,
Maybe you should take a look at JobControl (see TestJobControl.java for
a concrete example).
Regards,
Lukas
On Wed, Jul 9, 2008 at 10:28 PM, Mori Bellamy <mb...@apple.com> wrote:
> Hey all,
> I'm trying to chain multiple mapreduce jobs together to accomplish a
> complex task. I believe that the way to do it is as follows:
>
> JobConf conf = new JobConf(getConf(), MyClass.class);
> //configure job.... set mappers, reducers, etc
> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
> JobClient.runJob(conf);
>
> //new job
> JobConf conf2 = new JobConf(getConf(),MyClass.class)
> SequenceFileInputFormat.setInputPath(conf,myPath1);
> //more configuration...
> JobClient.runJob(conf2)
>
> Is this the canonical way to chain jobs? I'm having some trouble with this
> method -- for especially long jobs, the latter MR tasks sometimes do not
> start up.
>
--
http://blog.lukas-vlcek.com/
RE: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Thanks for all of the help... Here is what I am working with:
1. I do use Eclipse to run the jar... There is an option in the Hadoop
plugin for Eclipse to run applications, so maybe that is causing the problem
2. I am not really updating any Hadoop conf params... Here is what I am
doing:
class TestDriver extends Configured implements Tool {
    public static JobConf conf;

    public int run(String[] args) {
        JobClient client = new JobClient();
        client.setConf(conf);
        while (blah blah) {
            try {
                JobClient.runJob(conf);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return 1;
    }

    public static void main(String[] args) throws Exception {
        conf = new JobConf(myclass.class);
        // Set output formats
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(LongWritable.class);
        // Set input format
        conf.setInputFormat(org.superres.TrainingInputFormat.class);
        Path output_path = new Path("out");
        FileOutputFormat.setOutputPath(conf, output_path);
        // Set input path
        TrainingInputFormat.setInputPaths(conf, new Path("input"));
        // Setup Hadoop classes to be used
        conf.setMapperClass(org.superres.TestMap.class);
        conf.setCombinerClass(org.superres.TestReduce.class);
        conf.setReducerClass(org.superres.TestReduce.class);
        ToolRunner.run(conf, new TestDriver(), args);
    }
}
So yes the main method is in the class used to "drive" the Hadoop program,
but as far as modifying the configuration, I don't think I am doing that
because it is actually set up in the main method.
Cheers,
Sean
Goel, Ankur wrote:
>
> Hadoop typically complains if you try to re-use a JobConf object by
> modifying job parameters (Mapper, Reducer, output path, etc.) and
> re-submitting it to the job client. You should create a new JobConf
> object for every map-reduce job, and if there are parameters that
> should be copied from the previous job, then you should do:
>
> JobConf newJob = new JobConf(oldJob, MyClass.class);
> ... (your changes to newJob) ...
> JobClient.runJob(newJob);
>
> This works for me.
>
> -----Original Message-----
> From: Mori Bellamy [mailto:mbellamy@apple.com]
> Sent: Tuesday, July 15, 2008 4:27 AM
> To: core-user@hadoop.apache.org
> Subject: Re: How to chain multiple hadoop jobs?
>
> Weird. I use eclipse, but that's never happened to me. When you set up
> your JobConfs, for example:
> JobConf conf2 = new JobConf(getConf(),MyClass.class) is your "MyClass"
> in the same package as your driver program? also, do you run from
> eclipse or from the command line (i've never tried to launch a hadoop
> task from eclipse). if you run from the command line:
>
> hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
>
> and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
> see what the problem would be. if this is the way you run a hadoop task,
> are you sure that all of the resources are getting compiled into the
> same jar? when you export a jar from eclipse, it won't pack up external
> resources by default. (look into addons like FatJAR for that).
>
>
> On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
>>
>> Well that's what I need to do also... but Hadoop complains to me when
>> I attempt to do that. Are you using Eclipse by any chance to develop?
>> The
>> error I'm getting seems to be stemming from the fact that Hadoop
>> thinks I am uploading a new jar for EVERY execution of
>> JobClient.runJob() so it fails indicating the job jar file doesn't
>> exist. Did you have to turn something on/off to get it to ignore that
>> or are you using a different IDE?
>> Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Mori Bellamy wrote:
>>>
>>> hey sean,
>>>
>>> i later learned that the method i originally posted (configuring
>>> different JobConfs and then running them, blocking style, with
>>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>>> was failing before was somehow my fault and the bugs somehow got
>>> fixed x_X.
>>>
>>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>>> the hadoop source directory). it seems like this would be helpful if
>>> your job dependencies are complex. but for me, i just need to do one
>>> job after another (and every job only depends on the one right before
>
>>> it), so the code i originally posted works fine.
>>
>
>
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18466505.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
RE: How to chain multiple hadoop jobs?
Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hadoop typically complains if you try to re-use a JobConf object by
modifying job parameters (Mapper, Reducer, output path, etc.) and
re-submitting it to the job client. You should create a new JobConf
object for every map-reduce job, and if there are parameters that
should be copied from the previous job, then you should do:
JobConf newJob = new JobConf(oldJob, MyClass.class);
... (your changes to newJob) ...
JobClient.runJob(newJob);
This works for me.
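Put together, a minimal two-job chain along those lines might look like this sketch. The paths are illustrative, and the identity mapper/reducer from org.apache.hadoop.mapred.lib stand in for real job logic; the point is that each step gets its own freshly constructed JobConf.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStepChain {
    public static void main(String[] args) throws Exception {
        Path intermediate = new Path("step1-out");

        JobConf step1 = new JobConf(TwoStepChain.class);
        step1.setMapperClass(IdentityMapper.class);   // placeholder logic
        step1.setReducerClass(IdentityReducer.class); // placeholder logic
        FileInputFormat.setInputPaths(step1, new Path(args[0]));
        FileOutputFormat.setOutputPath(step1, intermediate);
        JobClient.runJob(step1); // blocks until step 1 completes

        // A brand-new JobConf for the second job: nothing is re-used,
        // it simply reads what the first job wrote.
        JobConf step2 = new JobConf(TwoStepChain.class);
        step2.setMapperClass(IdentityMapper.class);
        step2.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(step2, intermediate);
        FileOutputFormat.setOutputPath(step2, new Path(args[1]));
        JobClient.runJob(step2);
    }
}
```

Like any MapReduce driver, this only runs against a Hadoop installation, so read it as a template rather than a standalone program.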
-----Original Message-----
From: Mori Bellamy [mailto:mbellamy@apple.com]
Sent: Tuesday, July 15, 2008 4:27 AM
To: core-user@hadoop.apache.org
Subject: Re: How to chain multiple hadoop jobs?
Weird. I use eclipse, but that's never happened to me. When you set up
your JobConfs, for example:
JobConf conf2 = new JobConf(getConf(),MyClass.class) is your "MyClass"
in the same package as your driver program? also, do you run from
eclipse or from the command line (i've never tried to launch a hadoop
task from eclipse). if you run from the command line:
hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
see what the problem would be. if this is the way you run a hadoop task,
are you sure that all of the resources are getting compiled into the
same jar? when you export a jar from eclipse, it won't pack up external
resources by default. (look into addons like FatJAR for that).
On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
> Well that's what I need to do also... but Hadoop complains to me when
> I attempt to do that. Are you using Eclipse by any chance to develop?
> The
> error I'm getting seems to be stemming from the fact that Hadoop
> thinks I am uploading a new jar for EVERY execution of
> JobClient.runJob() so it fails indicating the job jar file doesn't
> exist. Did you have to turn something on/off to get it to ignore that
> or are you using a different IDE?
> Thanks!
>
> Cheers,
> Sean
>
>
> Mori Bellamy wrote:
>>
>> hey sean,
>>
>> i later learned that the method i originally posted (configuring
>> different JobConfs and then running them, blocking style, with
>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>> was failing before was somehow my fault and the bugs somehow got
>> fixed x_X.
>>
>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>> the hadoop source directory). it seems like this would be helpful if
>> your job dependencies are complex. but for me, i just need to do one
>> job after another (and every job only depends on the one right before
>> it), so the code i originally posted works fine.
>> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>>
>>>
>>> Could you please provide some small code snippets elaborating on how
>>> you implemented that? I have a similar need as the author of this
>>> thread and I would appreciate any help. Thanks!
>>>
>>> Cheers,
>>> Sean
>>>
>>>
>>> Joman Chu-2 wrote:
>>>>
>>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>>> work well. I've run sequences involving hundreds of MapReduce jobs
>>>> in a for loop and it hasn't died on me yet.
>>>>
>>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>>> accomplish a complex task. I believe that the way to do it is as
>>>>> follows:
>>>>>
>>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>>> job....
>>>>> set mappers, reducers, etc
>>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>>> JobClient.runJob(conf);
>>>>>
>>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>>> configuration... JobClient.runJob(conf2)
>>>>>
>>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>>> with this method -- for especially long jobs, the latter MR tasks
>>>>> sometimes do not start up.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Joman Chu
>>>> AIM: ARcanUSNUMquam
>>>> IRC: irc.liquid-silver.net
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Mori Bellamy <mb...@apple.com>.
Weird. I use eclipse, but that's never happened to me. When you set
up your JobConfs, for example:
JobConf conf2 = new JobConf(getConf(),MyClass.class)
is your "MyClass" in the same package as your driver program? also, do
you run from eclipse or from the command line (i've never tried to
launch a hadoop task from eclipse). if you run from the command line:
hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...
and all of the requisite resources are in MyMRTaskWrapper.jar, i don't
see what the problem would be. if this is the way you run a hadoop
task, are you sure that all of the resources are getting compiled into
the same jar? when you export a jar from eclipse, it won't pack up
external resources by default. (look into addons like FatJAR for that).
On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:
>
> Well that's what I need to do also... but Hadoop complains to me
> when I
> attempt to do that. Are you using Eclipse by any chance to develop?
> The
> error I'm getting seems to be stemming from the fact that Hadoop
> thinks I am
> uploading a new jar for EVERY execution of JobClient.runJob() so it
> fails
> indicating the job jar file doesn't exist. Did you have to turn
> something
> on/off to get it to ignore that or are you using a different IDE?
> Thanks!
>
> Cheers,
> Sean
>
>
> Mori Bellamy wrote:
>>
>> hey sean,
>>
>> i later learned that the method i originally posted (configuring
>> different JobConfs and then running them, blocking style, with
>> JobClient.runJob(conf)) was sufficient for my needs. the reason it
>> was
>> failing before was somehow my fault and the bugs somehow got fixed
>> x_X.
>>
>> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
>> the hadoop source directory). it seems like this would be helpful if
>> your job dependencies are complex. but for me, i just need to do one
>> job after another (and every job only depends on the one right before
>> it), so the code i originally posted works fine.
>> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>>
>>>
>>> Could you please provide some small code snippets elaborating on how
>>> you
>>> implemented that? I have a similar need as the author of this thread
>>> and I
>>> would appreciate any help. Thanks!
>>>
>>> Cheers,
>>> Sean
>>>
>>>
>>> Joman Chu-2 wrote:
>>>>
>>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>>> work
>>>> well. I've run sequences involving hundreds of MapReduce jobs in a
>>>> for
>>>> loop and it hasn't died on me yet.
>>>>
>>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>>> accomplish a complex task. I believe that the way to do it is as
>>>>> follows:
>>>>>
>>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>>> job....
>>>>> set mappers, reducers, etc
>>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>>> JobClient.runJob(conf);
>>>>>
>>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>>> configuration... JobClient.runJob(conf2)
>>>>>
>>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>>> with
>>>>> this
>>>>> method -- for especially long jobs, the latter MR tasks sometimes
>>>>> do not
>>>>> start up.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Joman Chu
>>>> AIM: ARcanUSNUMquam
>>>> IRC: irc.liquid-silver.net
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Well that's what I need to do also... but Hadoop complains to me when I
attempt to do that. Are you using Eclipse by any chance to develop? The
error I'm getting seems to be stemming from the fact that Hadoop thinks I am
uploading a new jar for EVERY execution of JobClient.runJob() so it fails
indicating the job jar file doesn't exist. Did you have to turn something
on/off to get it to ignore that or are you using a different IDE? Thanks!
Cheers,
Sean
Mori Bellamy wrote:
>
> hey sean,
>
> i later learned that the method i originally posted (configuring
> different JobConfs and then running them, blocking style, with
> JobClient.runJob(conf)) was sufficient for my needs. the reason it was
> failing before was somehow my fault and the bugs somehow got fixed x_X.
>
> Lukas gave me a helpful reply pointing me to TestJobControl.java (in
> the hadoop source directory). it seems like this would be helpful if
> your job dependencies are complex. but for me, i just need to do one
> job after another (and every job only depends on the one right before
> it), so the code i originally posted works fine.
> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>
>>
>> Could you please provide some small code snippets elaborating on how
>> you
>> implemented that? I have a similar need as the author of this thread
>> and I
>> would appreciate any help. Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Joman Chu-2 wrote:
>>>
>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>>> work
>>> well. I've run sequences involving hundreds of MapReduce jobs in a
>>> for
>>> loop and it hasn't died on me yet.
>>>
>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>> accomplish a complex task. I believe that the way to do it is as
>>>> follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>>> job....
>>>> set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>> configuration... JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble
>>>> with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes
>>>> do not
>>>> start up.
>>>>
>>>>
>>>
>>>
>>> --
>>> Joman Chu
>>> AIM: ARcanUSNUMquam
>>> IRC: irc.liquid-silver.net
>>>
>>>
>>>
>>
>
>
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18453200.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: How to chain multiple hadoop jobs?
Posted by Mori Bellamy <mb...@apple.com>.
hey sean,
i later learned that the method i originally posted (configuring
different JobConfs and then running them, blocking style, with
JobClient.runJob(conf)) was sufficient for my needs. the reason it was
failing before was somehow my fault and the bugs somehow got fixed x_X.
Lukas gave me a helpful reply pointing me to TestJobControl.java (in
the hadoop source directory). it seems like this would be helpful if
your job dependencies are complex. but for me, i just need to do one
job after another (and every job only depends on the one right before
it), so the code i originally posted works fine.
On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
>
> Could you please provide some small code snippets elaborating on how
> you
> implemented that? I have a similar need as the author of this thread
> and I
> would appreciate any help. Thanks!
>
> Cheers,
> Sean
>
>
> Joman Chu-2 wrote:
>>
>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to
>> work
>> well. I've run sequences involving hundreds of MapReduce jobs in a
>> for
>> loop and it hasn't died on me yet.
>>
>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>> accomplish a complex task. I believe that the way to do it is as
>>> follows:
>>>
>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure
>>> job....
>>> set mappers, reducers, etc
>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>> JobClient.runJob(conf);
>>>
>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>> configuration... JobClient.runJob(conf2)
>>>
>>> Is this the canonical way to chain jobs? I'm having some trouble
>>> with
>>> this
>>> method -- for especially long jobs, the latter MR tasks sometimes
>>> do not
>>> start up.
>>>
>>>
>>
>>
>> --
>> Joman Chu
>> AIM: ARcanUSNUMquam
>> IRC: irc.liquid-silver.net
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Here is some more complete sample code that is based on my own MapReduce jobs.
//import lots of things
public class MyMapReduceTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyMapReduceTool.class);
        conf.setJobName("SomeName");
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
        // basically i use only sequence files for i/o in most of my jobs
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressionType(CompressionType.BLOCK);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            CompressionType.BLOCK);
        // args parsing
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        conf.setInputPath(in);
        conf.setOutputPath(out);
        // any other config things you might want to do
        JobClient.runJob(conf);
        return 0;
    }

    public static class MapClass extends MapReduceBase implements
            Mapper<Text, Text, Text, Text> {
        public void configure(JobConf job) { // optional method
            // stuff goes here
        }
        public void map(Text key, Text value, OutputCollector<Text, Text>
                output, Reporter reporter) throws IOException {
            // some stuff here
        }
        public void close() { // optional method
            // some stuff here
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {
        public void configure(JobConf job) { // optional method
            // stuff goes here
        }
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // stuff goes here
        }
        public void close() { // this method is optional
            // stuff goes here
        }
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyMapReduceTool(),
            args);
        System.exit(res);
    }
}
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Mon, Jul 14, 2008 at 5:46 PM, Joman Chu <jo...@andrew.cmu.edu> wrote:
> Hi, I don't have the code sitting in front of me at the moment, but
> I'll do some of it from memory and I'll post a real snippet tomorrow
> night. Hopefully, this can get you started
>
> public class MyMainClass {
> public static void main(String[] args) {
> ToolRunner.run(new Configuration(), new ClassThatImplementsTool(), args);
> //make sure you see the API for other trickiness you can do.
> }
> }
>
> public class ClassThatImplementsTool implements Tool {
> public int run(String[] args) {
> //this method gets called by ToolRunner.run
> //do all sorts of configuration here
> //ie, set your Map, Combine, Reduce class
> //look at the Configuration class API
> }
> }
>
> The main thing to know is that ToolRunner.run() will call your
> class's run() method.
>
> Joman Chu
> AIM: ARcanUSNUMquam
> IRC: irc.liquid-silver.net
>
>
> On Mon, Jul 14, 2008 at 4:38 PM, Sean Arietta <sa...@virginia.edu> wrote:
>>
>> Could you please provide some small code snippets elaborating on how you
>> implemented that? I have a similar need as the author of this thread and I
>> would appreciate any help. Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Joman Chu-2 wrote:
>>>
>>> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work
>>> well. I've run sequences involving hundreds of MapReduce jobs in a for
>>> loop and it hasn't died on me yet.
>>>
>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>>> accomplish a complex task. I believe that the way to do it is as follows:
>>>>
>>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job....
>>>> set mappers, reducers, etc
>>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>>> JobClient.runJob(conf);
>>>>
>>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>>> configuration... JobClient.runJob(conf2)
>>>>
>>>> Is this the canonical way to chain jobs? I'm having some trouble with
>>>> this
>>>> method -- for especially long jobs, the latter MR tasks sometimes do not
>>>> start up.
>>>>
>>>>
>>>
>>>
>>> --
>>> Joman Chu
>>> AIM: ARcanUSNUMquam
>>> IRC: irc.liquid-silver.net
>>>
>>>
>>>
>>
>> --
>> View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>>
>
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Hi, I don't have the code sitting in front of me at the moment, but
I'll do some of it from memory and I'll post a real snippet tomorrow
night. Hopefully, this can get you started.
public class MyMainClass {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new ClassThatImplementsTool(), args);
    // make sure you see the API for other trickiness you can do
  }
}

public class ClassThatImplementsTool implements Tool {
  public int run(String[] args) {
    // this method gets called by ToolRunner.run
    // do all sorts of configuration here
    // i.e., set your Map, Combine, Reduce classes
    // look at the Configuration class API
    return 0;
  }
}
The main thing to know is that ToolRunner.run() will call your
class's run() method.
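That callback shape is easy to see without Hadoop at all. The stand-alone sketch below mimics it with made-up SimpleTool/SimpleRunner names (they are not Hadoop classes), just to show how a runner hands control back to your run() method:

```java
// Illustrative, Hadoop-free sketch of the delegation pattern:
// a runner invokes your class's run(), the way ToolRunner.run()
// invokes Tool.run(). All names here are invented for the demo.
interface SimpleTool {
    int run(String[] args);
}

class SimpleRunner {
    static int run(SimpleTool tool, String[] args) {
        // a real ToolRunner would also parse generic options here
        return tool.run(args);
    }
}

public class DelegationDemo implements SimpleTool {
    public int run(String[] args) {
        // job configuration would happen here in the Hadoop case
        System.out.println("run() called with " + args.length + " args");
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = SimpleRunner.run(new DelegationDemo(), new String[]{"a", "b"});
        System.exit(res);
    }
}
```

Chaining jobs then just means calling the blocking job-submission API several times inside run(), in order.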
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Mon, Jul 14, 2008 at 4:38 PM, Sean Arietta <sa...@virginia.edu> wrote:
>
> Could you please provide some small code snippets elaborating on how you
> implemented that? I have a similar need as the author of this thread and I
> would appreciate any help. Thanks!
>
> Cheers,
> Sean
>
Re: How to chain multiple hadoop jobs?
Posted by Sean Arietta <sa...@virginia.edu>.
Could you please provide some small code snippets elaborating on how you
implemented that? I have a similar need as the author of this thread and I
would appreciate any help. Thanks!
Cheers,
Sean
Joman Chu-2 wrote:
>
> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work
> well. I've run sequences involving hundreds of MapReduce jobs in a for
> loop and it hasn't died on me yet.
>
--
View this message in context: http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: How to chain multiple hadoop jobs?
Posted by Joman Chu <jo...@andrew.cmu.edu>.
Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work well. I've run sequences involving hundreds of MapReduce jobs in a for loop and it hasn't died on me yet.
On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
> Hey all, I'm trying to chain multiple mapreduce jobs together to
> accomplish a complex task. I believe that the way to do it is as follows:
>
> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job....
> set mappers, reducers, etc
> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
> JobClient.runJob(conf);
>
> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
> configuration... JobClient.runJob(conf2)
>
> Is this the canonical way to chain jobs? I'm having some trouble with this
> method -- for especially long jobs, the latter MR tasks sometimes do not
> start up.
>
>
--
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
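One detail worth flagging in the snippet quoted above: the second call passes conf rather than conf2 to setInputPath, so the second job never receives its input path. A corrected sketch of the sequential pattern (old mapred API; MyClass and myPath1 are from the quote, everything else is a placeholder):

```java
// Inside a Tool's run() method, sketching the corrected sequence.
JobConf conf = new JobConf(getConf(), MyClass.class);
// ... configure job: set mappers, reducers, key/value classes ...
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(conf, myPath1);
JobClient.runJob(conf); // runJob() blocks, so the chain is naturally sequential

JobConf conf2 = new JobConf(getConf(), MyClass.class);
conf2.setInputFormat(SequenceFileInputFormat.class);
SequenceFileInputFormat.setInputPath(conf2, myPath1); // conf2, not conf
// ... more configuration ...
JobClient.runJob(conf2);
```

A misconfiguration like this would also explain later jobs in a long chain appearing never to start, since the second job is being pointed at the wrong (or no) input.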