Posted to common-user@hadoop.apache.org by Aaron Baff <Aa...@telescope.tv> on 2011/09/29 01:56:18 UTC

Running multiple MR Jobs in sequence

Is it possible to submit a series of MR Jobs to the JobTracker to run in sequence (one finishes, take its output if successful and feed it into the next, etc.), or does it need to run client-side, using JobControl, something like Oozie, or something we roll ourselves? What I'm looking for is fire-and-forget: occasionally check back to see if it's done, so the client side doesn't really need to know or keep track of anything. Does something like that exist within the Hadoop framework?

--Aaron

Re: Running multiple MR Jobs in sequence

Posted by Raj V <ra...@yahoo.com>.
Can't this be done with a simple shell script?


Raj



>________________________________
>From: Aaron Baff <Aa...@telescope.tv>
>To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
>Sent: Wednesday, September 28, 2011 4:56 PM
>Subject: Running multiple MR Jobs in sequence
>
>Is it possible to submit a series of MR Jobs to the JobTracker to run in sequence (one finishes, take its output if successful and feed it into the next, etc.), or does it need to run client-side, using JobControl, something like Oozie, or something we roll ourselves? What I'm looking for is fire-and-forget: occasionally check back to see if it's done, so the client side doesn't really need to know or keep track of anything. Does something like that exist within the Hadoop framework?
>
>--Aaron
>
>
>

Re: Running multiple MR Jobs in sequence

Posted by Joey Echeverria <jo...@cloudera.com>.
I would definitely check out Oozie for this use case.

-Joey

On Thu, Sep 29, 2011 at 12:51 PM, Aaron Baff <Aa...@telescope.tv> wrote:
> I saw this, but wasn't sure if it was something that ran on the client and just submitted the Jobs in sequence, or if it handed them all to the JobTracker, with the JobTracker taking care of submitting the Jobs in sequence appropriately.
>
> Basically, I'm looking for a completely stateless client that doesn't need to ping the JobTracker every now and then to see if a Job has completed and then submit the next one. The ideal flow would be: the client gets a request to run the series of Jobs, preps and configures them all, and then passes them off to the JobTracker, which runs them all in order without the client application needing to do anything further.
>
> Sounds like that doesn't really exist as part of the Hadoop framework, and needs something like Oozie (or a home-built system) to do this.
>
> --Aaron
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Wednesday, September 28, 2011 9:37 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Jobs in sequence
>
> Within the Hadoop core project, there is JobControl you can utilize
> for this. You can view its API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> and it is fairly simple to use (create jobs with the regular Java API,
> then build a dependency flow using JobControl atop these JobConf objects).
>
> Apache Oozie and other such tools offer higher abstractions for
> controlling a workflow, and can be considered when your needs get a bit
> more complex than just a series (they make it easy to handle failure
> scenarios between dependent jobs, perform minor fs operations in
> pre/post processing, etc.).
>
> On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv> wrote:
>> Is it possible to submit a series of MR Jobs to the JobTracker to run in sequence (one finishes, take its output if successful and feed it into the next, etc.), or does it need to run client-side, using JobControl, something like Oozie, or something we roll ourselves? What I'm looking for is fire-and-forget: occasionally check back to see if it's done, so the client side doesn't really need to know or keep track of anything. Does something like that exist within the Hadoop framework?
>>
>> --Aaron
>>
>
>
>
> --
> Harsh J
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Running multiple MR Jobs in sequence

Posted by John Conwell <jo...@iamjohn.me>.
If you are running on EC2, you can use Elastic MapReduce. It has a startup
option where you specify the driver class in your jar, and it will run the
driver, I believe, on the master node. That won't really add any overhead,
because while the cluster is busy the driver is just sitting quietly
waiting for the job to complete, and vice versa.

Just an option.

On Thu, Sep 29, 2011 at 10:53 AM, Aaron Baff <Aa...@telescope.tv> wrote:

> Yea, we don't want it to sit there waiting for the Job to complete, even if
> it's just a few minutes.
>
> --Aaron
> -----Original Message-----
> From: turbocodr@gmail.com [mailto:turbocodr@gmail.com] On Behalf Of John
> Conwell
> Sent: Thursday, September 29, 2011 10:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Jobs in sequence
>
> After you kick off a job, say JobA, your client doesn't need to sit and
> ping Hadoop to see if it finished before it starts JobB. You can have the
> client block until the job is complete with "Job.waitForCompletion(boolean
> verbose)". Using this you can create a "job driver" that chains jobs
> together easily.
>
> Now, if your job takes 2 weeks to run, you can't kill your driver process.
> If you do, JobA will finish running, but JobB will never start.
>
> JohnC
>
> On Thu, Sep 29, 2011 at 9:51 AM, Aaron Baff <Aa...@telescope.tv>
> wrote:
>
> > I saw this, but wasn't sure if it was something that ran on the client
> > and just submitted the Jobs in sequence, or if it handed them all to the
> > JobTracker, with the JobTracker taking care of submitting the Jobs in
> > sequence appropriately.
> >
> > Basically, I'm looking for a completely stateless client that doesn't
> > need to ping the JobTracker every now and then to see if a Job has
> > completed and then submit the next one. The ideal flow would be: the
> > client gets a request to run the series of Jobs, preps and configures
> > them all, and then passes them off to the JobTracker, which runs them
> > all in order without the client application needing to do anything
> > further.
> >
> > Sounds like that doesn't really exist as part of the Hadoop framework,
> > and needs something like Oozie (or a home-built system) to do this.
> >
> > --Aaron
> > -----Original Message-----
> > From: Harsh J [mailto:harsh@cloudera.com]
> > Sent: Wednesday, September 28, 2011 9:37 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Running multiple MR Jobs in sequence
> >
> > Within the Hadoop core project, there is JobControl you can utilize
> > for this. You can view its API at
> > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> > and it is fairly simple to use (create jobs with the regular Java API,
> > then build a dependency flow using JobControl atop these JobConf objects).
> >
> > Apache Oozie and other such tools offer higher abstractions for
> > controlling a workflow, and can be considered when your needs get a bit
> > more complex than just a series (they make it easy to handle failure
> > scenarios between dependent jobs, perform minor fs operations in
> > pre/post processing, etc.).
> >
> > On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv>
> > wrote:
> > > Is it possible to submit a series of MR Jobs to the JobTracker to
> > > run in sequence (one finishes, take its output if successful and feed
> > > it into the next, etc.), or does it need to run client-side, using
> > > JobControl, something like Oozie, or something we roll ourselves? What
> > > I'm looking for is fire-and-forget: occasionally check back to see if
> > > it's done, so the client side doesn't really need to know or keep
> > > track of anything. Does something like that exist within the Hadoop
> > > framework?
> > >
> > > --Aaron
> > >
> >
> >
> >
> > --
> > Harsh J
> >
>
>
>
> --
>
> Thanks,
> John C
>



-- 

Thanks,
John C

RE: Running multiple MR Jobs in sequence

Posted by Aaron Baff <Aa...@telescope.tv>.
Yea, we don't want it to sit there waiting for the Job to complete, even if it's just a few minutes.

--Aaron
-----Original Message-----
From: turbocodr@gmail.com [mailto:turbocodr@gmail.com] On Behalf Of John Conwell
Sent: Thursday, September 29, 2011 10:50 AM
To: common-user@hadoop.apache.org
Subject: Re: Running multiple MR Jobs in sequence

After you kick off a job, say JobA, your client doesn't need to sit and ping
Hadoop to see if it finished before it starts JobB. You can have the client
block until the job is complete with "Job.waitForCompletion(boolean
verbose)". Using this you can create a "job driver" that chains jobs
together easily.

Now, if your job takes 2 weeks to run, you can't kill your driver process.
If you do, JobA will finish running, but JobB will never start.

JohnC

On Thu, Sep 29, 2011 at 9:51 AM, Aaron Baff <Aa...@telescope.tv> wrote:

> I saw this, but wasn't sure if it was something that ran on the client and
> just submitted the Jobs in sequence, or if it handed them all to the
> JobTracker, with the JobTracker taking care of submitting the Jobs in
> sequence appropriately.
>
> Basically, I'm looking for a completely stateless client that doesn't need
> to ping the JobTracker every now and then to see if a Job has completed and
> then submit the next one. The ideal flow would be: the client gets a
> request to run the series of Jobs, preps and configures them all, and then
> passes them off to the JobTracker, which runs them all in order without
> the client application needing to do anything further.
>
> Sounds like that doesn't really exist as part of the Hadoop framework, and
> needs something like Oozie (or a home-built system) to do this.
>
> --Aaron
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Wednesday, September 28, 2011 9:37 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Jobs in sequence
>
> Within the Hadoop core project, there is JobControl you can utilize
> for this. You can view its API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> and it is fairly simple to use (create jobs with the regular Java API,
> then build a dependency flow using JobControl atop these JobConf objects).
>
> Apache Oozie and other such tools offer higher abstractions for
> controlling a workflow, and can be considered when your needs get a bit
> more complex than just a series (they make it easy to handle failure
> scenarios between dependent jobs, perform minor fs operations in
> pre/post processing, etc.).
>
> On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv>
> wrote:
> > Is it possible to submit a series of MR Jobs to the JobTracker to run
> > in sequence (one finishes, take its output if successful and feed it
> > into the next, etc.), or does it need to run client-side, using
> > JobControl, something like Oozie, or something we roll ourselves? What
> > I'm looking for is fire-and-forget: occasionally check back to see if
> > it's done, so the client side doesn't really need to know or keep track
> > of anything. Does something like that exist within the Hadoop framework?
> >
> > --Aaron
> >
>
>
>
> --
> Harsh J
>



--

Thanks,
John C

Re: Running multiple MR Jobs in sequence

Posted by John Conwell <jo...@iamjohn.me>.
After you kick off a job, say JobA, your client doesn't need to sit and ping
Hadoop to see if it finished before it starts JobB. You can have the client
block until the job is complete with "Job.waitForCompletion(boolean
verbose)". Using this you can create a "job driver" that chains jobs
together easily.

Now, if your job takes 2 weeks to run, you can't kill your driver process.
If you do, JobA will finish running, but JobB will never start.
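A minimal sketch of such a driver (assuming the org.apache.hadoop.mapreduce
API of the 0.20/1.x line; the paths here are made up, and the mapper/reducer
setup is elided so the identity defaults apply):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // JobA: reads the raw input, writes an intermediate directory.
            Job jobA = new Job(conf, "JobA");
            jobA.setJarByClass(JobDriver.class);
            FileInputFormat.addInputPath(jobA, new Path("/data/input"));
            FileOutputFormat.setOutputPath(jobA, new Path("/data/step1"));

            // Block here until JobA finishes; bail out if it failed.
            if (!jobA.waitForCompletion(true)) {
                System.exit(1);
            }

            // JobB: consumes JobA's output directory as its input.
            Job jobB = new Job(conf, "JobB");
            jobB.setJarByClass(JobDriver.class);
            FileInputFormat.addInputPath(jobB, new Path("/data/step1"));
            FileOutputFormat.setOutputPath(jobB, new Path("/data/output"));

            System.exit(jobB.waitForCompletion(true) ? 0 : 1);
        }
    }

The catch, as noted above, is that this driver JVM has to stay alive for the
whole chain.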

JohnC

On Thu, Sep 29, 2011 at 9:51 AM, Aaron Baff <Aa...@telescope.tv> wrote:

> I saw this, but wasn't sure if it was something that ran on the client and
> just submitted the Jobs in sequence, or if it handed them all to the
> JobTracker, with the JobTracker taking care of submitting the Jobs in
> sequence appropriately.
>
> Basically, I'm looking for a completely stateless client that doesn't need
> to ping the JobTracker every now and then to see if a Job has completed and
> then submit the next one. The ideal flow would be: the client gets a
> request to run the series of Jobs, preps and configures them all, and then
> passes them off to the JobTracker, which runs them all in order without
> the client application needing to do anything further.
>
> Sounds like that doesn't really exist as part of the Hadoop framework, and
> needs something like Oozie (or a home-built system) to do this.
>
> --Aaron
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Wednesday, September 28, 2011 9:37 PM
> To: common-user@hadoop.apache.org
> > Subject: Re: Running multiple MR Jobs in sequence
>
> > Within the Hadoop core project, there is JobControl you can utilize
> > for this. You can view its API at
> > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> > and it is fairly simple to use (create jobs with the regular Java API,
> > then build a dependency flow using JobControl atop these JobConf objects).
> >
> > Apache Oozie and other such tools offer higher abstractions for
> > controlling a workflow, and can be considered when your needs get a bit
> > more complex than just a series (they make it easy to handle failure
> > scenarios between dependent jobs, perform minor fs operations in
> > pre/post processing, etc.).
>
> On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv>
> wrote:
> > Is it possible to submit a series of MR Jobs to the JobTracker to run
> > in sequence (one finishes, take its output if successful and feed it
> > into the next, etc.), or does it need to run client-side, using
> > JobControl, something like Oozie, or something we roll ourselves? What
> > I'm looking for is fire-and-forget: occasionally check back to see if
> > it's done, so the client side doesn't really need to know or keep track
> > of anything. Does something like that exist within the Hadoop framework?
> >
> > --Aaron
> >
>
>
>
> --
> Harsh J
>



-- 

Thanks,
John C

RE: Running multiple MR Jobs in sequence

Posted by Aaron Baff <Aa...@telescope.tv>.
I saw this, but wasn't sure if it was something that ran on the client and just submitted the Jobs in sequence, or if it handed them all to the JobTracker, with the JobTracker taking care of submitting the Jobs in sequence appropriately.

Basically, I'm looking for a completely stateless client that doesn't need to ping the JobTracker every now and then to see if a Job has completed and then submit the next one. The ideal flow would be: the client gets a request to run the series of Jobs, preps and configures them all, and then passes them off to the JobTracker, which runs them all in order without the client application needing to do anything further.

Sounds like that doesn't really exist as part of the Hadoop framework, and needs something like Oozie (or a home-built system) to do this.

--Aaron
-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Wednesday, September 28, 2011 9:37 PM
To: common-user@hadoop.apache.org
Subject: Re: Running multiple MR Jobs in sequence

Within the Hadoop core project, there is JobControl you can utilize
for this. You can view its API at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
and it is fairly simple to use (create jobs with the regular Java API,
then build a dependency flow using JobControl atop these JobConf objects).

Apache Oozie and other such tools offer higher abstractions for
controlling a workflow, and can be considered when your needs get a bit
more complex than just a series (they make it easy to handle failure
scenarios between dependent jobs, perform minor fs operations in
pre/post processing, etc.).

On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv> wrote:
> Is it possible to submit a series of MR Jobs to the JobTracker to run in sequence (one finishes, take its output if successful and feed it into the next, etc.), or does it need to run client-side, using JobControl, something like Oozie, or something we roll ourselves? What I'm looking for is fire-and-forget: occasionally check back to see if it's done, so the client side doesn't really need to know or keep track of anything. Does something like that exist within the Hadoop framework?
>
> --Aaron
>



--
Harsh J

Re: Running multiple MR Jobs in sequence

Posted by Harsh J <ha...@cloudera.com>.
Within the Hadoop core project, there is JobControl you can utilize
for this. You can view its API at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
and it is fairly simple to use (create jobs with the regular Java API,
then build a dependency flow using JobControl atop these JobConf objects).

Apache Oozie and other such tools offer higher abstractions for
controlling a workflow, and can be considered when your needs get a bit
more complex than just a series (they make it easy to handle failure
scenarios between dependent jobs, perform minor fs operations in
pre/post processing, etc.).
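
A minimal JobControl sketch along those lines (using the old
org.apache.hadoop.mapred API that the javadoc above covers; the job setup
and paths are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class JobControlDriver {
        public static void main(String[] args) throws Exception {
            // First job: input -> intermediate.
            JobConf confA = new JobConf(JobControlDriver.class);
            confA.setJobName("jobA");
            FileInputFormat.setInputPaths(confA, new Path("/data/input"));
            FileOutputFormat.setOutputPath(confA, new Path("/data/step1"));

            // Second job: reads the first job's output.
            JobConf confB = new JobConf(JobControlDriver.class);
            confB.setJobName("jobB");
            FileInputFormat.setInputPaths(confB, new Path("/data/step1"));
            FileOutputFormat.setOutputPath(confB, new Path("/data/output"));

            // Wrap the JobConfs and declare that jobB depends on jobA.
            Job jobA = new Job(confA);
            Job jobB = new Job(confB);
            jobB.addDependingJob(jobA);

            // JobControl submits each job once its dependencies have
            // completed successfully. Note it runs in the client JVM,
            // so this is still client-side sequencing.
            JobControl control = new JobControl("sequence");
            control.addJob(jobA);
            control.addJob(jobB);

            Thread runner = new Thread(control);
            runner.start();
            while (!control.allFinished()) {
                Thread.sleep(5000);
            }
            control.stop();
        }
    }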

On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aa...@telescope.tv> wrote:
> Is it possible to submit a series of MR Jobs to the JobTracker to run in sequence (one finishes, take its output if successful and feed it into the next, etc.), or does it need to run client-side, using JobControl, something like Oozie, or something we roll ourselves? What I'm looking for is fire-and-forget: occasionally check back to see if it's done, so the client side doesn't really need to know or keep track of anything. Does something like that exist within the Hadoop framework?
>
> --Aaron
>



-- 
Harsh J

Re: Running multiple MR Jobs in sequence

Posted by Arko Provo Mukherjee <ar...@gmail.com>.
Hi,

The way I did it is to have multiple JobConfs and run them one after
another in the program, as per the logic.

The setOutputPath of the previous job can be the setInputPath of the next
one if you want to take the output from the previous job and feed it as
input to the next.

Thanks & regards
Arko

On Wed, Sep 28, 2011 at 6:56 PM, Aaron Baff <Aa...@telescope.tv> wrote:

> Is it possible to submit a series of MR Jobs to the JobTracker to run in
> sequence (one finishes, take its output if successful and feed it into the
> next, etc.), or does it need to run client-side, using JobControl,
> something like Oozie, or something we roll ourselves? What I'm looking for
> is fire-and-forget: occasionally check back to see if it's done, so the
> client side doesn't really need to know or keep track of anything. Does
> something like that exist within the Hadoop framework?
>
> --Aaron
>