Posted to user@hadoop.apache.org by Sean McNamara <Se...@Webtrends.com> on 2012/11/23 23:22:12 UTC

Multi-stage map/reduce jobs

It's not clear to me how to stitch together multiple map/reduce jobs.  Without using Cascading or something else like it, is the method basically to write to an intermediate spot and have the next stage read from there?

If so, how are jobs supposed to clean up the temp/intermediate data they create?  What happens if stage 1 completes but stage 2 doesn't? Do the stage 1 files get left around?

Does anyone have some insight they could share?

Thanks.

Re: Multi-stage map/reduce jobs

Posted by Jay Vyas <ja...@gmail.com>.
Hadoop is not an API for orchestrating mapreduce jobs - fortunately, there is no need for such an API.  Each mapreduce job can simply be run like a normal Java class.

So, how do you run multiple mapreduce jobs?

Easy: you create a main() method in a single class which runs each job in sequence, calling each job's waitForCompletion() method, which blocks until that job completes.  The next job is only submitted once the previous one has finished.
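
As a minimal sketch of that pattern (the class name, argument layout, and the commented-out mapper/reducer setup are hypothetical, not from this thread), note that the driver itself deletes the intermediate directory at the end - Hadoop will not clean it up for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // stage 1 writes here, stage 2 reads it
    Path output = new Path(args[2]);

    Job stage1 = new Job(conf, "stage 1");
    stage1.setJarByClass(TwoStageDriver.class);
    // stage1.setMapperClass(...), stage1.setReducerClass(...), output
    // key/value classes, etc. would be configured here.
    FileInputFormat.addInputPath(stage1, input);
    FileOutputFormat.setOutputPath(stage1, intermediate);

    // waitForCompletion(true) blocks until the job finishes and returns
    // false on failure, so stage 2 only runs if stage 1 succeeded.
    if (!stage1.waitForCompletion(true)) {
      System.exit(1);
    }

    Job stage2 = new Job(conf, "stage 2");
    stage2.setJarByClass(TwoStageDriver.class);
    FileInputFormat.addInputPath(stage2, intermediate);
    FileOutputFormat.setOutputPath(stage2, output);
    boolean ok = stage2.waitForCompletion(true);

    // The driver, not Hadoop, removes the intermediate data.
    FileSystem.get(conf).delete(intermediate, true);
    System.exit(ok ? 0 : 1);
  }
}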

Jay Vyas 
http://jayunit100.blogspot.com

On Nov 23, 2012, at 5:22 PM, Sean McNamara <Se...@Webtrends.com> wrote:

> It's not clear to me how to stitch together multiple map/reduce jobs.  Without using Cascading or something else like it, is the method basically to write to an intermediate spot and have the next stage read from there?
> 
> If so, how are jobs supposed to clean up the temp/intermediate data they create?  What happens if stage 1 completes but stage 2 doesn't? Do the stage 1 files get left around?
> 
> Does anyone have some insight they could share?
> 
> Thanks.

Re: Multi-stage map/reduce jobs

Posted by Bertrand Dechoux <de...@gmail.com>.
I will second Harsh about JobControl.

It is indeed not the role of Hadoop to provide a full workflow engine in
its core, but JobControl allows you to define a graph of dependent jobs and
run them as one from a programmatic point of view. Of course, if you were to
compare it to a cascade in Cascading, you would be responsible for cleaning
up 'temporary' results and building your own 'results cache'.

http://hadoop.apache.org/docs/r1.0.4/api/index.html?org/apache/hadoop/mapred/jobcontrol/JobControl.html
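
As a rough sketch of that wiring (this uses the newer org.apache.hadoop.mapreduce.lib.jobcontrol classes rather than the old mapred ones linked above; the two jobs are assumed to be fully configured elsewhere, and all names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class JobControlSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // In a real driver these would be fully configured with mappers,
    // reducers and input/output paths; elided here for brevity.
    Job stage1Job = new Job(conf, "stage 1");
    Job stage2Job = new Job(conf, "stage 2");

    ControlledJob stage1 = new ControlledJob(stage1Job, null);
    ControlledJob stage2 = new ControlledJob(stage2Job, null);
    stage2.addDependingJob(stage1);  // stage 2 runs only if stage 1 succeeds

    JobControl control = new JobControl("two-stage-flow");
    control.addJob(stage1);
    control.addJob(stage2);

    // JobControl is a Runnable that submits jobs as their dependencies
    // are satisfied; run it in a thread and poll until it is done.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    System.out.println("Failed jobs: " + control.getFailedJobList());
    control.stop();
  }
}

Note that, as said above, nothing here removes stage 1's output once stage 2 has consumed it; that cleanup remains your code's responsibility.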

Bertrand

On Sat, Nov 24, 2012 at 8:27 AM, Harsh J <ha...@cloudera.com> wrote:

> You probably want something like Oozie, which provides DAG-like flows
> for jobs, so you can easily express "upon-failure" and "upon-success"
> conditions, as well as incorporate more complex logic.
>
> Otherwise, I guess you could do what Jay has suggested, or look at the
> JobControl classes to avoid some of the extra work needed.
>
> On Sat, Nov 24, 2012 at 3:52 AM, Sean McNamara
> <Se...@webtrends.com> wrote:
> > It's not clear to me how to stitch together multiple map/reduce jobs.
> > Without using Cascading or something else like it, is the method basically
> > to write to an intermediate spot and have the next stage read from there?
> >
> > If so, how are jobs supposed to clean up the temp/intermediate data they
> > create?  What happens if stage 1 completes but stage 2 doesn't? Do the
> > stage 1 files get left around?
> >
> > Does anyone have some insight they could share?
> >
> > Thanks.
>
>
>
> --
> Harsh J
>



-- 
Bertrand Dechoux

Re: Multi-stage map/reduce jobs

Posted by Radim Kolar <hs...@filez.com>.
> Otherwise, I guess you could do what Jay has suggested, or look at the
> JobControl classes to avoid some of the extra work needed.
JobControl needs to be server-side: you should be able to just submit a bunch of jobs and exit.

As it is implemented now, it's a waste of time to use unless you are a
beginner just learning Hadoop. The API is clumsy, and 3rd-party libraries
(Spring Batch, for example) do a much better job.

Re: Multi-stage map/reduce jobs

Posted by Harsh J <ha...@cloudera.com>.
You probably want something like Oozie, which provides DAG-like flows
for jobs, so you can easily express "upon-failure" and "upon-success"
conditions, as well as incorporate more complex logic.

Otherwise, I guess you could do what Jay has suggested, or look at the
JobControl classes to avoid some of the extra work needed.
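
To make those transitions concrete, here is a hypothetical Oozie workflow sketch (the action names, paths, and properties are invented for illustration): each action declares an "upon-success" (<ok>) and an "upon-failure" (<error>) transition, and a <prepare> block can delete stale intermediate output before a re-run:

<workflow-app name="two-stage" xmlns="uri:oozie:workflow:0.2">
  <start to="stage1"/>

  <action name="stage1">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <!-- clear stale intermediate output before (re)running stage 1 -->
        <delete path="${nameNode}/tmp/stage1-out"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.output.dir</name>
          <value>/tmp/stage1-out</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="stage2"/>       <!-- the "upon-success" transition -->
    <error to="fail"/>      <!-- the "upon-failure" transition -->
  </action>

  <action name="stage2">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/tmp/stage1-out</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Stage failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>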

On Sat, Nov 24, 2012 at 3:52 AM, Sean McNamara
<Se...@webtrends.com> wrote:
> It's not clear to me how to stitch together multiple map/reduce jobs.
> Without using Cascading or something else like it, is the method basically
> to write to an intermediate spot and have the next stage read from there?
>
> If so, how are jobs supposed to clean up the temp/intermediate data they
> create?  What happens if stage 1 completes but stage 2 doesn't? Do the
> stage 1 files get left around?
>
> Does anyone have some insight they could share?
>
> Thanks.



-- 
Harsh J
