Posted to mapreduce-user@hadoop.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/09/27 21:09:47 UTC

output from one map reduce job as the input to another map reduce job?

Is it possible to connect the output of one map reduce job so that it is the
input to another map reduce job?

Basically, reduce() would output keys that are then passed to another map()
function without having to store intermediate data to the filesystem.

Kevin

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: output from one map reduce job as the input to another map reduce job?

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Sep 27, 2011, at 12:09 PM, Kevin Burton wrote:

> Is it possible to connect the output of one map reduce job so that it is the input to another map reduce job.
> 
> Basically… then reduce() outputs a key, that will be passed to another map() function without having to store intermediate data to the filesystem.
> 

Currently there is no way to pipeline in such a manner - with hadoop-0.23 it's doable, but will take more effort.

Arun


Re: output from one map reduce job as the input to another map reduce job?

Posted by Niels Basjes <Ni...@basjes.nl>.
To me it sounds like the asker should check out tools like Storm and S4
instead of Hadoop.

http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop

-- 
Kind regards,
Niels Basjes
On 27 Sep 2011 22:38, "Mike Spreitzer" <ms...@us.ibm.com> wrote:
> It looks to me like Oozie will not do what was asked. In
> http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html#a0_Definitions
> I see:
>
> 3.2.2 Map-Reduce Action
> ...
> The workflow job will wait until the Hadoop map/reduce job completes
> before continuing to the next action in the workflow execution path.
>
> That implies to me that the output of one job is held in some intermediate
> storage (likely HDFS) for a while before being read by the consuming
> job(s).
>
> Regards,
> Mike Spreitzer

Re: output from one map reduce job as the input to another map reduce job?

Posted by Mike Spreitzer <ms...@us.ibm.com>.
It looks to me like Oozie will not do what was asked.  In 
http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html#a0_Definitions 
I see:

3.2.2 Map-Reduce Action
...
The workflow job will wait until the Hadoop map/reduce job completes 
before continuing to the next action in the workflow execution path.

That implies to me that the output of one job is held in some intermediate 
storage (likely HDFS) for a while before being read by the consuming 
job(s).

Regards,
Mike Spreitzer

Re: output from one map reduce job as the input to another map reduce job?

Posted by Marcos Luis Ortiz Valmaseda <ma...@googlemail.com>.
Have you considered Oozie for this? It's a workflow engine developed by the
Yahoo! engineers.
Yahoo/oozie at GitHub
https://github.com/yahoo/oozie

Oozie at InfoQ
http://www.infoq.com/articles/introductionOozie

Oozie's examples:
http://www.infoq.com/articles/oozieexample
http://yahoo.github.com/oozie/releases/2.3.0/DG_Examples.html

Oozie at Cloudera
https://ccp.cloudera.com/display/CDHDOC/Oozie+Installation
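A rough sketch of what chaining two map-reduce actions looks like in an Oozie
workflow.xml, based on the workflow spec linked above. All names and paths
here are placeholders; note that the handoff between the two actions still
goes through an HDFS directory:

```xml
<workflow-app name="chained-jobs" xmlns="uri:oozie:workflow:0.1">
  <start to="first-job"/>

  <action name="first-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.input.dir</name><value>/data/input</value></property>
        <property><name>mapred.output.dir</name><value>/data/intermediate</value></property>
      </configuration>
    </map-reduce>
    <ok to="second-job"/>
    <error to="fail"/>
  </action>

  <!-- The second action reads the directory the first action wrote. -->
  <action name="second-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.input.dir</name><value>/data/intermediate</value></property>
        <property><name>mapred.output.dir</name><value>/data/output</value></property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```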

Regards

2011/9/27 Arko Provo Mukherjee <ar...@gmail.com>

> Hi,
>
> I am not sure how you can avoid the filesystem, however, I did it as
> follows:
>
> // For Job 1
> FileInputFormat.addInputPath(job1, new Path(args[0]));
> FileOutputFormat.setOutputPath(job1, new Path(args[1]));
>
> // For job 2
> FileInputFormat.addInputPath(job2, new Path(args[1]));
> FileOutputFormat.setOutputPath(job2, new Path(args[2]));
>
> Assuming
> args[0] --> Input to first mapper
> args[1] --> Output of first reducer / Input to second mapper
> args[2] --> Output of second reducer
>
> Hope this helps!
> Warm regards
> Arko
>
> On Tue, Sep 27, 2011 at 2:09 PM, Kevin Burton <bu...@spinn3r.com> wrote:
> > Is it possible to connect the output of one map reduce job so that it is
> > the input to another map reduce job.
> > Basically… then reduce() outputs a key, that will be passed to another
> > map() function without having to store intermediate data to the filesystem.
> > Kevin
> >
> > --
> >
> > Founder/CEO Spinn3r.com
> >
> > Location: San Francisco, CA
> > Skype: burtonator
> >
> > Skype-in: (415) 871-0687
> >
>



-- 
Marcos Luis Ortíz Valmaseda
 Linux Infrastructure Engineer
 Linux User # 418229
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186

Re: output from one map reduce job as the input to another map reduce job?

Posted by Arko Provo Mukherjee <ar...@gmail.com>.
Hi,

I am not sure how you can avoid the filesystem; however, I did it as follows:

// For Job 1
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));

// For job 2
FileInputFormat.addInputPath(job2, new Path(args[1]));
FileOutputFormat.setOutputPath(job2, new Path(args[2]));

Assuming
args[0] --> Input to first mapper
args[1] --> Output of first reducer / Input to second mapper
args[2] --> Output of second reducer
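To make the dataflow concrete without needing a cluster, here is a toy,
Hadoop-free simulation of that two-stage pipeline in plain Java (this is not
the Hadoop API, just an illustration of the handoff). Stage 1 is a word
count; stage 2 consumes stage 1's output directly, with an in-memory Map
standing in for the HDFS directory args[1]:

```java
import java.util.*;

public class ChainedJobsSketch {
    // Stage 1 (word count): map() emits (word, 1) for each word, the
    // framework groups by key, and reduce() sums the ones per word.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {           // map()
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;           // reduce()
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Stage 2: takes stage 1's output as its input. map() emits (count, 1)
    // and reduce() sums them, giving "how many words occurred c times";
    // the two phases are fused into one merge() call here for brevity.
    static Map<Integer, Integer> countHistogram(Map<String, Integer> counts) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (int c : counts.values()) {
            hist.merge(c, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        Map<String, Integer> stage1 = wordCount(Arrays.asList("a b a", "b c"));
        Map<Integer, Integer> stage2 = countHistogram(stage1);
        System.out.println(stage1 + " -> " + stage2);
        // prints: {a=2, b=2, c=1} -> {1=1, 2=2}
    }
}
```

In real Hadoop the equivalent handoff is running the two jobs sequentially
(the first must finish before the second starts) with job1's output path
given as job2's input path, exactly as in the snippets above.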

Hope this helps!
Warm regards
Arko

On Tue, Sep 27, 2011 at 2:09 PM, Kevin Burton <bu...@spinn3r.com> wrote:
> Is it possible to connect the output of one map reduce job so that it is the
> input to another map reduce job.
> Basically… then reduce() outputs a key, that will be passed to another map()
> function without having to store intermediate data to the filesystem.
> Kevin
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: San Francisco, CA
> Skype: burtonator
>
> Skype-in: (415) 871-0687
>