Posted to common-user@hadoop.apache.org by ma qiang <ma...@gmail.com> on 2008/02/21 07:21:13 UTC

how to set the result of the first mapreduce program as the input of the second mapreduce program?

Hi all:
     Here I have two MapReduce programs. I need to use the result of the
first MapReduce program to compute other values, which are generated in
the second MapReduce program, and this intermediate result does not need
to be saved. So I want to run the second MapReduce program automatically,
using the output of the first MapReduce program as the input of the second
MapReduce program. Who can tell me how?
     Thanks!
     Best Wishes!

Qiang

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Feb 20, 2008, at 10:21 PM, ma qiang wrote:

> Hi all:
>      Here I have two MapReduce programs. I need to use the result of the
> first MapReduce program to compute other values, which are generated in
> the second MapReduce program, and this intermediate result does not need
> to be saved. So I want to run the second MapReduce program automatically,
> using the output of the first MapReduce program as the input of the second
> MapReduce program. Who can tell me how?
>      Thanks!
>      Best Wishes!
>

The output of the first job goes to HDFS (the output directory you
specified for that job); use that same directory as the input directory
for your next job.
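
For illustration, here is a minimal sketch of that pattern against the
0.16-era 'mapred' API. The paths, class name and job names below are made
up, and IdentityMapper/IdentityReducer stand in for your own map and
reduce classes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path input        = new Path("/user/qiang/input");      // hypothetical
    Path intermediate = new Path("/user/qiang/pass1-out");  // hypothetical
    Path output       = new Path("/user/qiang/pass2-out");  // hypothetical

    // First job: writes its result into the intermediate directory.
    JobConf first = new JobConf(ChainedJobs.class);
    first.setJobName("pass-1");
    first.setMapperClass(IdentityMapper.class);    // replace with your mapper
    first.setReducerClass(IdentityReducer.class);  // replace with your reducer
    first.setInputPath(input);
    first.setOutputPath(intermediate);
    JobClient.runJob(first);   // blocks until the first job completes

    // Second job: reads the first job's output directory as its input.
    JobConf second = new JobConf(ChainedJobs.class);
    second.setJobName("pass-2");
    second.setMapperClass(IdentityMapper.class);
    second.setReducerClass(IdentityReducer.class);
    second.setInputPath(intermediate);
    second.setOutputPath(output);
    JobClient.runJob(second);
  }
}

Because JobClient.runJob() blocks until the job completes, the second job
only starts after the first job's output has been fully written.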

There is a 'jobcontrol' API which you can use to chain jobs:
http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
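
A rough sketch of using that package, assuming the same two JobConf
objects ('first' and 'second') as in the sketch above; the class name and
group name are made up, so please check the method signatures against
your Hadoop version:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlChain {
  // 'first' and 'second' are the two configured JobConf objects.
  public static void runChained(JobConf first, JobConf second) throws Exception {
    Job job1 = new Job(first, null);   // no dependencies
    Job job2 = new Job(second, null);
    job2.addDependingJob(job1);        // job2 starts only after job1 succeeds

    JobControl control = new JobControl("two-pass-chain");
    control.addJob(job1);
    control.addJob(job2);

    // JobControl is a Runnable: drive it from a thread and poll until done.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(500);
    }
    control.stop();
  }
}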

Arun

> Qiang


Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

Posted by Amar Kamat <am...@yahoo-inc.com>.
Output of every MapReduce job in Hadoop gets stored in the DFS, i.e. made
visible. You can run back-to-back jobs (i.e. job chaining), but the output
won't be temporary. Look at Grep.java, as Hairong suggested, for more
details on job chaining. As of now there is no support for job pipelining
in Hadoop. Pig [http://incubator.apache.org/pig/], on the other hand,
implicitly does job pipelining. But for smaller and simpler pipelines you
could do manual chaining. It depends on the kind of pipelining one requires.
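
If the intermediate directory really is throwaway, one option is simply to
delete it from HDFS once the second job has finished. A minimal sketch,
with a made-up path and class name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupIntermediate {
  public static void main(String[] args) throws Exception {
    // Directory written by the first job and read by the second (made up).
    Path intermediate = new Path("/user/qiang/pass1-out");
    FileSystem fs = intermediate.getFileSystem(new Configuration());
    fs.delete(intermediate);   // remove the intermediate data after pass 2
  }
}
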
Amar
ma qiang wrote:
> Hi all:
>      Here I have two MapReduce programs. I need to use the result of the
> first MapReduce program to compute other values, which are generated in
> the second MapReduce program, and this intermediate result does not need
> to be saved. So I want to run the second MapReduce program automatically,
> using the output of the first MapReduce program as the input of the second
> MapReduce program. Who can tell me how?
>      Thanks!
>      Best Wishes!
>
> Qiang
>   


Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

Posted by Paco NATHAN <ce...@gmail.com>.
Hi Qiang,

Here is what I understand:

Pass 1
   - generate "intermediate dataset" as output from its reduce phase

Pass 2
   - take "intermediate dataset" as input
   - produce some result (an aggregate?)
   - no need to persist the "intermediate dataset"


Would it be possible to collapse this into one map/reduce?

Is there a problem with the size of the intermediate dataset? If not,
then it could simply be deleted after the second pass.

In terms of executing two map/reduce jobs sequentially in the same
run, each needs its own JobConf instance and a different job name.

An example is in the "jyte" directory of
   http://code.google.com/p/ceteri-mapred/

A brief description of that code and its passes through the jyte data is at:
   http://ceteri.blogspot.com/2008/02/hadoop-part-2-jyte-cred-graph.html


Hope that helps,
Paco


On Wed, Feb 20, 2008 at 10:21 PM, ma qiang <ma...@gmail.com> wrote:
> Hi all:
>      Here I have two MapReduce programs. I need to use the result of the
>  first MapReduce program to compute other values, which are generated in
>  the second MapReduce program, and this intermediate result does not need
>  to be saved. So I want to run the second MapReduce program automatically,
>  using the output of the first MapReduce program as the input of the second
>  MapReduce program. Who can tell me how?
>      Thanks!
>      Best Wishes!
>
>  Qiang
>

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
Take a look at Grep.java under src/examples/org/apache/hadoop/examples. It
first runs a grep job and then a sort job.

Hairong


On 2/20/08 10:21 PM, "ma qiang" <ma...@gmail.com> wrote:

> Hi all:
>      Here I have two MapReduce programs. I need to use the result of the
> first MapReduce program to compute other values, which are generated in
> the second MapReduce program, and this intermediate result does not need
> to be saved. So I want to run the second MapReduce program automatically,
> using the output of the first MapReduce program as the input of the second
> MapReduce program. Who can tell me how?
>      Thanks!
>      Best Wishes!
> 
> Qiang