Posted to common-user@hadoop.apache.org by ma qiang <ma...@gmail.com> on 2008/02/21 07:21:13 UTC
how to set the result of the first mapreduce program as the input of the second mapreduce program?
Hi all:
I have two MapReduce programs. I need to use the result of the first
program to compute further values in the second program. The
intermediate result does not need to be saved, so I want the second
program to run automatically, using the output of the first program as
its input. Can anyone tell me how?
Thanks!
Best Wishes!
Qiang
Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?
Posted by Arun C Murthy <ac...@yahoo-inc.com>.
The output of the first job goes to HDFS (the output dir you
specified for your job). Use that same directory as the input dir for
your next job.
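As a rough illustration of this hand-off, here is a sketch in plain Java that uses the local filesystem as a stand-in for HDFS. It does not call the Hadoop API at all; the class and method names (ChainedPasses, countWords, maxCount) and the part-00000 file name are assumptions made up for the example. It only shows the second pass reading the directory that the first pass wrote.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for two chained jobs; not Hadoop code.
final class ChainedPasses {

    // Pass 1: count words in inputFile and write "word<TAB>count" lines
    // to outputDir/part-00000.
    static void countWords(Path inputFile, Path outputDir) {
        try {
            Map<String, Integer> counts = new TreeMap<>();
            for (String line : Files.readAllLines(inputFile)) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
            List<String> lines = new ArrayList<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                lines.add(e.getKey() + "\t" + e.getValue());
            }
            Files.createDirectories(outputDir);
            Files.write(outputDir.resolve("part-00000"), lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pass 2: read every file in inputDir (the output dir of pass 1)
    // and return the largest count found.
    static int maxCount(Path inputDir) {
        int max = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(inputDir)) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part)) {
                    int count = Integer.parseInt(line.substring(line.indexOf('\t') + 1));
                    max = Math.max(max, count);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return max;
    }

    // Usage: java ChainedPasses <input-file> <intermediate-dir>
    public static void main(String[] args) {
        Path input = Paths.get(args[0]);
        Path intermediate = Paths.get(args[1]);
        countWords(input, intermediate);           // pass 1 writes here
        System.out.println(maxCount(intermediate)); // pass 2 reads the same dir
    }
}
```

The only coupling between the two passes is the directory path, which is the point of the advice above.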
There is a 'jobcontrol' api which you can use to chain jobs:
http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
Arun
Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?
Posted by Amar Kamat <am...@yahoo-inc.com>.
Output of every MapReduce job in Hadoop gets stored in the DFS, i.e., made
visible. You can run back-to-back jobs (i.e., job chaining), but the output
won't be temporary. Look at Grep.java, as Hairong suggested, for more
details on job chaining. As of now there is no support for automatic job
chaining in Hadoop itself. Pig [http://incubator.apache.org/pig/], on the
other hand, implicitly does job pipelining. But for smaller, simpler
pipelines you could do manual chaining. It depends on the kind of
pipelining one requires.
Amar
Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?
Posted by Paco NATHAN <ce...@gmail.com>.
Hi Qiang,
Here is what I understand:
Pass 1
- generate "intermediate dataset" as output from its reduce phase
Pass 2
- take "intermediate dataset" as input
- produce some result (an aggregate?)
- no need to persist the "intermediate dataset"
Would it be possible to collapse this into one map/reduce?
Is there a problem with the size of the intermediate dataset? If not,
then it could simply be deleted after the second pass.
In terms of executing two map/reduce jobs sequentially in the same
run, each job needs its own JobConf instance and its own distinct
job name.
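A minimal sketch of that sequence, again in plain Java on the local filesystem rather than against the Hadoop API: run pass 1 into a scratch directory, run pass 2 on it, then delete the scratch directory once pass 2 has finished. Everything named here (TwoPassDriver, Pass1, Pass2, runChained, deleteRecursively) is hypothetical and only illustrates the control flow.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical driver for two sequential passes; not Hadoop code.
final class TwoPassDriver {

    // Pass 1 reads the input and writes files into the intermediate dir.
    interface Pass1 {
        void run(Path input, Path intermediate) throws IOException;
    }

    // Pass 2 reads the intermediate dir and produces the final result.
    interface Pass2<R> {
        R run(Path intermediate) throws IOException;
    }

    // Runs both passes in order and always removes the intermediate dir.
    static <R> R runChained(Path input, Pass1 first, Pass2<R> second) {
        Path intermediate = null;
        try {
            intermediate = Files.createTempDirectory("pass1-out");
            first.run(input, intermediate);
            return second.run(intermediate);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (intermediate != null) {
                deleteRecursively(intermediate);
            }
        }
    }

    // Deletes children before parents by visiting paths in reverse order.
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Putting the cleanup in a finally block means the intermediate data is removed even if the second pass fails, which may or may not be what you want while debugging.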
An example is in the "jyte" directory of
http://code.google.com/p/ceteri-mapred/
A brief description of that code and its passes through the jyte data is at:
http://ceteri.blogspot.com/2008/02/hadoop-part-2-jyte-cred-graph.html
Hope that helps,
Paco
Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?
Posted by Hairong Kuang <ha...@yahoo-inc.com>.
Take a look at Grep.java under src/examples/org/apache/hadoop/examples. It
first runs a grep job and then a sort job.
Hairong