Posted to mapreduce-user@hadoop.apache.org by John Sanda <jo...@gmail.com> on 2011/03/03 03:21:55 UTC

using output from one job as input to another

Hi, I am new to Hadoop, so maybe I am missing something obvious. I have
written a small MapReduce program that runs two jobs, and I want the
output of the first job to serve as the input to the second job. Here is
what my driver code looks like:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = new Job(conf, "Job One");
    job.setJarByClass(CountCitations.class);

    Path in = new Path(args[0]);
    Path out1 = new Path("jobOneOutput");

    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out1);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Run job one to completion and stop here if it fails,
    // since job two depends on its output.
    if (!job.waitForCompletion(true)) {
        return 1;
    }

    job = new Job(conf, "Job Two");
    job.setJarByClass(MyJob.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, out1);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MapCounts.class);
    job.setReducerClass(ReduceCounts.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Return the status instead of calling System.exit() inside run(),
    // so the caller can report the exit code.
    return job.waitForCompletion(true) ? 0 : 1;
}
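
For completeness, here is a minimal sketch of how I launch this driver,
assuming the run() method above lives in a CountCitations class that
implements Tool:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic Hadoop options (-D, -fs, -jt, ...)
    // before handing the remaining arguments to run().
    int exitCode = ToolRunner.run(new Configuration(), new CountCitations(), args);
    System.exit(exitCode);
}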

The output path created by the first job is a directory, and it is the
file in that directory with a name like part-r-00000 that I want to feed
as input into the second job. I am running in pseudo-distributed mode, so
I know that file name will be the same every run. But in true distributed
mode that file name will be different for each node. Moreover, when in
distributed mode, don't I want a uniform view of that output, which will
be spread across my cluster? Is there something wrong in my code? Or can
someone point me to some examples that do this?

Thanks

- John

Re: using output from one job as input to another

Posted by John Sanda <jo...@gmail.com>.
Thanks for the response. What I meant by "uniform view" is that I would
be able to avoid having to reference each individual part-r-xxxxx file.
It wasn't immediately clear to me that the directory itself could be the
input path. That tells me the problem(s) must be somewhere in my MR code.
Thanks!

-- 

- John

Re: using output from one job as input to another

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Thu, Mar 3, 2011 at 7:51 AM, John Sanda <jo...@gmail.com> wrote:
> The output path created by the first job is a directory, and it is the
> file in that directory with a name like part-r-00000 that I want to feed
> as input into the second job. I am running in pseudo-distributed mode,
> so I know that file name will be the same every run. But in true
> distributed mode that file name will be different for each node. Moreover,

The default filenames of many OutputFormats start with "part" and are
not node dependent. You will get filenames in out1 from part-r-00000 up
to part-r-{number of reduce tasks - 1} for your job.
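
If you ever do need to inspect those files, here is a quick sketch using
the standard FileSystem API (assuming conf is the job's Configuration):

FileSystem fs = FileSystem.get(conf);
for (FileStatus status : fs.listStatus(new Path("jobOneOutput"))) {
    // One part-r-NNNNN file per reduce task; entries whose names start
    // with "_" (such as _logs or _SUCCESS, if present) are framework
    // bookkeeping, not job output.
    System.out.println(status.getPath().getName());
}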

> when in distributed mode, don't I want a uniform view of that output,
> which will be spread across my cluster? Is there something wrong in my
> code? Or can someone point me to some examples that do this?

I do not understand what you mean by uniform view. Using a directory as
the input for a job is perfectly acceptable and a normal thing to do in
file-based MR. The directory forms the whole input, with its files
containing small "parts" of it. I do not see anything grossly wrong in
the code you provided.
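
To make that concrete, your second job can point at the directory itself;
a small sketch (jobTwo here stands in for your second Job object):

// The whole directory is a valid input; files whose names start with
// "_" or "." are skipped by FileInputFormat's default hidden-file filter.
FileInputFormat.setInputPaths(jobTwo, new Path("jobOneOutput"));

// Or, equivalently here, name the reducer outputs explicitly via a glob:
// FileInputFormat.setInputPaths(jobTwo, new Path("jobOneOutput/part-r-*"));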

-- 
Harsh J
www.harshj.com