
Posted to common-user@hadoop.apache.org by Kunsheng Chen <ke...@yahoo.com> on 2009/06/15 01:02:47 UTC

Could I collect results from Map-Reduce then output myself ?

Hi everyone,

I am writing a map-reduce program, and it is working well.

Now I am thinking of inserting my own algorithm to pick the output results after 'Reduce', rather than simply using 'output.collect()' in Reduce to output all results.

The only thing I can think of is to read the output file after JobClient finishes and process it with a separate Java program, but I am not sure whether Hadoop provides a more efficient way to handle that.


Any ideas are much appreciated,

-Kun

Re: Could I collect results from Map-Reduce then output myself ?

Posted by Aaron Kimball <aa...@cloudera.com>.

If you can make the decision locally, then it should just be performed in
the reducer itself:

if (guard) {
  output.collect(k, v);
}

If you need to know what results will be generated by other calls to
reduce() in the same reduce task, then you'll need to be a bit more clever. If
you know that for all jobs you'll run, your results will always fit in a
buffer in RAM, then you can put your values in an ArrayList or something and
then override close() in your reducer to dump your values into the output
collector. close() is not passed the collector, so save a reference to it in
reduce(). Then call super.close().
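
A minimal sketch of the buffer-then-close() pattern, written in plain Java so
it runs without Hadoop on the classpath. Collector is a hypothetical stand-in
for Hadoop's OutputCollector, and TopNReducer, its "keep the n largest sums"
rule, and demo() are assumptions made up for illustration; a real reducer
would extend MapReduceBase and implement Reducer instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class BufferedReducerSketch {
    /** Hypothetical stand-in for Hadoop's OutputCollector. */
    interface Collector<K, V> {
        void collect(K key, V value);
    }

    /** Sums each key's values, buffers the sums, and emits only the top n in close(). */
    static class TopNReducer {
        private final int n;
        private final List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        private Collector<String, Integer> saved;

        TopNReducer(int n) {
            this.n = n;
        }

        void reduce(String key, Iterator<Integer> values, Collector<String, Integer> output) {
            saved = output; // close() receives no collector, so remember this one
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            buffer.add(new SimpleEntry<>(key, sum)); // buffer instead of collecting now
        }

        void close() {
            if (saved == null) {
                return; // reduce() was never called, nothing to emit
            }
            // Largest sums first, then emit at most n of them.
            buffer.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
            for (Map.Entry<String, Integer> e : buffer.subList(0, Math.min(n, buffer.size()))) {
                saved.collect(e.getKey(), e.getValue());
            }
        }
    }

    /** Feeds three keys through the reducer and returns what close() emitted. */
    static List<String> demo() {
        List<String> out = new ArrayList<>();
        Collector<String, Integer> collector = (k, v) -> out.add(k + "=" + v);
        TopNReducer reducer = new TopNReducer(2);
        reducer.reduce("a", Arrays.asList(1, 2).iterator(), collector);
        reducer.reduce("b", Arrays.asList(4, 6).iterator(), collector);
        reducer.reduce("c", Arrays.asList(1).iterator(), collector);
        reducer.close();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Nothing is emitted until close(), so the selection sees every result produced
by that one reducer instance, and the whole buffer has to fit in memory.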

If you may need to generate more data than will fit in RAM, or you need to
compare results produced on multiple nodes, then you almost certainly want a
second MapReduce pass. Your first pass should collect() all the results it
generates. Then in a second pass, use an identity mapper and let the shuffle
sort the data by a key chosen so that the most desirable data comes first.
Then output.collect() this data a second time in the second reducer,
discarding the data that doesn't meet your criterion. The input path to your
second MR is the output path from the first one.
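
The second pass can be simulated in memory to show the idea; this is only a
sketch, not a Hadoop job. It assumes the first pass produced (score, word)
records, uses the negated score as the sort key so that ascending key order
puts the highest scores first, and behaves like a job with a single reduce
task. The names SecondPassSketch, secondPass and limit are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SecondPassSketch {
    /**
     * Simulates pass two. scores[i] and words[i] together form one record
     * from the first pass; at most `limit` records survive.
     */
    static List<String> secondPass(int[] scores, String[] words, int limit) {
        // "Identity map" plus "shuffle": group by key, keys kept in sorted order.
        TreeMap<Integer, List<String>> shuffled = new TreeMap<>();
        for (int i = 0; i < scores.length; i++) {
            shuffled.computeIfAbsent(-scores[i], k -> new ArrayList<>()).add(words[i]);
        }
        // "Reduce": collect records in key order, discard everything past the limit.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> group : shuffled.entrySet()) {
            for (String word : group.getValue()) {
                if (out.size() == limit) {
                    return out;
                }
                out.add(word + "\t" + (-group.getKey()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] scores = {3, 10, 1, 7};
        String[] words = {"a", "b", "c", "d"};
        System.out.println(secondPass(scores, words, 2));
    }
}
```

With more than one reduce task in the second job, each task would see only its
own partition of the keys, so a global cut-off like this one needs a single
reducer or some further merging step.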

- Aaron
