Posted to mapreduce-user@hadoop.apache.org by "Berry, Matt" <mw...@amazon.com> on 2012/07/19 18:22:12 UTC

OutputFormat Theory Question

From what I gather about how MapReduce operates, there isn't any functional difference between a single OutputFormat object being initialized on a central node and each reducer task initializing its own OutputFormat object. What I would like to know, however, is the relationship between the records that are passed to the OutputFormat from the reducers. Take the case of a sorting MapReduce job, where the mapper and reducer are both identity functions. In this setup, I would expect the records passed from the reducer to the OutputFormat to be sorted and to arrive in order.

A simplified version of my use case is to sort a large number of records and then write all the ones that start with A to a file named A, those that start with B to a file named B, etc. Because each file can only be opened for writing once, it is very important to know whether the records arrive at the OutputFormat in order, so that I know it is safe to close file A when I encounter a record that belongs in B.

Sincerely,
Matthew Berry

Re: OutputFormat Theory Question

Posted by Harsh J <ha...@cloudera.com>.
Matt,

Within each reduce task, the reducer's reduce(Key, <Values>) calls do
proceed in sorted key order. You can safely assume that once the next
reduce call begins, you will not get the previous Key again, and can
hence close your file. This is guaranteed by the framework's sort
phase, and several tests in MR land cover this.
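
To illustrate the close-on-key-change pattern described above, here is a minimal sketch in plain Java with no Hadoop dependencies. It is not the Hadoop OutputFormat API: the class name LetterFileSketch and the method writeByFirstLetter are hypothetical, and the input list stands in for the sorted stream of keys one reduce task would see. It also assumes that every record with a given first letter reaches the same reduce task, which depends on how the job partitions its keys.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LetterFileSketch {

    /**
     * Writes each record to a file named after its first character.
     * Assumes the records arrive in sorted order, as keys do within one
     * reduce task. Returns the file names in the order they were opened.
     */
    public static List<String> writeByFirstLetter(List<String> sortedRecords, Path dir)
            throws IOException {
        List<String> opened = new ArrayList<>();
        BufferedWriter out = null;
        String current = null;
        try {
            for (String record : sortedRecords) {
                if (record.isEmpty()) {
                    continue; // assumption for this sketch: empty records are skipped
                }
                String name = record.substring(0, 1);
                if (!name.equals(current)) {
                    // The first letter changed, so the previous file is complete.
                    if (out != null) {
                        out.close();
                    }
                    // CREATE_NEW fails if the file already exists, which
                    // surfaces any record that arrives out of order.
                    out = Files.newBufferedWriter(dir.resolve(name), StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
                    current = name;
                    opened.add(name);
                }
                out.write(record);
                out.newLine();
            }
        } finally {
            if (out != null) {
                out.close(); // closing an already closed writer has no effect
            }
        }
        return opened;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("letters");
        List<String> records = Arrays.asList("Apple", "Avocado", "Banana", "Cherry", "Cranberry");
        System.out.println(writeByFirstLetter(records, dir)); // prints [A, B, C]
    }
}
```

Opening each file with CREATE_NEW is a design choice for the sketch: if the ordering assumption were ever violated, the second attempt to open a file would throw instead of silently overwriting earlier output.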


-- 
Harsh J