You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Ryan LeCompte <le...@gmail.com> on 2008/09/06 18:35:02 UTC

Multiple output files

Hello,

I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I gauranteed that all of the day for a particular key will
appear in a single output file (e.g., part-0000), or is it possible
that the values could be split across multiple part-xxxx files? At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files? If so, does Hadoop
automatically grab all of the values for a particular key in all of
the output files?

Thanks,
Ryan

Re: Multiple output files

Posted by Ryan LeCompte <le...@gmail.com>.
This clears up my concerns. Thanks!

Ryan



On Sep 6, 2008, at 2:17 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:
>
>> I have a question regarding multiple output files that get produced  
>> as
>> a result of using multiple reduce tasks for a job (as opposed to only
>> one). If I'm using a custom writable and thus writing to a sequence
>> output, am I gauranteed that all of the day for a particular key will
>> appear in a single output file (e.g., part-0000), or is it possible
>> that the values could be split across multiple part-xxxx files?
>
> Each key will be processed by exactly one reduce. All of the keys to  
> each reduce will be sorted. The application can define a Partitioner  
> that picks the reduce for each key. The default one uses  
> key.hashCode() % numReduces, which is usually balanced. If your key  
> had both a date and time and you wanted to have all of the  
> transactions for a given day in the same reduce, you could do:
>
> class MyKey {
>  Date date;
>  Time time;
> }
>
> and use a partitioner like:
>
> public class MyPartitioner extends Partitioner {
>  public int getPartition(MyKey key, MyValue value,
>                          int numReduceTasks) {
>    return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
> }
>
> Of course the risk is that you may have very unbalanced reduce  
> sizes, depending on your data.
>
>> At the
>> end of the job I'm using the sequence file reader to read each custom
>> key/writable pair from each output file. Is it possible that the same
>> key could appear in multiple output files?
>
> No.
>
> -- Owen

Re: Multiple output files

Posted by Owen O'Malley <om...@apache.org>.
On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:

> I have a question regarding multiple output files that get produced as
> a result of using multiple reduce tasks for a job (as opposed to only
> one). If I'm using a custom writable and thus writing to a sequence
> output, am I gauranteed that all of the day for a particular key will
> appear in a single output file (e.g., part-0000), or is it possible
> that the values could be split across multiple part-xxxx files?

Each key will be processed by exactly one reduce. All of the keys to  
each reduce will be sorted. The application can define a Partitioner  
that picks the reduce for each key. The default one uses  
key.hashCode() % numReduces, which is usually balanced. If your key  
had both a date and time and you wanted to have all of the  
transactions for a given day in the same reduce, you could do:

class MyKey {
   Date date;
   Time time;
}

and use a partitioner like:

public class MyPartitioner extends Partitioner {
   public int getPartition(MyKey key, MyValue value,
                           int numReduceTasks) {
     return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
   }
}

Of course the risk is that you may have very unbalanced reduce sizes,  
depending on your data.

> At the
> end of the job I'm using the sequence file reader to read each custom
> key/writable pair from each output file. Is it possible that the same
> key could appear in multiple output files?

No.

-- Owen