You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Ryan LeCompte <le...@gmail.com> on 2008/09/06 18:35:02 UTC
Multiple output files
Hello,
I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I gauranteed that all of the day for a particular key will
appear in a single output file (e.g., part-0000), or is it possible
that the values could be split across multiple part-xxxx files? At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files? If so, does Hadoop
automatically grab all of the values for a particular key in all of
the output files?
Thanks,
Ryan
Re: Multiple output files
Posted by Ryan LeCompte <le...@gmail.com>.
This clears up my concerns. Thanks!
Ryan
On Sep 6, 2008, at 2:17 PM, Owen O'Malley <om...@apache.org> wrote:
>
> On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:
>
>> I have a question regarding multiple output files that get produced
>> as
>> a result of using multiple reduce tasks for a job (as opposed to only
>> one). If I'm using a custom writable and thus writing to a sequence
>> output, am I gauranteed that all of the day for a particular key will
>> appear in a single output file (e.g., part-0000), or is it possible
>> that the values could be split across multiple part-xxxx files?
>
> Each key will be processed by exactly one reduce. All of the keys to
> each reduce will be sorted. The application can define a Partitioner
> that picks the reduce for each key. The default one uses
> key.hashCode() % numReduces, which is usually balanced. If your key
> had both a date and time and you wanted to have all of the
> transactions for a given day in the same reduce, you could do:
>
> class MyKey {
> Date date;
> Time time;
> }
>
> and use a partitioner like:
>
> public class MyPartitioner extends Partitioner {
> public int getPartition(MyKey key, MyValue value,
> int numReduceTasks) {
> return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
> }
> }
>
> Of course the risk is that you may have very unbalanced reduce
> sizes, depending on your data.
>
>> At the
>> end of the job I'm using the sequence file reader to read each custom
>> key/writable pair from each output file. Is it possible that the same
>> key could appear in multiple output files?
>
> No.
>
> -- Owen
Re: Multiple output files
Posted by Owen O'Malley <om...@apache.org>.
On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:
> I have a question regarding multiple output files that get produced as
> a result of using multiple reduce tasks for a job (as opposed to only
> one). If I'm using a custom writable and thus writing to a sequence
> output, am I gauranteed that all of the day for a particular key will
> appear in a single output file (e.g., part-0000), or is it possible
> that the values could be split across multiple part-xxxx files?
Each key will be processed by exactly one reduce. All of the keys to
each reduce will be sorted. The application can define a Partitioner
that picks the reduce for each key. The default one uses
key.hashCode() % numReduces, which is usually balanced. If your key
had both a date and time and you wanted to have all of the
transactions for a given day in the same reduce, you could do:
class MyKey {
Date date;
Time time;
}
and use a partitioner like:
public class MyPartitioner extends Partitioner {
public int getPartition(MyKey key, MyValue value,
int numReduceTasks) {
return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
Of course the risk is that you may have very unbalanced reduce sizes,
depending on your data.
> At the
> end of the job I'm using the sequence file reader to read each custom
> key/writable pair from each output file. Is it possible that the same
> key could appear in multiple output files?
No.
-- Owen