Posted to common-user@hadoop.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/03/19 00:35:35 UTC

Partitioning reduce output by date

Hi,

What is the best/right way to handle partitioning of the final job output (i.e. output of reduce tasks)?  In my case, I am processing logs whose entries include dates (e.g. "2008-03-01    foo    bar    baz").  A single log file may contain a number of different dates, and I'd like to group reduce output by date so that, in the end, I have not a single part-xxxxx file but, say, 2008-03-01.txt, 2008-03-02.txt, and so on, one file for each distinct date.

If it helps, the keys in my job include the dates from the input logs, so I could parse the dates out of the keys in the reduce phase, if that's the thing to do.

I'm looking at OutputFormat and RecordWriter, but I'm not sure if that's the direction I should pursue.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Partitioning reduce output by date

Posted by Ted Dunning <td...@veoh.com>.

I think that a custom partitioner is half of the answer.  The other half is
that the reducer can open and close output files as needed.  With the
partitioner, only one file need be kept open at a time.  It is good practice
to open the files relative to the task directory so that process failure is
handled correctly.
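
A minimal sketch of the partitioning half, in plain Java so it runs without Hadoop on the classpath. It assumes each key begins with a 10-character date such as "2008-03-01"; the class and method names (DatePartitionSketch, dateOf, partitionFor) are hypothetical and not part of Hadoop. In a real job the same arithmetic would presumably live in the getPartition method of a class implementing Hadoop's Partitioner interface.

```java
class DatePartitionSketch {
    // Assumption: the key starts with a 10-character ISO date, e.g. "2008-03-01\tfoo".
    // Shorter keys are returned unchanged rather than throwing.
    static String dateOf(String key) {
        return key.length() >= 10 ? key.substring(0, 10) : key;
    }

    // Hash only the date portion, so every record for a given date
    // lands in the same partition (and therefore the same reduce task).
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative.
        return (dateOf(key).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"2008-03-01\tfoo", "2008-03-01\tbar", "2008-03-02\tbaz"};
        for (String k : keys) {
            System.out.println(dateOf(k) + " -> partition " + partitionFor(k, 4));
        }
    }
}
```

Note that with fewer reduce tasks than distinct dates, one reducer will still receive several dates, so the reducer has to switch output files when the date changes.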

These files are called task side effect files and are documented here:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Side-Effect+Files
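
The "open and close output files as needed" half might look roughly like the sketch below. It is an illustration under assumptions, not the Hadoop API: it writes to a local directory with java.nio as a stand-in for the task directory, whereas a real reducer would go through Hadoop's FileSystem classes. The class name PerDateWriterSketch and its write method are hypothetical. It relies on records arriving grouped by date, which holds when the keys are sorted and start with the date.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PerDateWriterSketch implements AutoCloseable {
    private final Path dir;       // stand-in for the task's output directory
    private String currentDate;   // date of the file currently open, or null
    private BufferedWriter out;   // the single open file, or null

    PerDateWriterSketch(Path dir) {
        this.dir = dir;
    }

    // Writes one line to <date>.txt, switching files when the date changes.
    void write(String date, String line) throws IOException {
        if (!date.equals(currentDate)) {
            close(); // at most one file is open at any time
            out = Files.newBufferedWriter(dir.resolve(date + ".txt"),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            currentDate = date;
        }
        out.write(line);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
            out = null;
        }
        currentDate = null;
    }
}
```

A reducer would call write(date, record) for each output record and close() once at the end of the task. Opening in append mode means an out-of-order date reopens its file instead of truncating it.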

On 3/18/08 5:17 PM, "Arun C Murthy" <ar...@yahoo-inc.com> wrote:

>> I have not a single part-xxxxx file but, say, 2008-03-01.txt,
>> 2008-03-02.txt, and so on, one file for each distinct date.
>> 
> 
> You want a custom partitioner...
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner

Re: Partitioning reduce output by date

Posted by Arun C Murthy <ar...@yahoo-inc.com>.

On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:

> Hi,
>
> What is the best/right way to handle partitioning of the final job  
> output (i.e. output of reduce tasks)?  In my case, I am processing  
> logs whose entries include dates (e.g. "2008-03-01    foo    bar     
> baz").  A single log file may contain a number of different dates,  
> and I'd like to group reduce output by date so that, in the end, I  
> have not a single part-xxxxx file but, say, 2008-03-01.txt,  
> 2008-03-02.txt, and so on, one file for each distinct date.
>

You want a custom partitioner...
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner

Arun

> If it helps, the keys in my job include the dates from the input  
> logs, so I could parse the dates out of the keys in the reduce  
> phase, if that's the thing to do.
>
> I'm looking at OutputFormat and RecordWriter, but I'm not sure if  
> that's the direction I should pursue.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>