Posted to mapreduce-user@hadoop.apache.org by Some Body <so...@squareplanet.de> on 2010/05/10 14:08:29 UTC

MultipleOutputs or Partitioner

Hi,

I'm trying to understand how to generate multiple outputs in my reducer (using 0.20.2+228).
Do I need MultipleOutputs, or should I partition my output in the mapper?

My reducer currently gets key/value input pairs like the following, all of which end up in my part-r-00000 file.

    hostA_VarX_2010-05-01_morning    <FLOATVAL>
    hostA_VarY_2010-05-01_morning    <FLOATVAL>
    hostA_VarX_2010-05-01_afternoon    <FLOATVAL>
    hostA_VarY_2010-05-01_afternoon    <FLOATVAL>
    .....
    hostB_VarX_2010-05-01_morning    <FLOATVAL>
    hostB_VarY_2010-05-01_morning    <FLOATVAL>
    hostB_VarX_2010-05-01_afternoon    <FLOATVAL>
    hostB_VarY_2010-05-01_afternoon    <FLOATVAL>
    .....
    hostA_VarX_2010-05-02_morning    <FLOATVAL>
    hostA_VarY_2010-05-02_morning    <FLOATVAL>
    hostA_VarX_2010-05-02_afternoon    <FLOATVAL>
    hostA_VarY_2010-05-02_afternoon    <FLOATVAL>
    .....
    hostB_VarX_2010-05-02_morning    <FLOATVAL>
    hostB_VarY_2010-05-02_morning    <FLOATVAL>
    hostB_VarX_2010-05-02_afternoon    <FLOATVAL>
    hostB_VarY_2010-05-02_afternoon    <FLOATVAL>
    .....

But instead of one output file, I want one output file per day/group, e.g.:
    2010-05-01_morning.txt
    2010-05-01_afternoon.txt

Each <date>_<time>.txt file would contain all keys/values for all hosts and VarNames.

Thanks,
Alan

Re: MultipleOutputs or Partitioner

Posted by Alex Kozlov <al...@cloudera.com>.

Hi Alan,

On Mon, May 10, 2010 at 5:08 AM, Some Body <so...@squareplanet.de> wrote:

> Hi,
>
> I'm trying to understand how to generate multiple outputs in my reducer
> (using 0.20.2+228).
> Do I need MultipleOutput or should I partition my output in the mapper?
>
>
The question is scalability.  If you are OK with running only 2 (or N)
reducers, "morning" and "afternoon", and they are approximately of the same
size, you should implement a custom partitioner.  However, this approach is
not scalable since you will always be stuck with a predefined number of
reducers.
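
As a rough illustration of the partitioning logic described above (a sketch, not code from this thread): the class name, the SLOTS list, and the partitionFor helper are hypothetical, and the key layout host_var_date_slot is assumed from the sample keys in the question. In a real job this logic would sit inside getPartition(key, value, numPartitions) of a custom Partitioner.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps a key such as "hostA_VarX_2010-05-01_morning"
// to a reducer index based on its trailing slot name.
final class SlotPartitionSketch {
    // Assumption: the set of slots is fixed and known in advance,
    // which is exactly what limits the scalability of this approach.
    static final List<String> SLOTS = Arrays.asList("morning", "afternoon");

    static int partitionFor(String key, int numPartitions) {
        // Everything after the last underscore is taken to be the slot.
        String slot = key.substring(key.lastIndexOf('_') + 1);
        int index = SLOTS.indexOf(slot);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown slot in key: " + key);
        }
        // Stay within the configured number of reducers.
        return index % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("hostA_VarX_2010-05-01_morning", 2));
        System.out.println(partitionFor("hostB_VarY_2010-05-02_afternoon", 2));
    }
}
```

Note that with this scheme every "morning" record from every date lands in the same reducer, so the output is split per slot, not per date and slot.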

A better approach is to leave the number of reducers flexible and merge the
resulting files afterwards, using 'hadoop fs -getmerge' or custom Java code.

Alex K


Re: MultipleOutputs or Partitioner

Posted by Sonal Goyal <so...@gmail.com>.

Hi Alan,

You can use MultipleOutputFormat. You can override the
generateFileName...methods to get the functionality you want.
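
A minimal sketch of the file-naming logic such an override could delegate to, assuming keys always end in _<date>_<time> as in the sample data. The class and method names here are hypothetical; in a real job the call would be made from an override of generateFileNameForKeyValue(key, value, name) in a subclass of MultipleTextOutputFormat (the old org.apache.hadoop.mapred API).

```java
// Hypothetical helper: derives the per-group output file name from a key
// such as "hostA_VarX_2010-05-01_morning".
final class DayGroupFileName {
    static String fileNameFor(String key) {
        String[] parts = key.split("_");
        // Assumption: at least host, variable, date, and time-of-day fields.
        if (parts.length < 4) {
            throw new IllegalArgumentException("Unexpected key layout: " + key);
        }
        // Use the last two fields so extra underscores in the host or
        // variable name do not shift the date and time fields.
        String date = parts[parts.length - 2];
        String timeOfDay = parts[parts.length - 1];
        return date + "_" + timeOfDay + ".txt";
    }

    public static void main(String[] args) {
        // prints "2010-05-01_morning.txt"
        System.out.println(fileNameFor("hostA_VarX_2010-05-01_morning"));
    }
}
```

If several reducers receive keys for the same group, each of them would write a file with this name, so the leaf name passed in by the framework may need to be appended to keep the files distinct.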

A partitioner controls how data moves from the mapper to the reducer, so if
you take that approach, you will have to set the number of reducers to the
number of files you want, which is not the best option if some days have
more data than others. You also don't have control over the file name.
See Tom White's "Hadoop: The Definitive Guide" for an excellent example and
usage.

Thanks and Regards,
Sonal
www.meghsoft.com
