Posted to mapreduce-user@hadoop.apache.org by Panagiotis Antonopoulos <an...@hotmail.com> on 2011/09/02 15:14:47 UTC

Problem when using MultipleOutputs with many files

Hello guys,

I am using hadoop-0.20.2-cdh3u0, and I use MultipleOutputs to split the HFiles that my MR job produces so that each file fits into one region of the table I am going to bulk load them into.

So I have one MultipleOutputs output per region, which gives 280 different outputs.
I just noticed that using this many outputs makes my job a lot slower than it is with a single output.

Do you know what is going wrong? Has anyone else seen this?

Thank you!
Panagiotis
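
For readers following along, here is a minimal sketch of the "one output per region" idea described above: given the sorted start keys of the table's regions, pick the output that a row key belongs to. The class name RegionOutputNamer, the "regionN" naming scheme, and the use of String keys are all assumptions made for illustration (HBase row keys are byte arrays, and the original job's code is not shown in this thread). It uses only the Java standard library, not the Hadoop or HBase APIs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Hadoop or HBase): maps a row key to the
// name of the output that holds the data for that key's region.
class RegionOutputNamer {
    // Region start key -> output name, kept sorted by start key.
    private final TreeMap<String, String> outputByStartKey = new TreeMap<>();

    // regionStartKeys is assumed to be sorted and to begin with "" (the
    // start key of the first region), so that every row key has a region.
    RegionOutputNamer(String[] regionStartKeys) {
        for (int i = 0; i < regionStartKeys.length; i++) {
            outputByStartKey.put(regionStartKeys[i], "region" + i);
        }
    }

    // A row belongs to the region with the greatest start key <= the row key.
    String outputFor(String rowKey) {
        Map.Entry<String, String> entry = outputByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalArgumentException("No region covers key: " + rowKey);
        }
        return entry.getValue();
    }
}
```

With start keys "", "g" and "p", the key "apple" maps to region0, "g" to region1 and "zebra" to region2. In a real job the returned name would be passed to whatever mechanism writes to the per-region output; with 280 regions that means 280 distinct names.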

RE: Problem when using MultipleOutputs with many files

Posted by Panagiotis Antonopoulos <an...@hotmail.com>.

Thanks both of you!

Harsh must be right.
The source of the HBase version that I use already seems to contain the changes described at https://issues.apache.org/jira/browse/MAPREDUCE-1853

Re: Problem when using MultipleOutputs with many files

Posted by Harsh J <ha...@cloudera.com>.

Hello David,

MAPREDUCE-1853 has already been back-ported into CDH3u0 [1], so it
shouldn't be the cause of Panagiotis's performance problem.

P.S. Please do not upgrade to the 0.21.x series in production, as it is
not deemed stable yet. This is noted on the Apache Hadoop website as
well.

[1] - http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.21.releasenotes.html

-- 
Harsh J

Re: Problem when using MultipleOutputs with many files

Posted by David Rosenstrauch <da...@darose.net>.

You're probably running into this bug, which crushes the performance of 
MultipleOutputs:

https://issues.apache.org/jira/browse/MAPREDUCE-1853

Apparently it's fixed in v0.21, so try to upgrade if you can.

I wasn't able to upgrade, however (we were also using Cloudera CDH,
which as you see is based on 0.20).  What I eventually did to work
around it was to use our own local copy of the MultipleOutputs class (I
called it BugFixMultipleOutputs_0_20), which I manually patched with the fix.

HTH,

DR
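
As far as I understand MAPREDUCE-1853, the slowdown it describes comes from an expensive per-output context object being rebuilt on every write instead of being reused. The sketch below shows only the general caching pattern such a fix relies on; it is not the actual Hadoop patch and not the contents of BugFixMultipleOutputs_0_20. The class name ContextCache, its methods, and the factory parameter are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of per-name caching; NOT the real Hadoop code.
// C stands in for whatever expensive object is needed for each named output.
class ContextCache<C> {
    private final Map<String, C> cache = new HashMap<>();
    private final Function<String, C> factory;
    private int builds = 0; // how many times the expensive factory ran

    ContextCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the cached object for this output name, building it only on
    // the first request for that name.
    C get(String namedOutput) {
        C context = cache.get(namedOutput);
        if (context == null) {
            context = factory.apply(namedOutput);
            builds++;
            cache.put(namedOutput, context);
        }
        return context;
    }

    int builds() {
        return builds;
    }
}
```

With 280 named outputs, the factory runs at most 280 times per task no matter how many records are written, rather than once per record. The sketch is not thread-safe, which is assumed to be acceptable when each task writes from a single thread.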