Posted to common-dev@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2011/12/24 15:23:33 UTC

Gzip progress during map phase.

Hi,

I noticed that the mapper progress indication in the Hadoop CDH3
distribution jumps from 0% to 100% for each gzipped input file. So when
running with big gzipped input files, the job appears to be stuck.

I was unable to find a JIRA issue that describes this effect.
Before I dive into this, I have a few questions for you:
1) Is this a known effect for the 0.20 version? If so, what is the JIRA
issue?
2) Is this specific to gzip?
3) Is this effect still present in the MRv2/YARN version of Hadoop?

Thanks.
-- 
Kind regards,
Niels Basjes
(Sent from mobile)

Re: Gzip progress during map phase.

Posted by Niels Basjes <Ni...@basjes.nl>.
Yes, this is what I was looking for.
Thanks.

-- 
Kind regards,
Niels Basjes
(Sent from mobile)
On 27 Dec 2011 12:08, "Koji Noguchi" <kn...@yahoo-inc.com> wrote:

> Assuming you're using TextInputFormat, it sounds like
> https://issues.apache.org/jira/browse/MAPREDUCE-773
> That one is in 0.21. I don't know about CDH.
>
> Koji

Re: Gzip progress during map phase.

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Assuming you're using TextInputFormat, it sounds like
https://issues.apache.org/jira/browse/MAPREDUCE-773
That one is in 0.21. I don't know about CDH.

Koji
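
For readers wondering how progress can be reported at all for a compressed
input, here is a minimal sketch of one common approach: count the compressed
bytes pulled from the underlying file and divide by the compressed file
length. This is not the code from MAPREDUCE-773 and it uses no Hadoop
classes. The class and method names (GzipProgressSketch,
CountingInputStream, progressSamples) are made up for illustration, and
only java.util.zip from the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipProgressSketch {

    /** Hypothetical helper: counts the raw (compressed) bytes read from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        private long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long getCount() {
            return count;
        }
    }

    /** Gzips a byte array in memory (only needed to build demo input). */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /**
     * Decompresses gz and records a progress value after every read, where
     * progress = compressed bytes consumed so far / compressed length.
     * The value runs slightly ahead because the inflater reads in buffered chunks.
     */
    static List<Float> progressSamples(byte[] gz, int bufferSize) {
        List<Float> samples = new ArrayList<>();
        CountingInputStream counter = new CountingInputStream(new ByteArrayInputStream(gz));
        try (GZIPInputStream in = new GZIPInputStream(counter)) {
            byte[] buf = new byte[bufferSize];
            while (in.read(buf) != -1) {
                samples.add(Math.min(1.0f, counter.getCount() / (float) gz.length));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        samples.add(1.0f); // end of stream reached
        return samples;
    }

    public static void main(String[] args) {
        byte[] raw = new byte[4 * 1024 * 1024];
        new Random(42).nextBytes(raw); // random data, so the gzip output stays large
        List<Float> samples = progressSamples(gzip(raw), 64 * 1024);
        int step = Math.max(1, samples.size() / 10);
        for (int i = 0; i < samples.size(); i += step) {
            System.out.printf("read %d: %.2f%n", i, samples.get(i));
        }
    }
}
```

The point of measuring on the compressed side is that the compressed length
is known up front, while the decompressed length is not known until the
whole stream has been read.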


On 12/27/11 2:00 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

> I would not expect this. I would expect behaviour that is independent of
> the way the splits are created.


Re: Gzip progress during map phase.

Posted by Niels Basjes <Ni...@basjes.nl>.
I would not expect this. I would expect behaviour that is independent of
the way the splits are created.

-- 
Kind regards,
Niels Basjes
(Sent from mobile)
On 26 Dec 2011 07:57, "Anthony Urso" <an...@cs.ucla.edu> wrote:

> Gzip files (unlike uncompressed files) are not splittable, which may be
> causing the behavior that you described.

Re: Gzip progress during map phase.

Posted by Anthony Urso <an...@cs.ucla.edu>.
Gzip files (unlike uncompressed files) are not splittable, which may be
causing the behavior that you described.
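
A small sketch of what "not splittable" means in practice, using only
java.util.zip from the JDK and no Hadoop classes. The class and method
names (GzipSplitSketch, sampleGzip, canReadFrom) are made up for
illustration. It only shows the most visible symptom: a reader that starts
at an arbitrary offset finds no gzip header. The deeper reason is that the
deflate stream inside has no markers a reader could resynchronize on, so a
gzipped file has to be handed to a single mapper as one split.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitSketch {

    /** Builds a gzip file in memory from many short text lines. */
    static byte[] sampleGzip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            for (int i = 0; i < 100_000; i++) {
                out.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Returns true if the bytes from offset onward decompress cleanly to end of stream. */
    static boolean canReadFrom(byte[] gz, int offset) {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // discard the decompressed data
            }
            return true;
        } catch (IOException e) {
            return false; // no gzip header at this offset, or the data does not inflate
        }
    }

    public static void main(String[] args) {
        byte[] gz = sampleGzip();
        System.out.println("readable from offset 0:     " + canReadFrom(gz, 0));
        System.out.println("readable from the midpoint: " + canReadFrom(gz, gz.length / 2));
    }
}
```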
On Dec 24, 2011 6:24 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

> Hi,
>
> I noticed that the mapper progress indication in the hadoop cdh3
> distribution jumps from 0% to 100% for each gzipped input file. So when
> running with big gzipped input files the job appears to be stuck.
>
> I was unable to find a jira issue that describes this effect.
> Before I dive into this I have a few questions to you guys:
> 1) is this a known effect for the 0.20 version? If so what is the jira
> issue?
> 2) is this specific to gzip?
> 3) is this effect still present in the MRv2/yarn version of Hadoop?
>
> Thanks.
> --
> Kind regards,
> Niels Basjes
> (Sent from mobile)
>