Posted to user@pig.apache.org by Tomas Hudik <xh...@gmail.com> on 2015/05/05 12:51:35 UTC

concatenated gzip/bzip in Pig 0.11 and higher

Hi,
I read a section:
https://pig.apache.org/docs/r0.11.1/func.html#handling-compression

according to which any concatenated bzip/gzip files will produce strange
results.
I ran a test: I concatenated some files and processed them. However, all the
results were identical to the ones produced from the non-concatenated
files. Why? They should be different...

Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835

My questions:
1. Is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
still correct, i.e. will concatenation produce wrong results? Is this true
for all concatenated files, or does it happen only occasionally?
2. Is there any way to find out whether a tar.gz or tar.bz2 file is
concatenated?
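
On question 2, one possible check (a sketch, not something taken from the
Pig or Hadoop code base) is to decode the file one compressed stream at a
time and count how many streams it contains. A count above one means the
file was concatenated at the compression layer. The function names below
are made up for illustration. The sketch uses only Python's standard zlib
and bz2 modules, reads the whole input into memory, and lets the library
raise its own error if the bytes after a stream are not another valid
stream.

```python
import bz2
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members stored back to back in `data`."""
    members = 0
    while data:
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decoder.decompress(data)  # decoded output is discarded
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Whatever follows the finished member is the next candidate.
        data = decoder.unused_data
    return members


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams stored back to back in `data`."""
    streams = 0
    while data:
        # BZ2Decompressor handles exactly one stream per instance.
        decoder = bz2.BZ2Decompressor()
        decoder.decompress(data)  # decoded output is discarded
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decoder.unused_data
    return streams


if __name__ == "__main__":
    # Simulate `cat a.gz b.gz > joined.gz` in memory.
    joined = gzip.compress(b"first\n") + gzip.compress(b"second\n")
    print(count_gzip_members(joined))
```

For a file on disk, pass the bytes read from it in binary mode. The check
looks only at the gzip/bzip2 layer; it says nothing about how many tar
archives are inside.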

Re: concatenated gzip/bzip in Pig 0.11 and higher

Posted by Tomas Hudik <xh...@gmail.com>.
Splittable bzip is supported by Hadoop since version 0.21
(https://issues.apache.org/jira/browse/HADOOP-4012).
I have already opened a Jira ticket for handling concatenated bzip/gzip
files: https://issues.apache.org/jira/browse/PIG-4533.
It seems that:
1. If bzip files were left to Hadoop to process, we would be fine.
2. (If 1 is true) the documentation needs to be improved (delete the part
about "handling-compression" or mark it obsolete).

On Mon, May 18, 2015 at 9:51 PM, Daniel Dai <da...@hortonworks.com> wrote:

> I am not entirely sure, but it seems Hadoop did not support splittable
> bzip initially, so Pig implemented its own.

Re: concatenated gzip/bzip in Pig 0.11 and higher

Posted by Daniel Dai <da...@hortonworks.com>.
I am not entirely sure, but it seems Hadoop did not support splittable
bzip initially, so Pig implemented its own.


On 5/18/15, 12:43 AM, "Tomas Hudik" <xh...@gmail.com> wrote:

>Thank you, Daniel.
>
>Follow-up question: is there any reason why bzip is processed by Pig but
>gzip is processed by Hadoop?
>
>Thanks, Tomas


Re: concatenated gzip/bzip in Pig 0.11 and higher

Posted by Tomas Hudik <xh...@gmail.com>.
Thank you, Daniel.

Follow-up question: is there any reason why bzip is processed by Pig but
gzip is processed by Hadoop?

Thanks, Tomas

On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <da...@hortonworks.com> wrote:

> The decompression of gzip happens on the Hadoop side (TextInputFormat), so
> if Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip,
> however, is processed by Pig code, which does not support concatenation.
>
> It seems we need to update the documentation.
>
> Daniel

Re: concatenated gzip/bzip in Pig 0.11 and higher

Posted by Daniel Dai <da...@hortonworks.com>.
The decompression of gzip happens on the Hadoop side (TextInputFormat), so
if Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip,
however, is processed by Pig code, which does not support concatenation.
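
To make "does not support concatenation" concrete, here is a small Python
illustration of the general failure mode. It is not Pig's or Hadoop's
code, and whether Pig's bzip reader fails in exactly this way is an
assumption. A decoder built for a single stream stops at the end of the
first stream and silently drops the rest, while a multi-stream decoder
returns everything. The function names are hypothetical.

```python
import bz2


def read_first_stream_only(data: bytes) -> bytes:
    """Decode like a reader that stops after the first bzip2 stream."""
    decoder = bz2.BZ2Decompressor()
    # Bytes after the first stream end up in decoder.unused_data and are
    # never decoded here.
    return decoder.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every stream; bz2.decompress accepts multi-stream input."""
    return bz2.decompress(data)


if __name__ == "__main__":
    # Simulate `cat a.bz2 b.bz2 > joined.bz2` in memory.
    joined = bz2.compress(b"row1\n") + bz2.compress(b"row2\n")
    print(read_first_stream_only(joined))
    print(read_all_streams(joined))
```

Comparing record counts from the two readers on the same input is one way
to tell whether a given tool silently truncates concatenated files.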

It seems we need to update the documentation.

Daniel

On 5/5/15, 3:51 AM, "Tomas Hudik" <xh...@gmail.com> wrote:

>Hi,
>I read a section:
>https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>
>according to which any concatenated bzip/gzip files will produce strange
>results.
>I ran a test: I concatenated some files and processed them. However, all the
>results were identical to the ones produced from the non-concatenated
>files. Why? They should be different...
>
>Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
>
>My questions:
>1. Is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>still correct, i.e. will concatenation produce wrong results? Is this true
>for all concatenated files, or does it happen only occasionally?
>2. Is there any way to find out whether a tar.gz or tar.bz2 file is
>concatenated?