Posted to dev@pig.apache.org by "Tomas Hudik (JIRA)" <ji...@apache.org> on 2015/06/05 10:05:00 UTC

[jira] [Updated] (PIG-4533) support of concatenated bz2/gz files

     [ https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomas Hudik updated PIG-4533:
-----------------------------
    Description: 
Documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED FILES generated in this manner: ..."_

This is not true for tar.gz, since
# I ran a test: I concatenated and compressed some files and processed them, then did the same with the raw (uncompressed) files. The results were identical.
# The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively. So Hadoop (1 and 2) already supports this.

Pig itself handles only tar.bz2 (tar.gz is handled by hadoop-common).
Therefore,
# tar.bz2 should be handled by hadoop-common as well (Pig no longer needs to handle it). (I believe https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be removed.)
# the documentation should be corrected accordingly (concatenated tar.gz and tar.bz2 files are processed correctly)
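
The comparison described in the first list above (concatenated compressed input versus the raw input) can be sketched outside Pig. This is only a minimal Python illustration, not the reporter's actual test: the sample records and variable names are made up, it works on concatenated gzip/bzip2 streams rather than tar archives, and it exercises Python's gzip and bz2 modules, not Hadoop's codecs. It shows what "reads concatenated files correctly" means: decompressing the joined streams yields the same bytes as joining the uncompressed inputs.

```python
import bz2
import gzip

# Hypothetical sample records standing in for two small input files.
part_a = b"1\talice\n2\tbob\n"
part_b = b"3\tcarol\n"

# Compress each part separately, then join the compressed bytes.
# This is what concatenating two .gz (or two .bz2) files produces.
concat_gz = gzip.compress(part_a) + gzip.compress(part_b)
concat_bz2 = bz2.compress(part_a) + bz2.compress(part_b)

# A reader that handles multi-member input returns the data of every
# member, so the output should equal the uncompressed parts joined.
raw = part_a + part_b
gz_ok = gzip.decompress(concat_gz) == raw
bz2_ok = bz2.decompress(concat_bz2) == raw

print("gzip matches raw:", gz_ok)
print("bzip2 matches raw:", bz2_ok)
```

A reader that stops after the first member would return only the first part's records, which is the failure mode the documentation note warns about.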





  was:
Documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED FILES generated in this manner: ..."_

I doubt this is still true, since
1. I ran a test: I concatenated some files and processed them. However, all the
results were identical to the ones produced from the non-concatenated
files. Why? They should be different...
2. The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 say this was fixed in Hadoop 0.22 and Hadoop 0.20 respectively. So Hadoop (1 and 2) supports this. I suppose Pig does not do compression on its own but rather depends on the hadoop-core (or hadoop-common) libraries.

If I'm right, the documentation should be fixed (delete the part about problems with concatenated compressed files)







> support of concatenated bz2/gz files
> ------------------------------------
>
>                 Key: PIG-4533
>                 URL: https://issues.apache.org/jira/browse/PIG-4533
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation, parser
>            Reporter: Tomas Hudik
>             Fix For: 0.16.0
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)