Posted to dev@pig.apache.org by "Sam Pullara (JIRA)" <ji...@apache.org> on 2007/12/02 00:01:43 UTC

[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547528 ] 

Sam Pullara commented on PIG-42:
--------------------------------

Is there any reason you decided to use empty files rather than the gzip ID as the sync signature?  It seems like it would be better if people could generate these files themselves easily, without using Pig at all.  Each gzipped file in the concatenation will start with "1F 8B 08 08" [1] if you use this mechanism to create them:

gzip -c test1 test2 > test.gz     [2]

In the few cases where that byte sequence turns up inside compressed data rather than at the start of a gzipped file, you will get an exception from your gzip stream and you can try again at the next candidate boundary.

[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] man gzip
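
A rough sketch of the scan-and-retry idea above, in Python rather than Pig's Java to keep it short. It is only an illustration under stated assumptions, not what gzip.patch does: make_member, read_member and read_from_split are hypothetical helper names, and the whole file is held in memory for simplicity.

```python
import gzip
import io
import zlib

# Header bytes quoted above: ID1=1F, ID2=8B, CM=08 (deflate),
# FLG=08 (FNAME set, i.e. the original file name is stored).
MAGIC = b"\x1f\x8b\x08\x08"


def make_member(name, data):
    """Build one gzip member that stores a file name (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=0) as f:
        f.write(data)
    return buf.getvalue()


def read_member(blob, offset):
    """Decompress exactly one gzip member starting at offset."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = d.decompress(blob[offset:])
    if not d.eof:
        # The stream ended before a complete member was seen.
        raise zlib.error("truncated or invalid gzip member")
    return out


def read_from_split(blob, start):
    """Find the first real member at or after start; skip false matches."""
    pos = blob.find(MAGIC, start)
    while pos != -1:
        try:
            return pos, read_member(blob, pos)
        except zlib.error:
            # The bytes matched by chance; try the next candidate boundary.
            pos = blob.find(MAGIC, pos + 1)
    raise ValueError("no gzip member found at or after offset %d" % start)


# Two members concatenated, comparable to: gzip -c test1 test2 > test.gz
m1 = make_member("test1", b"first file\n" * 100)
m2 = make_member("test2", b"second file\n" * 100)
blob = m1 + m2

# A split that begins one byte into the file picks up at the second member.
offset, data = read_from_split(blob, 1)
print(offset == len(m1), len(data))
```

The CRC and length check in the gzip trailer is what makes a chance match fail, so a false boundary costs only a wasted decompression attempt.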



> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>         Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files. Unfortunately, we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When gzipped files are concatenated together they are treated as a single file. So to make a gzipped file splittable we can use an empty compressed file with some salt in the headers as a sync signature. Then we can make the gzip file splittable by using this sync signature between compressed segments of the file.
