Posted to dev@pig.apache.org by "Benjamin Reed (JIRA)" <ji...@apache.org> on 2007/12/01 23:16:43 UTC

[jira] Created: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Pig should be able to split Gzip files like it can split Bzip files
-------------------------------------------------------------------

                 Key: PIG-42
                 URL: https://issues.apache.org/jira/browse/PIG-42
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Benjamin Reed


It would be nice to be able to split gzip files like we can split bzip files. Unfortunately, we don't have a sync point for the split in the gzip format.

The gzip file format supports concatenated gzipped files: when gzipped files are concatenated together, they are treated as a single file. So to make a gzipped file splittable, we can use an empty compressed file with some salt in the headers as a sync signature. We can then make the gzip file splittable by placing this sync signature between compressed segments of the file.
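
A minimal sketch of the idea, written in Python for brevity rather than in Pig's Java. The helper name make_sync_member and the salt value are hypothetical and are not taken from the patch; the sketch also assumes the salt lives in the optional gzip comment field, which is only one possible place in the header.

```python
import gzip
import struct

def make_sync_member(salt: bytes) -> bytes:
    """Build an empty gzip member whose comment field carries `salt`.

    Hypothetical helper: the byte layout follows RFC 1952, but the salt
    value and its placement in the comment field are assumptions.
    """
    header = struct.pack(
        "<2sBBIBB",
        b"\x1f\x8b",  # gzip magic number
        8,            # CM: deflate
        0x10,         # FLG: FCOMMENT set
        0,            # MTIME: not recorded
        0,            # XFL
        0xFF,         # OS: unknown
    )
    comment = salt + b"\x00"            # zero-terminated comment
    empty_deflate = b"\x03\x00"         # deflate stream holding no data
    trailer = struct.pack("<II", 0, 0)  # CRC32 and ISIZE of empty input
    return header + comment + empty_deflate + trailer

SYNC = make_sync_member(b"hypothetical-salt")

part0 = b"first segment\n"
part1 = b"second segment\n"

# One sync member in front of every compressed segment.
blob = SYNC + gzip.compress(part0) + SYNC + gzip.compress(part1)

# A standard gzip reader sees a single file: the empty members add no bytes.
restored = gzip.decompress(blob)
```

Because each sync member decompresses to nothing, existing gzip tools should read such a file exactly as they read any other concatenated gzip file.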

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549464 ] 

Benjamin Reed commented on PIG-42:
----------------------------------

There are two problems with just using an empty file.

1) The signature is just too small to reliably detect the split. Recovering from a misdetected split isn't as easy as retrying, because it usually means you get an OutOfMemoryError, or you may have already returned bad data.

2) You have to revert to relying on an extension to detect splittability. This ends up being pretty hokey because most gzip utilities are looking for a .gz extension. The splittable gzip format is completely compatible with existing gzip utilities. Also, if a user puts the wrong extension on a file, splits may not happen when they could, or we may try to split files that we cannot.

Plus it's really nice to be able to do a head file.gz and see right away whether the file is splittable or not.



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549486 ] 

Olga Natkovich commented on PIG-42:
-----------------------------------

Ben, how much testing did this code go through?



[jira] Assigned: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-42:
---------------------------------

    Assignee: Benjamin Reed



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549494 ] 

Benjamin Reed commented on PIG-42:
----------------------------------

The patch is not ready to commit yet. It's a work-in-progress patch. I talked to Utkarash about this and it's missing a termination of the split: currently each split will not terminate correctly. There is a termination hook that bzip uses that I need to latch onto.

Basically here are the things I need to add to finish:

1) Terminate split processing correctly
2) Add test cases
3) Encode block size as part of the header so that we can get almost "perfect" splits. (For example a file that is compressed as 128M blocks should not be split on 64M boundaries even if the block size of the filesystem is 128M.)
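
One way to picture items 1 and 3 is a simple ownership rule: a split processes only the compressed segments whose sync signature starts inside its byte range, and terminates at the first signature at or past its end. The sketch below is an assumption about how that could work, not a description of the patch; plan_splits and the offsets are made up for illustration.

```python
from bisect import bisect_left

def plan_splits(marker_offsets, file_length, split_size):
    """Assign each compressed segment to exactly one split.

    marker_offsets: sorted byte offsets at which a sync signature starts.
    Returns a list of (split_start, split_end, owned_marker_offsets).
    """
    plan = []
    for start in range(0, file_length, split_size):
        end = min(start + split_size, file_length)
        lo = bisect_left(marker_offsets, start)  # first marker >= start
        hi = bisect_left(marker_offsets, end)    # first marker >= end
        plan.append((start, end, marker_offsets[lo:hi]))
    return plan

markers = [0, 100, 250, 300]
plan = plan_splits(markers, file_length=400, split_size=128)
```

In this made-up layout the last split owns no signature and would do no work, which is the kind of mismatch that knowing the compressed block size up front could avoid.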

I'll try to get a committable patch this weekend.







[jira] Updated: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-42:
-----------------------------

    Attachment: gzip.patch

The attached patch implements the method of splitting GZipped files as outlined in the issue description. It uses the same hooks as BZip. We need to review to make sure it terminates properly.

If the gzipped file is not set up for splits, we fall back to not splitting the file.

An unsplittable gzipped dataset can be converted to a splittable one with the following Pig Latin:

a = load 'orig.gz';
store a into 'splittable.gz';



[jira] Updated: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-42:
------------------------------

    Patch Info:   (was: [Patch Available])

Cleared the patch available flag since this patch is not yet ready for review.



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557089#action_12557089 ] 

Olga Natkovich commented on PIG-42:
-----------------------------------

Looks like the Hadoop guys might do it soon: https://issues.apache.org/jira/browse/HADOOP-1824



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549337 ] 

Owen O'Malley commented on PIG-42:
----------------------------------

It seems a lot more friendly to define the format like:

{code}
% touch empty
% gzip -nc part0 empty part1 empty part2 empty part3 > big.sgz
{code}

That would let the user do:
{code}
% gzcat big.sgz
{code}

to get their file back. I'd also use filenames rather than a header to reflect whether a file is in this format, but that is mostly just a personal preference.



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629166#action_12629166 ] 

Tom White commented on PIG-42:
------------------------------

It would be nice if the format could be generated using standard tools. By modifying the gzip flag header so that it refers to the file name (which the gzip tool can set), rather than a comment (which it cannot) we can generate compatible files using the following:

{noformat}
touch -mt 197007130719.25 Split
gzip -c Split file1 Split file2 > file.gz
{noformat}

Then the first split file has the following hexdump:
{noformat}
hexdump -n 26 -C file.gz
00000000  1f 8b 08 08 6d ca fe 00  00 03 53 70 6c 69 74 00  |....m.....Split.|
00000010  03 00 00 00 00 00 00 00  00 00                    |..........|
0000001a
{noformat}

Note that the OS flag is 03 (Unix) rather than FF (unknown), but that should be OK as the code doesn't use it when searching for the signature.
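
To make the hexdump easier to read, here is a small sketch that decodes those 26 bytes field by field, using the member layout from the gzip specification (RFC 1952). The variable names are only for illustration.

```python
import gzip
import struct

# The 26 bytes shown in the hexdump above.
raw = bytes.fromhex(
    "1f 8b 08 08 6d ca fe 00 00 03 53 70 6c 69 74 00 "
    "03 00 00 00 00 00 00 00 00 00"
)

# Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS.
magic, method, flags, mtime, xfl, os_code = struct.unpack("<2sBBIBB", raw[:10])

FNAME = 0x08                         # flag bit: a file name follows the header
name_end = raw.index(b"\x00", 10)    # the name is zero-terminated
name = raw[10:name_end]

# Whatever follows the name: empty deflate stream, then CRC32 and ISIZE.
body = raw[name_end + 1:]
deflate, crc32, isize = body[:2], *struct.unpack("<II", body[2:])
```

The decoded values are magic 1f 8b, method 8 (deflate), flags 08 (FNAME), mtime 0x00FECA6D, OS 03 and name "Split", followed by an empty deflate stream with a zero CRC and a zero length.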




[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547882 ] 

Benjamin Reed commented on PIG-42:
----------------------------------

There are two reasons I use an empty file with a comment:

1) It allows me to test that a gzip file is in fact splittable. We need to know up front that we can split the gzip file. If the gzip isn't split at regular intervals, it's going to waste a lot of time! The signature is more than a marker; it is metadata that indicates that the file can be split. You will also notice that if you do 'head' on the file you can see that it is splittable.

2) It gives you a much more reliable signature. (20 bytes instead of 4)

You can still use standard tools without using Pig:

cat signature.gz > test.gz; gzip -c test1 >> test.gz; cat signature.gz >> test.gz; gzip -c test2 >> test.gz

You use standard gunzip to decompress. You can also easily find the split boundaries outside of pig by looking for the signature.gz sequence.
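
The same recipe can be sketched outside Pig in a few lines of Python. Everything named here is hypothetical: the SIGNATURE bytes stand in for signature.gz, and the salt, the use of the comment field, and the helper names are assumptions made for illustration, not the contents of the patch.

```python
import gzip
import struct

# Hypothetical stand-in for signature.gz: an empty gzip member with a salted comment.
SIGNATURE = (
    struct.pack("<2sBBIBB", b"\x1f\x8b", 8, 0x10, 0, 0, 0xFF)
    + b"hypothetical-salt\x00"   # FCOMMENT payload, zero-terminated
    + b"\x03\x00"                # empty deflate stream
    + struct.pack("<II", 0, 0)   # CRC32 and ISIZE of empty input
)

def is_splittable(data: bytes) -> bool:
    """Treat a file as splittable only if it starts with the signature."""
    return data.startswith(SIGNATURE)

def split_offsets(data: bytes) -> list:
    """Return every byte offset at which the signature occurs."""
    offsets, pos = [], data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + len(SIGNATURE))
    return offsets

def read_segments(data: bytes) -> list:
    """Decompress each signature-delimited segment on its own."""
    starts = split_offsets(data)
    ends = starts[1:] + [len(data)]
    return [gzip.decompress(data[s:e]) for s, e in zip(starts, ends)]

# Same shape as the cat/gzip command line above.
test_gz = (SIGNATURE + gzip.compress(b"test1\n")
           + SIGNATURE + gzip.compress(b"test2\n"))
plain_gz = gzip.compress(b"test1\ntest2\n")  # ordinary, unsplittable file
```

A reader that finds no leading signature, as with plain_gz here, can fall back to treating the file as a single unsplittable stream.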

This also allows you to better control the grouping. If your gzip file is bigger than 4G, it will be a concatenation, so there may be times when you want to process concatenated gzip files together without splitting. Using the empty signature file allows you to do that.

Now that I think about it more, it might also be good to reserve some bytes in the signature.gz for a block size. That way we can do intelligent splits when the fs blocksize doesn't correspond to the gzip blocksize or the number of requested splits is very high.



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Sam Pullara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549478 ] 

Sam Pullara commented on PIG-42:
--------------------------------

Ok, I'm convinced.  Ship it!



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

Posted by "Sam Pullara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547528 ] 

Sam Pullara commented on PIG-42:
--------------------------------

Is there any reason you decided not to use the gzip ID instead of empty files?  It seems like it would be better if people could generate these files themselves easily without using Pig at all.  Each gzip file will start with "1F 8B 08 08" [1] if you use this mechanism to create them:

gzip -c test1 test2 > test.gz     [2]

In the few times that it is wrong you will get an exception from your gzip stream and you can try again at the next boundary.

[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] man gzip


