Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2010/07/16 02:24:50 UTC

[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

    [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888972#action_12888972 ] 

Alan Gates commented on PIG-1501:
---------------------------------

Enabling compression directly on BinStorage as it stands would be a bad idea.  bzip2 is splittable but very slow, and gzip isn't splittable, so each gzipped file would have to be read by a single map task.

To do this we need to look at using SequenceFiles for moving data between MR jobs.  We can use a null key and a value type of Tuple, with SequenceFileInputFormat/SequenceFileOutputFormat.  This will let us use the block-level compression in sequence files, which keeps the data splittable.  For now we can continue with the same serialization used in BinStorage, though in the future we may want to change this as well.
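
To illustrate why block-level compression keeps data splittable, here is a minimal sketch using only java.util.zip.  It is not Hadoop's SequenceFile format or API, and the class and method names (BlockCompressionSketch, writeBlocks, readSplit) are hypothetical.  Plain strings stand in for serialized Tuples.  The idea is that each batch of records is compressed on its own, so a reader can start at any block boundary without decompressing what came before.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; this is not Hadoop's SequenceFile format.
final class BlockCompressionSketch {

    // Serialize a batch of records and deflate it as one self-contained block.
    static byte[] compressBlock(List<String> records) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(raw)) {
            out.writeInt(records.size());
            for (String record : records) {
                out.writeUTF(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw.toByteArray());
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            compressed.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return compressed.toByteArray();
    }

    // Inflate one block and decode its records; no other block is needed.
    static List<String> decompressBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                raw.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        List<String> records = new ArrayList<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(raw.toByteArray()))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                records.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Group records into fixed-size batches and compress each batch separately.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            blocks.add(compressBlock(records.subList(i, end)));
        }
        return blocks;
    }

    // Read only blocks [firstBlock, endBlock), as a task assigned that split would.
    static List<String> readSplit(List<byte[]> blocks, int firstBlock, int endBlock) {
        List<String> records = new ArrayList<>();
        for (int i = firstBlock; i < endBlock; i++) {
            records.addAll(decompressBlock(blocks.get(i)));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("(1,a)", "(2,b)", "(3,c)", "(4,d)", "(5,e)");
        List<byte[]> blocks = writeBlocks(records, 2);
        // Start at the second block without touching the first.
        System.out.println(readSplit(blocks, 1, blocks.size()));
    }
}
```

A single gzip stream over the whole file has no such independent entry points, which is why compressing BinStorage output directly with gzip would not split.  In this sketch the block boundaries are simply list elements; a real file format needs some way to locate them in a byte stream.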

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.
