Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2009/02/26 20:57:01 UTC

[jira] Created: (PIG-686) PERFORMANCE: improve how data is stored between M-R jobs and between Map and Reduce

PERFORMANCE: improve how data is stored between M-R jobs and between Map and Reduce
-----------------------------------------------------------------------------------

                 Key: PIG-686
                 URL: https://issues.apache.org/jira/browse/PIG-686
             Project: Pig
          Issue Type: Improvement
    Affects Versions: types_branch
            Reporter: Olga Natkovich
             Fix For: types_branch


Currently, there is quite a bit of overhead in how the data is serialized in both cases because type information is stored with each field.

However, most of the time the data has a known and consistent schema, in which case it is sufficient to store the schema once.

This change could significantly decrease the amount of intermediate data generated.
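
To illustrate the idea, here is a minimal sketch that compares tagging every field with its type against writing the schema once as a header. It is not Pig's actual serialization code: the one-byte type tags, the header layout, and the function names are all assumptions made up for this example.

```python
import struct

# Hypothetical one-byte type tags (illustrative only, not Pig's real type codes).
TAG_INT, TAG_DOUBLE = 0x01, 0x02
FORMATS = {TAG_INT: ">i", TAG_DOUBLE: ">d"}  # big-endian int32 / float64


def encode_tagged(rows, schema):
    """Write a type tag in front of every field of every row."""
    out = bytearray()
    for row in rows:
        for tag, value in zip(schema, row):
            out.append(tag)
            out += struct.pack(FORMATS[tag], value)
    return bytes(out)


def encode_schema_once(rows, schema):
    """Write the schema once (field count + tags), then untagged values."""
    out = bytearray([len(schema)]) + bytes(schema)
    for row in rows:
        for tag, value in zip(schema, row):
            out += struct.pack(FORMATS[tag], value)
    return bytes(out)


def decode_schema_once(data):
    """Read the header back, then use it to decode every row."""
    n = data[0]
    schema = list(data[1:1 + n])
    rows, pos = [], 1 + n
    while pos < len(data):
        row = []
        for tag in schema:
            fmt = FORMATS[tag]
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += struct.calcsize(fmt)
            row.append(value)
        rows.append(tuple(row))
    return schema, rows


schema = [TAG_INT, TAG_DOUBLE]
rows = [(i, i * 0.5) for i in range(1000)]
tagged = encode_tagged(rows, schema)
compact = encode_schema_once(rows, schema)
print(len(tagged), len(compact))
```

In this toy layout each row costs 14 bytes with per-field tags and 12 bytes without, plus a 3-byte header written once. The ratio depends entirely on the field types chosen here and says nothing about the savings achievable in Pig itself.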

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PIG-686) PERFORMANCE: improve how data is stored between M-R jobs and between Map and Reduce

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-686.
--------------------------------

    Resolution: Won't Fix

We have experimented with this work, and the performance gains (at most 5-7%) are not sufficient to justify the complexity it would add to the code. Hopefully, once we integrate with Avro, we will get the improvement.

> PERFORMANCE: improve how data is stored between M-R jobs and between Map and Reduce
> -----------------------------------------------------------------------------------
>
>                 Key: PIG-686
>                 URL: https://issues.apache.org/jira/browse/PIG-686
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> Currently, there is quite a bit of overhead in how the data is serialized in both cases because type information is stored with each field.
> However, most of the time the data has a known and consistent schema, in which case it is sufficient to store the schema once.
> This change could significantly decrease the amount of intermediate data generated.
