Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/06/29 02:02:51 UTC

[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple

Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
--------------------------------------------------------------------------------

                 Key: PIG-1474
                 URL: https://issues.apache.org/jira/browse/PIG-1474
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
            Assignee: Thejas M Nair
             Fix For: 0.8.0


Avoid sedes when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes .

The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map-Reduce boundary and use the load function's serialized format between Map and Reduce.
To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property "pig.data.tuple.factory.name".
This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
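
The idea can be illustrated with a minimal sketch. It is not the proposed implementation: RawLineTuple and isParsed() are hypothetical names, the class does not implement Pig's actual Tuple interface, and tab-delimited UTF-8 text with a length-prefixed wire format is assumed purely for illustration. The tuple keeps the bytes of the line as the loader read them, write() and readFields() copy those bytes unchanged, and fields are parsed only when one is first requested.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration only; NOT Pig's Tuple interface.
 * Holds the loader's raw delimited line and defers parsing.
 */
class RawLineTuple {
    private byte[] raw = new byte[0];  // line bytes exactly as loaded
    private final byte delimiter;      // field separator (assumed single byte)
    private List<String> fields;       // null until a field is first requested

    RawLineTuple(byte delimiter) {
        this.delimiter = delimiter;
    }

    RawLineTuple(byte[] raw, byte delimiter) {
        this.raw = raw.clone();
        this.delimiter = delimiter;
    }

    /** Plays the role of write(): emit the raw bytes, no per-field encoding. */
    void write(DataOutput out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);
    }

    /** Plays the role of readFields(): read the raw bytes back, no parsing. */
    void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        fields = null;
    }

    /** Parses the line on first access, then serves fields from the cache. */
    String get(int index) {
        if (fields == null) {
            fields = split();
        }
        return fields.get(index);
    }

    /** True once the raw bytes have been split into fields. */
    boolean isParsed() {
        return fields != null;
    }

    private List<String> split() {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= raw.length; i++) {
            if (i == raw.length || raw[i] == delimiter) {
                result.add(new String(raw, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return result;
    }
}
```

In this sketch a tuple written on the map side and read back on the reduce side is never parsed unless a field is actually accessed, which is the cost the issue aims to avoid.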


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1474:
-------------------------------

    Fix Version/s: 0.9.0
                       (was: 0.8.0)

Unlinking from 0.8 release.
I was planning to use the lazy implementations of Map and Bag proposed in PIG-1473 for this. Those objects would have held a copy of the serialized versions of the map and bag. But the plan in that jira had to be abandoned for reasons mentioned there. A different approach is required to solve the issue.


