Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/06/29 00:49:51 UTC

[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Optimize serialization/deserialization between Map and Reduce and between MR jobs
---------------------------------------------------------------------------------

                 Key: PIG-1472
                 URL: https://issues.apache.org/jira/browse/PIG-1472
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
            Assignee: Thejas M Nair
             Fix For: 0.8.0


In certain types of Pig queries, most of the execution time is spent serializing and deserializing (sedes) records between Map and Reduce and between MR jobs.
For example, if the PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, and L10 in PigMix v1) that transmit records with bags and maps across map or reduce boundaries run much longer (a runtime increase of a few times has been seen).

A few optimizations have been shown to improve sedes performance in my tests:
1. Use fewer bytes to store the length of a column. For example, if a bytearray is smaller than 255 bytes, a single byte can store the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half.
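
The two optimizations above can be illustrated with a minimal, self-contained sketch. It is not the code from the patch: the class name CompactSedesSketch, the marker constants, and the helper methods are all hypothetical, and the layout (a one-byte marker followed by a one-byte or four-byte length) is only one assumed way to let the reader know how wide the length field is. The only real APIs used are the java.io stream classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative only: names and byte layout are assumptions, not Pig's format. */
public class CompactSedesSketch {
    // Hypothetical markers telling the reader how wide the length field is.
    static final byte TINY_BYTEARRAY = 1; // length stored in 1 unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    /** Optimization 1: write the length with the narrowest field that fits. */
    static byte[] writeBytes(byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            if (value.length <= 255) {
                out.writeByte(TINY_BYTEARRAY);
                out.writeByte(value.length); // writes the low 8 bits
            } else {
                out.writeByte(BYTEARRAY);
                out.writeInt(value.length);
            }
            out.write(value);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back a value produced by writeBytes. */
    static byte[] readBytes(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            byte marker = in.readByte();
            int length = (marker == TINY_BYTEARRAY) ? in.readUnsignedByte() : in.readInt();
            byte[] value = new byte[length];
            in.readFully(value);
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Optimization 2: delegate String sedes to DataOutput.writeUTF. */
    static byte[] writeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s); // 2-byte length prefix, then modified UTF-8 bytes
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back a value produced by writeString via DataInput.readUTF. */
    static String readString(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a 10-byte array costs 12 bytes on the wire instead of 15 with a fixed int length. Note that writeUTF itself rejects strings whose encoded form exceeds 65535 bytes, so a real implementation would need a fallback path for longer strings.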

Zebra and BinStorage are known to use the DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format will differ from the format used across M/R boundaries.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885835#action_12885835 ] 

Thejas M Nair commented on PIG-1472:
------------------------------------

I ran the pigmix1 queries, changed to specify types for the columns in the load function, with this patch, and I see significant performance improvements for the following queries.
Times are in seconds, and each time in the table is the best of 3 runs. The queries that show significant improvement (> 10%) are the ones where bag/map columns are written across Map/Reduce boundaries. (In PigMix v1, the L2 and L3 queries don't prune the bag and map columns before the group-by or join.)

|| pigmix1 query || before patch (s) || after patch (s) || % diff ||
| L1 | 211 | 198 | 6% |
| L2 | 514 | 424 | 17% |
| L3 | 670 | 541 | 19% |
| L4 | 133 | 123 | 7.5% |
| L5 | 118 | 113 | |
| L6 | 139 | 134 | |
| L7 | 114 | 114 | |
| L8 | 68 | 69 | |
| L9 | 1113 | 957 | 14% |
| L10 | 1153 | 998 | 13.4% |
| L11 | 317 | 317 | |
| L12 | 124 | 123 | |
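
The filled-in values in the % diff column are consistent with the runtime reduction taken relative to the before-patch time. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the rounding applied in the table is an assumption, so the sketch prints one decimal place rather than trying to reproduce the table's formatting.

```java
/** Illustrative only: recomputes the relative runtime reduction from the table's timings. */
public class PercentDiffSketch {
    /** Runtime reduction in percent, relative to the before-patch time. */
    static double percentImprovement(double beforeSecs, double afterSecs) {
        return (beforeSecs - afterSecs) / beforeSecs * 100.0;
    }

    public static void main(String[] args) {
        // Timings (seconds) copied from the table rows for L2, L3, L9, and L10.
        String[] queries = {"L2", "L3", "L9", "L10"};
        double[] before = {514, 670, 1113, 1153};
        double[] after = {424, 541, 957, 998};
        for (int i = 0; i < queries.length; i++) {
            System.out.printf("%s: %.1f%%%n", queries[i], percentImprovement(before[i], after[i]));
        }
    }
}
```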



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Attachment: PIG-1472.3.patch

Patch with fixes for the javac, javadoc, and findbugs warnings. The tests that were reported as failed pass when I run them on my machine; the failures seem to have been caused by problems in the Hudson environment.



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Attachment: PIG-1472.patch

Summary of changes in the patch:
1. The default TupleFactory is now BinSedesTupleFactory. It returns the tuple implementation in the BinSedesTuple class. This changes the serialization format between Map and Reduce.
2. The (de)serialization in BinSedesTuple and DefaultAbstractBag uses an implementation of a new InterSedes interface, which is returned by InterSedesFactory.getInterSedesInstance().
3. A new load function, InterStorage, is used for serializing data between MR jobs. It should not be used like a regular load/store function to store persistent data.
4. DefaultTupleFactory has been retained so that any external UDFs that use it still compile. DefaultTupleFactory is a subclass of BinSedesTupleFactory that does not override any of its functions.

I think the serialization format should not be tied to the Tuple. A load function can return any tuple implementation; if we happen to call the write function of that tuple, it will not be possible to read it on the reduce side using the default tuple. I think the InterSedes/InterSedesFactory classes should be used instead. With this patch, the InterSedes/InterSedesFactory classes are used only when BinSedesTuple is the default tuple.
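
The idea of keeping the wire format independent of the tuple class can be sketched as follows. This is not Pig's InterSedes API: the Sedes interface, UtfSedes, and SedesFactory below are hypothetical stand-ins, and a "tuple" is reduced to a List<String> purely to keep the example short. The point it tries to show is that the writer and the reader both obtain the format from one factory, so whichever container class produced the fields does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not the interfaces added by the patch. */
public class InterSedesSketch {
    /** Hypothetical format interface: knows the wire layout, not the container class. */
    interface Sedes {
        void write(DataOutput out, List<String> fields) throws IOException;
        List<String> read(DataInput in) throws IOException;
    }

    /** One assumed format: field count, then each field via writeUTF. */
    static final class UtfSedes implements Sedes {
        @Override
        public void write(DataOutput out, List<String> fields) throws IOException {
            out.writeInt(fields.size());
            for (String field : fields) {
                out.writeUTF(field);
            }
        }

        @Override
        public List<String> read(DataInput in) throws IOException {
            int count = in.readInt();
            List<String> fields = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                fields.add(in.readUTF());
            }
            return fields;
        }
    }

    /** Hypothetical factory: the single place that decides which format is in use. */
    static final class SedesFactory {
        private static final Sedes INSTANCE = new UtfSedes();

        static Sedes getInstance() {
            return INSTANCE;
        }
    }

    /** Serializes with the factory's format and reads the bytes back. */
    static List<String> roundTrip(List<String> fields) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            SedesFactory.getInstance().write(new DataOutputStream(buf), fields);
            DataInput in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            return SedesFactory.getInstance().read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because roundTrip never asks the input list to serialize itself, a LinkedList and an ArrayList holding the same fields produce identical bytes and read back as equal lists.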



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Attachment: PIG-1472.4.patch

Removed unused static constants from InterStorage and BinStorage, addressing comment #1 from Daniel.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887545#action_12887545 ] 

Daniel Dai commented on PIG-1472:
---------------------------------

+1 for commit.


[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886281#action_12886281 ] 

Hadoop QA commented on PIG-1472:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448937/PIG-1472.2.patch
  against trunk revision 960062.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 69 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    -1 javac.  The applied patch generated 148 javac compiler warnings (more than the trunk's current 145 warnings).

    -1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

    -1 release audit.  The applied patch generated 400 release audit warnings (more than the trunk's current 399 warnings).

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/console

This message is automatically generated.


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886647#action_12886647 ] 

Hadoop QA commented on PIG-1472:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12449033/PIG-1472.3.patch
  against trunk revision 960062.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 69 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 395 release audit warnings (more than the trunk's current 394 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/console

This message is automatically generated.


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886820#action_12886820 ] 

Thejas M Nair commented on PIG-1472:
------------------------------------

The audit warning diff looks bogus. The contrib tests passed when I ran them on my machine; the failures seem to be caused by the Hudson environment.

The changes in PIG-1295 will need to be ported to work with this new serialization format. For that patch, I think we should introduce a new function in InterSedes that can compare two serialized tuples, and also add a function to BinSedesTuple that returns the corresponding InterSedes class.
Then, while selecting the comparator, add a check to see whether the default tuple type is BinSedesTuple; if it is, use the corresponding InterSedes function as the comparator class.
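
As a rough illustration of what comparing two serialized tuples can mean, here is a minimal sketch that orders records by reading the key straight out of the bytes, without rebuilding the record objects. Everything in it is assumed for the example: the class name, the record layout (a 4-byte big-endian int key followed by a writeUTF payload), and the comparator constant. It does not reflect the format or the comparator that Pig actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Comparator;

/** Illustrative only: hypothetical record layout, not Pig's serialization format. */
public class RawCompareSketch {
    /** Assumed layout: 4-byte big-endian int key, then the payload via writeUTF. */
    static byte[] serialize(int key, String payload) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(key);
            out.writeUTF(payload);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Orders serialized records by key only. ByteBuffer defaults to big-endian,
     * which matches DataOutputStream.writeInt, so the key is read in place.
     */
    static final Comparator<byte[]> RAW_KEY_ORDER =
            (a, b) -> Integer.compare(ByteBuffer.wrap(a).getInt(), ByteBuffer.wrap(b).getInt());
}
```

The design choice the sketch highlights is that the comparator has to know the serialization layout, which is why the comment above ties the choice of comparator to the serialization class in use.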



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to trunk.

> Optimize serialization/deserialization between Map and Reduce and between MR jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs.
> For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).
> There are a few optimizations that have been shown to improve the performance of sedes in my tests -
> 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half.
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
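
The two optimizations listed in the description can be illustrated with a minimal Java sketch. This is not the code from the patch: the class name, the type-marker constants and their values, and the helper methods are all hypothetical, chosen only to show the idea of picking the narrowest length field and of delegating String sedes to DataOutput.writeUTF / DataInput.readUTF. Only the java.io calls are real APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; names and marker values are assumptions, not Pig's.
final class CompactSedesSketch {
    static final byte TINY_BYTEARRAY = 1;  // length stored in 1 unsigned byte
    static final byte SMALL_BYTEARRAY = 2; // length stored in 2 bytes (unsigned short)
    static final byte BYTEARRAY = 3;       // length stored in a 4-byte int

    // Optimization 1: choose the smallest length field that can hold the length.
    static void writeBytes(DataOutput out, byte[] data) throws IOException {
        int len = data.length;
        if (len <= 0xFF) {
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(len);
        } else if (len <= 0xFFFF) {
            out.writeByte(SMALL_BYTEARRAY);
            out.writeShort(len);
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(len);
        }
        out.write(data);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte type = in.readByte();
        int len;
        switch (type) {
            case TINY_BYTEARRAY:  len = in.readUnsignedByte();  break;
            case SMALL_BYTEARRAY: len = in.readUnsignedShort(); break;
            case BYTEARRAY:       len = in.readInt();           break;
            default: throw new IOException("Unknown type marker: " + type);
        }
        byte[] data = new byte[len];
        in.readFully(data);
        return data;
    }

    // Optimization 2: let the JDK handle String sedes. Note that writeUTF
    // rejects strings whose encoded form exceeds 65535 bytes, so a real
    // implementation needs a fallback for longer strings.
    static void writeString(DataOutput out, String s) throws IOException {
        out.writeUTF(s);
    }

    static String readString(DataInput in) throws IOException {
        return in.readUTF();
    }

    // Convenience wrappers for round-tripping through an in-memory buffer.
    static byte[] encode(byte[] data) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(buf), data);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decode(byte[] encoded) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String roundTripString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buf), s);
            return readString(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A 10-byte value costs 1 marker byte + 1 length byte + 10 payload bytes.
        System.out.println(encode(new byte[10]).length);
        System.out.println(roundTripString("pig"));
    }
}
```

In this sketch a 10-byte bytearray takes 12 bytes on the wire, where a fixed 4-byte length would take 15 with the same one-byte marker. The saving per field is small, but it applies to every short field of every record crossing a map/reduce boundary.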



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887265#action_12887265 ] 

Daniel Dai commented on PIG-1472:
---------------------------------

The patch looks good. A couple of comments:
1. The following constants are never used in BinStorage and InterStorage and should be removed.
{code}
public static final int RECORD_1 = 0x01;
public static final int RECORD_2 = 0x02;
public static final int RECORD_3 = 0x03;
{code}

2. In BinInterSedes, why do we have the type "GENERIC_WRITABLECOMPARABLE"? When will it be used?

3. InterStorage seems to be a replacement for BinStorage, so why do we make it private? Should we encourage users to use InterStorage in place of BinStorage, and deprecate BinStorage?




[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Status: Open  (was: Patch Available)




[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887267#action_12887267 ] 

Daniel Dai commented on PIG-1472:
---------------------------------

Forget point 2: GENERIC_WRITABLECOMPARABLE is also in DataReaderWriter, so we are just following it.




[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887441#action_12887441 ] 

Thejas M Nair commented on PIG-1472:
------------------------------------

bq. 1. The following constants are never used in BinStorage and InterStorage and should be removed.
I will remove them.

bq. 3. InterStorage seems to be a replacement for BinStorage, so why do we make it private? Should we encourage users to use InterStorage in place of BinStorage, and deprecate BinStorage?
In the future, we are likely to find better ways to serialize data between the MR jobs of a Pig query; i.e., the InterSedes serialization format is likely to change, and the change is not likely to be compatible with the old format. So it will not be suitable for storing persistent data.
InterStorage replaces BinStorage only for its use within Pig. Since BinStorage is used in Pig queries and its code should be easy to maintain, I think we don't have to deprecate BinStorage.






[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Attachment: PIG-1472.2.patch

Changed InterRecordWriter.write(..) to use InterSedes.write instead of Tuple.write. 

