Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/06/29 01:49:51 UTC
[jira] Created: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
-----------------------------------------------------------------------------------------------------
Key: PIG-1473
URL: https://issues.apache.org/jira/browse/PIG-1473
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Fix For: 0.8.0
The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance.
Avoid sedes where possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes .
The load function uses subclasses of Map and DataBag that hold the serialized copy. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called.
Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Deserialization of column b can be delayed until here using this approach.
{CODE}
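The idea above can be sketched in Java roughly as follows. This is only an illustration, not the implementation proposed for Pig: the class name LazyMap, its helper methods, and the simplified "[key#value,key#value]" text format are all assumptions made for the example. The map keeps the serialized bytes and parses them the first time any java.util.Map method is called.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch (not a Pig class): a Map that holds the serialized
 * bytes and parses them only when a java.util.Map method is first called.
 * Assumed, simplified text format: [k1#v1,k2#v2]
 */
final class LazyMap extends AbstractMap<String, String> {
    private final byte[] serialized;     // raw bytes as read by the loader
    private Map<String, String> parsed;  // stays null until first access

    LazyMap(byte[] serialized) {
        this.serialized = serialized;
    }

    /** True once the bytes have been parsed (exposed for illustration). */
    boolean isDeserialized() {
        return parsed != null;
    }

    /** AbstractMap routes get(), size(), containsKey(), etc. through here. */
    @Override
    public Set<Map.Entry<String, String>> entrySet() {
        if (parsed == null) {
            parsed = parse(new String(serialized, StandardCharsets.UTF_8));
        }
        return parsed.entrySet();
    }

    /** Parses the assumed format; throws if the text is malformed. */
    private static Map<String, String> parse(String text) {
        int n = text.length();
        if (n < 2 || text.charAt(0) != '[' || text.charAt(n - 1) != ']') {
            throw new IllegalArgumentException("malformed map: " + text);
        }
        Map<String, String> result = new LinkedHashMap<>();
        String body = text.substring(1, n - 1);
        if (body.isEmpty()) {
            return result;
        }
        for (String pair : body.split(",")) {
            int sep = pair.indexOf('#');
            if (sep < 0) {
                throw new IllegalArgumentException("malformed entry: " + pair);
            }
            result.put(pair.substring(0, sep), pair.substring(sep + 1));
        }
        return result;
    }
}
```

In the query above, column b would travel through the FOREACH and FILTER as such an object without being parsed; only a call such as b.get("k1") or b.size() would trigger the parse.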
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888953#action_12888953 ]
Thejas M Nair commented on PIG-1473:
------------------------------------
I assume you mean deserializing the map/bag when tuple.get(i) is called (approach #2 mentioned in http://wiki.apache.org/pig/AvoidingSedes ). That is already happening indirectly for PigStorage.
In the case of PigStorage, the deserialization happens outside the load function, in a type-casting foreach statement that gets added to the Pig query plan. PigStorage itself always returns only bytearrays.
As a result of the column pruning optimization, columns that are not going to be touched are not cast to their corresponding types.
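For comparison, deserializing a field only when it is requested (the behaviour referred to as approach #2) might look roughly like the sketch below. LazyFieldTuple and the per-column caster functions are hypothetical names used only for this illustration; PigStorage gets a similar effect through the cast foreach and column pruning described above, not through a class like this.

```java
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch (not a Pig class): a tuple that keeps every field as
 * raw text and casts a field only the first time get(i) asks for it.
 */
final class LazyFieldTuple {
    private final String[] raw;                           // untyped fields from the loader
    private final List<Function<String, Object>> casters; // one caster per column (assumed)
    private final Object[] values;                        // cached cast results
    private final boolean[] done;                         // which fields have been cast
    private int castCount;                                // number of casts performed

    LazyFieldTuple(String[] raw, List<Function<String, Object>> casters) {
        this.raw = raw;
        this.casters = casters;
        this.values = new Object[raw.length];
        this.done = new boolean[raw.length];
    }

    /** Casts field i on first access; a failed cast yields null. */
    Object get(int i) {
        if (!done[i]) {
            try {
                values[i] = casters.get(i).apply(raw[i]);
            } catch (RuntimeException e) {
                values[i] = null; // malformed field is treated as null
            }
            done[i] = true;
            castCount++;
        }
        return values[i];
    }

    /** How many fields have actually been cast so far (for illustration). */
    int castCount() {
        return castCount;
    }
}
```

Fields that are never requested are never cast, which is the saving that column pruning provides for untouched columns.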
[jira] Commented: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888942#action_12888942 ]
Dmitriy V. Ryaboy commented on PIG-1473:
----------------------------------------
Thejas, do you think there could be any performance gains if we could delay deserialization of the top-level fields in the tuple, but deserialize whole maps or databags if they are touched?
[jira] Assigned: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair reassigned PIG-1473:
----------------------------------
Assignee: Thejas M Nair
[jira] Commented: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883382#action_12883382 ]
Jeff Zhang commented on PIG-1473:
---------------------------------
This sounds like the lazy deserialization in Hive. Great!
[jira] Resolved: (PIG-1473) Avoid serialization/deserialization
costs for PigStorage data - Use custom Map and Bag implementation
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair resolved PIG-1473.
--------------------------------
Resolution: Won't Fix
Implementing lazy deserialization using this approach would introduce a backward-incompatible change in PigStorage, so I am closing this JIRA as Won't Fix.
In PigStorage, if deserialization fails, the value is treated as null, i.e. tuple.get(i) returns null.
But if deserialization is delayed by returning a subclass of map or bag that holds the serialized data, the tuple.get(i) call will return a non-null value even if the serialized format has a problem.
Although this approach is not being implemented in PigStorage() for this reason, other load/store functions can potentially adopt it.
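The incompatibility can be illustrated with a small Java sketch. All names here (NullOnFailureDemo, eagerField, lazyField) and the simplified "[key#value]" format are assumptions for the example, not Pig APIs. The eager path turns a malformed value into null up front; the lazy path hands back a non-null map and the problem only surfaces when the map is used.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; none of these names are Pig APIs. */
final class NullOnFailureDemo {

    /** Parses the assumed format "[k#v,k#v]"; throws if malformed. */
    static Map<String, String> parse(String text) {
        int n = text.length();
        if (n < 2 || text.charAt(0) != '[' || text.charAt(n - 1) != ']') {
            throw new IllegalArgumentException("malformed map: " + text);
        }
        Map<String, String> result = new HashMap<>();
        String body = text.substring(1, n - 1);
        if (body.isEmpty()) {
            return result;
        }
        for (String pair : body.split(",")) {
            int sep = pair.indexOf('#');
            if (sep < 0) {
                throw new IllegalArgumentException("malformed entry: " + pair);
            }
            result.put(pair.substring(0, sep), pair.substring(sep + 1));
        }
        return result;
    }

    /** Eager behaviour: a value that fails to deserialize becomes null. */
    static Map<String, String> eagerField(String text) {
        try {
            return parse(text);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Lazy behaviour: always returns a non-null map; parsing happens on use. */
    static Map<String, String> lazyField(final String text) {
        return new AbstractMap<String, String>() {
            private Map<String, String> parsed;

            @Override
            public Set<Map.Entry<String, String>> entrySet() {
                if (parsed == null) {
                    parsed = parse(text); // a malformed value fails here, not at get(i)
                }
                return parsed.entrySet();
            }
        };
    }
}
```

A script that checks a field for null to detect bad records would see null from eagerField("not-a-map") but a non-null object from lazyField("not-a-map"), which then throws on the first call such as size().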