Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/03/11 00:20:50 UTC
[jira] Created: (HIVE-336) TBinaryProtocol blow up the data size
between map-reduce boundary
TBinaryProtocol blow up the data size between map-reduce boundary
-----------------------------------------------------------------
Key: HIVE-336
URL: https://issues.apache.org/jira/browse/HIVE-336
Project: Hadoop Hive
Issue Type: Bug
Components: Serializers/Deserializers
Affects Versions: 0.2.0, 0.3.0
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Blocker
TBinaryProtocol is very space-inefficient. We've seen the data grow to several times its size at the map-reduce boundary because of it.
We should switch to a simple delimited format (backed by LazySimpleSerDe).
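To illustrate where the overhead comes from, here is a rough sketch that compares the two encodings for a row of 4 short string columns. It assumes the generic Thrift binary struct layout (1-byte type and 2-byte field id per field, 4-byte length prefix per string, 1-byte stop marker) and a Ctrl-A separator for the delimited form. The function names and the sample row are made up for illustration, and the bytes Hive actually writes may differ in detail.

```python
import struct

T_STRING = 11  # Thrift type id for a string field
T_STOP = 0     # Thrift marker that ends a struct


def thrift_binary_row(columns):
    """Encode string columns as one struct in a Thrift-binary-style layout (sketch)."""
    out = bytearray()
    for field_id, text in enumerate(columns, start=1):
        data = text.encode("utf-8")
        out += struct.pack(">bh", T_STRING, field_id)  # 1-byte type + 2-byte field id
        out += struct.pack(">i", len(data))            # 4-byte length prefix
        out += data
    out += struct.pack(">b", T_STOP)                   # 1-byte end-of-struct marker
    return bytes(out)


def delimited_row(columns, sep="\x01"):
    """Encode string columns as a single Ctrl-A delimited record (sketch)."""
    return sep.join(columns).encode("utf-8")


row = ["a1", "b2", "c3", "d4"]  # hypothetical sample row
binary_size = len(thrift_binary_row(row))
delimited_size = len(delimited_row(row))
print(binary_size, delimited_size)
```

The binary form pays 7 bytes of framing per column plus 1 byte per row, while the delimited form pays 1 byte per separator, so the gap is largest when the column values are short.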
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-336) TBinaryProtocol blow up the data size
between map-reduce boundary
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688940#action_12688940 ]
Zheng Shao commented on HIVE-336:
---------------------------------
We had an example query that sends 4 string columns from mapper to reducer.
The map output data was 220 MB using DynamicSerDe with TCTLSeparatedProtocol. It is about 140 MB using LazySimpleSerDe right now. The net space saving is a little over 1/3.
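For reference, the saving can be checked from the two sizes reported above. The variable names are only for illustration.

```python
before_mb = 220  # map output with DynamicSerDe + TCTLSeparatedProtocol
after_mb = 140   # map output with LazySimpleSerDe

saving = (before_mb - after_mb) / before_mb
print(f"{saving:.1%}")  # 36.4%
```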
[jira] Resolved: (HIVE-336) TBinaryProtocol blow up the data size
between map-reduce boundary
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao resolved HIVE-336.
-----------------------------
Resolution: Fixed
Fix Version/s: 0.3.0
Hadoop Flags: [Reviewed]
As part of HIVE-337, we changed the SerDe for the value part of the key-value pairs passed between map and reduce to LazySimpleSerDe.