Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/03/25 00:51:50 UTC

[jira] Commented: (HIVE-336) TBinaryProtocol blow up the data size between map-reduce boundary

    [ https://issues.apache.org/jira/browse/HIVE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688940#action_12688940 ] 

Zheng Shao commented on HIVE-336:
---------------------------------

We have an example query that sends 4 string columns from the mapper to the reducer.

The map output was 220MB using DynamicSerDe with TCTLSeparatedProtocol. It is now about 140MB using LazySimpleSerDe, a net space saving of about 36% (roughly 1/3).
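
For reference, the saving can be checked with a few lines of Python. This is only a sketch of the arithmetic: the two sizes are the figures quoted above, and the helper name relative_saving is made up for illustration.

```python
def relative_saving(before_mb: float, after_mb: float) -> float:
    """Fraction of space saved when going from before_mb to after_mb."""
    return (before_mb - after_mb) / before_mb


# Figures quoted in the comment: DynamicSerDe + TCTLSeparatedProtocol vs. LazySimpleSerDe.
saving = relative_saving(220, 140)
print(f"saved {saving:.1%} of the map output size")
```

The result is 80/220, which is a little over one third.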


> TBinaryProtocol blow up the data size between map-reduce boundary
> -----------------------------------------------------------------
>
>                 Key: HIVE-336
>                 URL: https://issues.apache.org/jira/browse/HIVE-336
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0, 0.3.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>             Fix For: 0.3.0
>
>
> TBinaryProtocol is very space-inefficient. We've seen the data grow to several times its size across the map-reduce boundary because of it.
> We should change it to a simple delimited format (backed by LazySimpleSerDe).
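
To illustrate where the overhead comes from, here is a rough sketch in Python. It does not use Hive or the Thrift library. It only imitates the Thrift binary layout for a struct of string fields (a 1-byte type and 2-byte field id per field, a 4-byte length per string, and a 1-byte stop marker) and compares it with a row joined by a single separator byte. The function names, the sample row, and the choice of Ctrl-A as separator are assumptions made for this example, not taken from the Hive code.

```python
import struct

T_STRING = 11  # Thrift type id for a string field
T_STOP = 0     # Thrift marker that ends a struct


def binary_protocol_row(columns):
    """Encode string columns roughly the way Thrift's binary protocol lays out a struct."""
    out = bytearray()
    for field_id, value in enumerate(columns, start=1):
        data = value.encode("utf-8")
        out += struct.pack(">bh", T_STRING, field_id)  # field header: type + id (3 bytes)
        out += struct.pack(">i", len(data))            # string length prefix (4 bytes)
        out += data
    out += struct.pack(">b", T_STOP)                   # end-of-struct marker (1 byte)
    return bytes(out)


def delimited_row(columns, sep="\x01"):
    """Encode the same columns as text joined by one separator byte (Ctrl-A assumed)."""
    return sep.join(columns).encode("utf-8")


row = ["a", "bb", "ccc", "dddd"]  # hypothetical row with 4 short string columns
print(len(binary_protocol_row(row)), "bytes in the binary layout")
print(len(delimited_row(row)), "bytes in the delimited layout")
```

With short values like these, the fixed 7 bytes per field dominate, so the binary layout comes out about three times larger. The gap narrows as the column values get longer.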
