Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/03/11 00:20:50 UTC

[jira] Created: (HIVE-336) TBinaryProtocol blow up the data size between map-reduce boundary

TBinaryProtocol blow up the data size between map-reduce boundary
-----------------------------------------------------------------

                 Key: HIVE-336
                 URL: https://issues.apache.org/jira/browse/HIVE-336
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.2.0, 0.3.0
            Reporter: Zheng Shao
            Assignee: Zheng Shao
            Priority: Blocker


TBinaryProtocol is very space-inefficient. We've seen data grow to several times its size at the map-reduce boundary because of it.

We should change it to a simple delimited format (backed by LazySimpleSerDe).
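
To make the overhead concrete, here is a rough sketch comparing the encoded size of one row of string columns under a TBinaryProtocol-style layout and under a simple delimited layout. It is a simplified model, not Hive's or Thrift's actual code: it assumes each string field costs a 1-byte type tag, a 2-byte field id and a 4-byte length prefix, with a 1-byte stop marker per struct, and it assumes a single-byte Ctrl-A separator for the delimited form. The function names and the sample row are made up for illustration.

```python
import struct

T_STRING = 11  # Thrift type id for string fields


def thrift_binary_size(columns):
    """Bytes for one row encoded as a struct of string fields,
    using a TBinaryProtocol-style layout (simplified model)."""
    size = 0
    for field_id, value in enumerate(columns, start=1):
        payload = value.encode("utf-8")
        header = struct.pack(">bh", T_STRING, field_id)  # type byte + 16-bit field id
        length = struct.pack(">i", len(payload))         # 32-bit length prefix
        size += len(header) + len(length) + len(payload)
    return size + 1  # one stop byte closes the struct


def delimited_size(columns, sep=b"\x01"):
    """Bytes for the same row as separator-joined text (assumed Ctrl-A separator)."""
    return len(sep.join(c.encode("utf-8") for c in columns))


row = ["a", "bc", "def", "g"]  # hypothetical 4-column row
print(thrift_binary_size(row), delimited_size(row))
```

In this model the binary layout adds 7 bytes per field plus 1 per row, while the delimited layout adds 1 byte per separator, so the relative blow-up is largest when column values are short.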


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-336) TBinaryProtocol blow up the data size between map-reduce boundary

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688940#action_12688940 ] 

Zheng Shao commented on HIVE-336:
---------------------------------

We had an example query which sends 4 string columns from mapper to reducer.

The map output was 220MB using DynamicSerDe with TCTLSeparatedProtocol. With LazySimpleSerDe it is now about 140MB, a net space saving of roughly one third.
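
As a quick check of the figure quoted above, the relative saving can be computed directly from the two sizes. The variable names are illustrative; the only inputs are the 220MB and 140MB numbers from the comment.

```python
before_mb = 220  # map output with DynamicSerDe + TCTLSeparatedProtocol
after_mb = 140   # map output with LazySimpleSerDe (approximate)

saving = (before_mb - after_mb) / before_mb
print(f"saved {before_mb - after_mb}MB, {saving:.1%} of the original size")
```

This comes to about 36%, slightly more than one third.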





[jira] Resolved: (HIVE-336) TBinaryProtocol blow up the data size between map-reduce boundary

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HIVE-336.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0
     Hadoop Flags: [Reviewed]

As part of HIVE-337, we changed the SerDe for the value part of the key-value pair passed between map and reduce to LazySimpleSerDe.


