Posted to issues@tez.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/03/25 22:51:22 UTC

[jira] [Commented] (TEZ-945) ColumnStore-like intermediate file format for shuffle

    [ https://issues.apache.org/jira/browse/TEZ-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947208#comment-13947208 ] 

Gopal V commented on TEZ-945:
-----------------------------

This approach might result in better compression for systems which use proper Writable types. This would be possible with a custom MR app and even some parts of Pig.

For something like Hive, which uses a SerDe to transform rows into bytes, this approach of overloading the data output mechanisms won't work: all rows will be bytes of varying sizes, not tuples of ints or floats.

This clearly proves that there is value in splitting up data into multiple streams when it comes to ETL efficiency.

But to include Hive in this stream splitting, we need a more data-agnostic approach and support for SerDe-based key/value collections.

The only assumption we can make is that keys are repeated more often than values and that the keys will be sorted.
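
As a rough illustration of that assumption (a sketch only, not the format proposed in the attached design and not an existing Tez or IFile API), sorted key/value pairs of opaque bytes could be split into a run-length-encoded key stream and a separate value stream. The names split_streams and merge_streams are hypothetical, and the sample data is made up:

```python
from itertools import groupby

def split_streams(pairs):
    """Split sorted (key, value) byte pairs into two streams.

    Assumes only that equal keys are adjacent (the input is sorted by key).
    Keys and values are treated as opaque bytes, as a SerDe would emit them.
    """
    key_runs = []  # one (key, run_length) entry per distinct adjacent key
    values = []    # all values, in their original order
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        run = [value for _, value in group]
        key_runs.append((key, len(run)))
        values.extend(run)
    return key_runs, values

def merge_streams(key_runs, values):
    """Rebuild the original (key, value) pairs from the two streams."""
    it = iter(values)
    return [(key, next(it)) for key, count in key_runs for _ in range(count)]

# Made-up sample: keys repeat, values vary in size.
pairs = [(b"a", b"1"), (b"a", b"22"), (b"b", b"3"), (b"c", b"4"), (b"c", b"5")]
key_runs, values = split_streams(pairs)
restored = merge_streams(key_runs, values)
```

Each repeated key is stored once with a count, so the key stream shrinks as repetition grows, and the two streams could then be compressed independently.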

> ColumnStore-like intermediate file format for shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-945
>                 URL: https://issues.apache.org/jira/browse/TEZ-945
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Tsuyoshi OZAWA
>         Attachments: design.pdf
>
>
> In ETL workloads, intermediate data can be large. It is generally known that the shuffle between the map phase and the reduce phase is the main bottleneck.
> By improving IFile, the file format used for shuffle in Hadoop MapReduce and Tez, we can improve shuffle performance. One idea is to introduce the column-store idea into IFile. It can improve the compression ratio or reduce the overhead of IFile. As a result, the performance of ETL jobs can improve.



--
This message was sent by Atlassian JIRA
(v6.2#6252)