Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/07/02 11:56:25 UTC

[jira] [Commented] (TEZ-1228) Prototype IFile : Define a memory & merge optimized vertex-intermediate file format for Tez

    [ https://issues.apache.org/jira/browse/TEZ-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049791#comment-14049791 ] 

Rajesh Balamohan commented on TEZ-1228:
---------------------------------------

@gopalv For backward compatibility, it would be beneficial to retain the setRLE() feature in IFile. This would enable PipelinedSorter to work out of the box without any code changes. For DefaultSorter, we need a follow-up ticket to make use of RLE.

> Prototype IFile : Define a memory & merge optimized vertex-intermediate file format for Tez
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1228
>                 URL: https://issues.apache.org/jira/browse/TEZ-1228
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>              Labels: perfomance
>         Attachments: TEZ-1228-IFile.pdf, TEZ-1228.WIP.1.patch, TEZ-1228.WIP.2.patch
>
>
> The current vertex-intermediate format used throughout Tez is a flat file of variable-length (K,V) pairs. In a significant number of use cases, in particular the sorted output phase, the same stream contains a large number of consecutive identical keys. The IFile format writes each key out in full to generate (K,V) pairs instead of organizing the stream into a more efficient K, {V1, .. Vn} list.
> This duplication of key data requires larger in-memory buffers and forces comparisons between keys that are already known to be identical during a merge sort.
> This issue tracks building a prototype IFile format that is optimized for smaller uncompressed sizes within memory buffers and for less compute-intensive merge sorts during the reducer phase.
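
To make the K, {V1, .. Vn} idea in the description concrete, here is a minimal sketch that collapses a sorted stream of (K,V) pairs into one entry per run of identical keys. It is not the Tez IFile API or its on-disk layout: the class and method names (KeyRunGrouping, Run, group) are hypothetical, and keys and values are plain strings purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Tez.
final class KeyRunGrouping {

    /** One run: a key stored once, followed by all of its consecutive values. */
    static final class Run {
        final String key;
        final List<String> values = new ArrayList<>();

        Run(String key) {
            this.key = key;
        }
    }

    /**
     * Collapses consecutive pairs that share a key into a single Run.
     * Each input element is a two-element array: {key, value}.
     * The input is assumed to be sorted by key already, so identical
     * keys are adjacent and only the previous key needs to be checked.
     */
    static List<Run> group(List<String[]> sortedPairs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (String[] kv : sortedPairs) {
            // Start a new run only when the key differs from the previous one.
            if (current == null || !current.key.equals(kv[0])) {
                current = new Run(kv[0]);
                runs.add(current);
            }
            current.values.add(kv[1]);
        }
        return runs;
    }
}
```

With this shape, each distinct key is held in memory once per run, and a merge only needs to compare keys at run boundaries rather than for every pair.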



--
This message was sent by Atlassian JIRA
(v6.2#6252)