Posted to issues@tez.apache.org by "Saikat (JIRA)" <ji...@apache.org> on 2015/06/24 17:29:05 UTC

[jira] [Updated] (TEZ-2574) Make a better metadata Value split choice in Pipeline sort

     [ https://issues.apache.org/jira/browse/TEZ-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saikat updated TEZ-2574:
------------------------
    Attachment: TEZ-2574.patch

Patch logic:
1. In the PipelineSort constructor, allocate a span with the default per-item size (16).
2. When the first KV pair is written (the first call to write()), calculate a new per-item size hint. If this gives a better split, update the span.
3. Subsequently, when a span fills up and a new span is to be allocated, use the average item size of the last span to determine the meta/value split of the new span. (In the future this could be extended to use the average KV length seen across all past spans.)
4. To keep the splits small (as in the existing implementation), the item count is capped at the smaller of the desired number of items and the newly calculated number of items: min(1M, numitems).
Example:
Under the old logic, a 30 MB buffer is split into 15 MB for metadata and 15 MB for values.
A 20 MB key-value pair therefore does not fit into the buffer and causes an exception.
With the new implementation, the buffer is split into 10 MB for metadata and 20 MB for values, thus accommodating the KV pair.
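
The steps above can be sketched roughly as follows. This is not the attached patch, only a minimal illustration under stated assumptions: the class and method names are hypothetical, METASIZE is assumed to be 16 bytes per item, and "1M" is assumed to mean 1024 * 1024 items. The exact split the patch produces (10 MB and 20 MB in the example above) depends on details not shown here; the sketch only shows that sizing the metadata region from the observed item size leaves room for a large KV pair.

```java
// Hypothetical sketch of an item-size-driven meta/value split (not the actual patch).
final class AdaptiveSplitSketch {
    static final int METASIZE = 16;            // assumed metadata bytes per item
    static final int MAX_ITEMS = 1024 * 1024;  // assumed meaning of the "1M" cap in step 4
    static final int DEFAULT_PER_ITEM = 16;    // default per-item size from step 1

    /** Step 3: average item size of the last span, falling back to the default. */
    static int averageItemSize(long bytesWritten, int items) {
        return items == 0 ? DEFAULT_PER_ITEM : (int) (bytesWritten / items);
    }

    /** Step 4: items that fit at this per-item size, capped at MAX_ITEMS. */
    static int itemsFor(int capacity, int perItem) {
        long fit = capacity / ((long) perItem + METASIZE);
        return (int) Math.min(MAX_ITEMS, fit);
    }

    /** Bytes reserved for metadata; the rest of the span holds keys and values. */
    static int metaSize(int capacity, int perItem) {
        return METASIZE * itemsFor(capacity, perItem);
    }

    public static void main(String[] args) {
        int capacity = 30 * 1024 * 1024;   // 30 MB span
        int perItem = 20 * 1024 * 1024;    // size hint taken from a 20 MB first KV pair
        int meta = metaSize(capacity, perItem);
        // prints "meta=16 data=31457264"
        System.out.println("meta=" + meta + " data=" + (capacity - meta));
    }
}
```

With the default hint of 16 bytes per item the same function reproduces the even 15 MB / 15 MB split, so the difference comes entirely from the size hint.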

> Make a better metadata Value split choice in Pipeline sort
> ----------------------------------------------------------
>
>                 Key: TEZ-2574
>                 URL: https://issues.apache.org/jira/browse/TEZ-2574
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Saikat
>            Assignee: Saikat
>            Priority: Minor
>         Attachments: TEZ-2574.patch
>
>
> In the current implementation of pipeline sort, a new sort span object is created with hard-coded values of 1M items and 16 bytes per item.
> According to the present code logic,
>       int metasize = METASIZE*maxItems;
>       int dataSize = maxItems * perItem;
>       if(capacity < (metasize+dataSize)) {
>         // try to allocate less meta space, because we have sample data
>         metasize = METASIZE*(capacity/(perItem+METASIZE));
>       }
> if the capacity is less than 32 MB, the buffer is split in half between the meta and value buffers, which is inefficient.
> We need a more generic split, based on the size of the KV pairs written to the buffer.
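
To make the halving concrete, the quoted calculation can be wrapped in a small standalone program. This is only a sketch: the class and method names are hypothetical, METASIZE is assumed to be 16 bytes (which matches the 32 MB threshold mentioned in the description), and "1M items" is assumed to mean 1024 * 1024.

```java
// Hypothetical wrapper around the quoted split calculation, for illustration only.
final class CurrentSplitSketch {
    static final int METASIZE = 16; // assumed metadata bytes per item

    /** Returns the bytes reserved for metadata out of a span of the given capacity. */
    static int metaSize(int capacity, int maxItems, int perItem) {
        int metasize = METASIZE * maxItems;
        int dataSize = maxItems * perItem;
        if (capacity < (metasize + dataSize)) {
            // not enough room for maxItems: scale the metadata region down
            metasize = METASIZE * (capacity / (perItem + METASIZE));
        }
        return metasize;
    }

    public static void main(String[] args) {
        int capacity = 30 * 1024 * 1024;                 // 30 MB span
        int meta = metaSize(capacity, 1024 * 1024, 16);  // hard-coded 1M items, 16 bytes each
        // prints "meta=15728640 data=15728640", i.e. 15 MB each
        System.out.println("meta=" + meta + " data=" + (capacity - meta));
    }
}
```

Because perItem and METASIZE are both 16 under these assumptions, any capacity below 32 MB ends up divided evenly, regardless of how large the actual key-value pairs are.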



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)