Posted to issues@tez.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/04/30 06:10:16 UTC

[jira] [Commented] (TEZ-661) Implement a non-sorted partitioned output

    [ https://issues.apache.org/jira/browse/TEZ-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985144#comment-13985144 ] 

Gopal V commented on TEZ-661:
-----------------------------

LGTM for the sort avoidance.

Minor comment on the use of the constant 4 throughout: I think it means different things in different places, but I'm not sure (is it always META_SIZE?).

The recursive BufferTooSmallException handling wastes some memory when a single value among a group of small values is too big for the buffer.

An example test case would be KV<1KB>, KV<2MB>, KV<1KB>, etc. It is a corner case, but each time you switch between small and large values you'll get 2 files.

Might as well write both the current buffer and the KV<2MB> into one file, to reduce the number of eventual files.

The reason to do this is to get fewer files if possible - but this can be taken care of later.
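
To make the file-count argument concrete, here is a minimal sketch that only counts spill files for a sequence of record sizes. It is not the code in the patch: SpillSketch, countSpills and the combineOversized flag are hypothetical names, and the model assumes that a record larger than the buffer is always written straight to a file.

```java
// Hypothetical model of the two spill policies; not the Tez implementation.
class SpillSketch {

    /**
     * Counts the spill files produced for the given record sizes (in bytes).
     *
     * @param recordSizes      size of each key-value record, in arrival order
     * @param bufferCapacity   capacity of the in-memory buffer, in bytes
     * @param combineOversized if true, an oversized record is written into the
     *                         same file as the currently buffered records
     */
    static int countSpills(int[] recordSizes, int bufferCapacity, boolean combineOversized) {
        int spills = 0;
        int used = 0; // bytes currently held in the buffer

        for (int size : recordSizes) {
            if (size > bufferCapacity) {
                // The record can never fit in the buffer (assumed to be the
                // BufferTooSmallException case).
                if (combineOversized) {
                    // One file holds the buffered records plus the big record.
                    spills++;
                } else {
                    // One file for the buffered records (if any) ...
                    if (used > 0) {
                        spills++;
                    }
                    // ... and a second file for the big record alone.
                    spills++;
                }
                used = 0;
            } else {
                if (used + size > bufferCapacity) {
                    // Ordinary spill: the buffer is full.
                    spills++;
                    used = 0;
                }
                used += size;
            }
        }
        // Flush whatever is left at the end.
        if (used > 0) {
            spills++;
        }
        return spills;
    }
}
```

Under these assumptions, the sequence KV<1KB>, KV<2MB>, KV<1KB>, KV<2MB>, KV<1KB> with a 1MB buffer gives 5 files when the big record is spilled separately and 3 files when it is combined with the current buffer.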

On the IFile front, my only gripe is the decompress, read, recompress cycle for the merge. I'll make a note to look for a fast(er) way to concatenate two IFile streams.

> Implement a non-sorted partitioned output
> -----------------------------------------
>
>                 Key: TEZ-661
>                 URL: https://issues.apache.org/jira/browse/TEZ-661
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Siddharth Seth
>         Attachments: TEZ-661.1.txt
>
>
> When implementing Pig union, we need to gather data from two or more upstream vertexes without sorting. The vertex itself might consist of several tasks. Ideally, it should use OnFileUnorderedKVOutput with DataMovementType.SCATTER_GATHER. However, this combination does not work according to [~hitesh]. We need to implement that. Also, the key is meaningless in this scenario; we just want to evenly distribute the output records to tasks.
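
One simple way to distribute records evenly when the key carries no meaning is round-robin assignment. The sketch below illustrates that idea only; RoundRobinSketch and its methods are hypothetical names, it does not implement any Tez interface, and it is not necessarily what the attached patch does.

```java
// Hypothetical round-robin partition chooser; ignores the record key entirely.
class RoundRobinSketch {
    private final int numPartitions;
    private int next = 0; // partition to hand out on the next call

    RoundRobinSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Returns the partition for the next record, cycling 0, 1, ..., n-1, 0, ... */
    int nextPartition() {
        int p = next;
        next = (next + 1) % numPartitions;
        return p;
    }

    /** Counts how many of numRecords records land in each partition. */
    static int[] distribute(int numRecords, int numPartitions) {
        RoundRobinSketch rr = new RoundRobinSketch(numPartitions);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numRecords; i++) {
            counts[rr.nextPartition()]++;
        }
        return counts;
    }
}
```

With this scheme the partition sizes differ by at most one record; for example, 10 records over 3 downstream tasks split as 4, 3 and 3.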


