Posted to issues@tez.apache.org by "Kuhu Shukla (JIRA)" <ji...@apache.org> on 2016/05/05 14:53:12 UTC

[jira] [Commented] (TEZ-2950) Poor performance of UnorderedPartitionedKVWriter

    [ https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272472#comment-15272472 ] 

Kuhu Shukla commented on TEZ-2950:
----------------------------------

Proposed Approach based on offline discussions with [~jlowe], [~jeagles].

Add a configurable feature in which we bypass the spill files and instead open {{numPartitions}} IFile writers (and codecs) in the constructor. On {{write()}}, instead of going through a buffer ({{currentBuffer}}), the KV pair is appended directly to the partition file we opened for it. We close all the writers when we are done. During {{mergeAll}}, the partition files are concatenated onto the first partition file through {{copyBytes()}} instead of IFile {{read()}}, and that file is then renamed to the final output file name.
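The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Tez implementation: the class {{DirectPartitionWriterSketch}}, the file names, and the length-prefixed record layout are all hypothetical, plain {{java.nio}} streams stand in for IFile writers and codecs, and {{Files.copy}} stands in for {{copyBytes()}}. It also leaves out any bookkeeping of where each partition starts in the final file.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: one open writer per partition, no buffer and no spills. */
final class DirectPartitionWriterSketch implements AutoCloseable {
    private final Path[] partitionFiles;
    private final DataOutputStream[] writers;

    /** Opens numPartitions writers up front, as the proposal does in the constructor. */
    DirectPartitionWriterSketch(Path dir, int numPartitions) throws IOException {
        partitionFiles = new Path[numPartitions];
        writers = new DataOutputStream[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            partitionFiles[p] = dir.resolve("partition-" + p + ".out");
            writers[p] = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(partitionFiles[p])));
        }
    }

    /** Appends the KV pair straight to its partition file (assumed length-prefixed layout). */
    void write(int partition, byte[] key, byte[] value) throws IOException {
        DataOutputStream out = writers[partition];
        out.writeInt(key.length);
        out.write(key);
        out.writeInt(value.length);
        out.write(value);
    }

    /** Closes every partition writer; call once, before mergeAll. */
    @Override
    public void close() throws IOException {
        for (DataOutputStream writer : writers) {
            writer.close();
        }
    }

    /**
     * Concatenates partitions 1..n-1 onto the first partition file as raw bytes,
     * without decoding any records, then renames that file to the final output.
     */
    Path mergeAll(Path finalOutput) throws IOException {
        Path first = partitionFiles[0];
        try (OutputStream out = Files.newOutputStream(first, StandardOpenOption.APPEND)) {
            for (int p = 1; p < partitionFiles.length; p++) {
                Files.copy(partitionFiles[p], out);
                Files.delete(partitionFiles[p]);
            }
        }
        return Files.move(first, finalOutput, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("direct-writer-sketch");
        DirectPartitionWriterSketch writer = new DirectPartitionWriterSketch(dir, 4);
        try {
            writer.write(1, "k1".getBytes(StandardCharsets.UTF_8),
                    "v1".getBytes(StandardCharsets.UTF_8));
            writer.write(3, "k2".getBytes(StandardCharsets.UTF_8),
                    "v2".getBytes(StandardCharsets.UTF_8));
        } finally {
            writer.close();
        }
        Path merged = writer.mergeAll(dir.resolve("file.out"));
        System.out.println("merged " + Files.size(merged) + " bytes into " + merged);
    }
}
```

The point of the byte-level concatenation is that the merge never decompresses or re-reads individual records, which is where the reported time in {{mergeAll}} was going.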

Pipelined shuffle ignores this setting and the new behavior altogether.

For jobs with auto-parallelism on, with at most 999 downstream tasks, it might be possible to fit {{numPartitions}} open codecs in the heap. Where the client sets the parallelism to something higher, the old scheme might be better in terms of heap usage.
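One way to picture how the setting, pipelined shuffle, and the partition count could interact is the small sketch below. Everything in it is an assumption made for illustration: the class and enum names are hypothetical, and the {{maxOpenWriters}} cutoff is not part of the proposal, which only observes that the old scheme may use less heap at higher parallelism.

```java
/** Hypothetical sketch of choosing between the direct-write path and the old spill path. */
final class WriterStrategySketch {
    enum Strategy { DIRECT_PARTITION_FILES, BUFFER_AND_SPILL }

    /**
     * @param directWriteEnabled the new configurable feature is switched on
     * @param pipelinedShuffle   pipelined shuffle is in use, which ignores the new behavior
     * @param numPartitions      number of downstream partitions
     * @param maxOpenWriters     assumed cutoff above which open codecs cost too much heap
     */
    static Strategy choose(boolean directWriteEnabled, boolean pipelinedShuffle,
                           int numPartitions, int maxOpenWriters) {
        if (!directWriteEnabled || pipelinedShuffle) {
            return Strategy.BUFFER_AND_SPILL;
        }
        return numPartitions <= maxOpenWriters
                ? Strategy.DIRECT_PARTITION_FILES
                : Strategy.BUFFER_AND_SPILL;
    }

    public static void main(String[] args) {
        // 999 is the auto-parallelism ceiling mentioned above, used here as the assumed cutoff.
        System.out.println(choose(true, false, 500, 999));
        System.out.println(choose(true, false, 5000, 999));
    }
}
```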

Appreciate any comments/thoughts on this approach. Thanks a lot!

> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
>                 Key: TEZ-2950
>                 URL: https://issues.apache.org/jira/browse/TEZ-2950
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Kuhu Shukla
>
> Came across a job which was taking a long time in UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data from spill files (8500 spills) and then writing the final compressed merge file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not just buffer and keep writing directly to the final file, which would save a lot of time?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)