Posted to issues@tez.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/12/14 04:30:04 UTC

[jira] [Commented] (TEZ-2950) Poor performance of UnorderedPartitionedKVWriter

    [ https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290330#comment-16290330 ] 

Rohini Palaniswamy commented on TEZ-2950:
-----------------------------------------

Here is a simpler suggestion to try to speed it up a bit. It can probably be addressed in a separate jira as a short-term solution, leaving this one for the long-term solution.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022 - For each partition, each spill file is opened once. With a parallelism of 1000 and 8500 spills, that makes 8,500,000 file open calls. This can be cut down by batching the spill file reads and partition writes. For example, with a batch size of 10, 10 writers (partitions) and 10 spill file readers are kept open in parallel while the merge is done. That cuts file opens by 10x, to 850,000.
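
The arithmetic above can be checked with a small sketch. This is not Tez code: the class and method names are hypothetical, and it assumes my reading of the proposal, namely that each spill file is opened once per batch of partitions rather than once per partition.

```java
// Hypothetical helper, not part of Tez: counts spill file opens during the final merge.
final class SpillOpenCount {
    private SpillOpenCount() {}

    /** Current behaviour as described: every partition opens every spill file once. */
    static long unbatched(long partitions, long spills) {
        return partitions * spills;
    }

    /**
     * Assumed batched behaviour: partitions are merged batchSize at a time,
     * so each spill file is opened once per partition batch.
     */
    static long batched(long partitions, long spills, long batchSize) {
        // Ceiling division, so a final partial batch still counts as one pass.
        long partitionBatches = (partitions + batchSize - 1) / batchSize;
        return partitionBatches * spills;
    }

    public static void main(String[] args) {
        System.out.println(unbatched(1000, 8500));   // 8500000
        System.out.println(batched(1000, 8500, 10)); // 850000
    }
}
```

Under this model the number of spill readers held open at once does not change the open count. It only bounds how many file handles are in use at the same time.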

> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
>                 Key: TEZ-2950
>                 URL: https://issues.apache.org/jira/browse/TEZ-2950
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data from spill files (8500 spills) and then writing the final compressed merge file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not just buffer and keep writing directly to the final file, which would save a lot of time?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)