Posted to issues@tez.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/12/14 05:00:02 UTC

[jira] [Comment Edited] (TEZ-2950) Poor performance of UnorderedPartitionedKVWriter

    [ https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290330#comment-16290330 ] 

Rohini Palaniswamy edited comment on TEZ-2950 at 12/14/17 4:59 AM:
-------------------------------------------------------------------

Here is a simpler suggestion to try to speed it up a bit. It can probably be addressed in a separate jira as a short-term solution, leaving this one for the long-term solution.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022 - For each partition, each spill file is opened once. For a parallelism of 1000 and 8500 spills, that makes 8500000 file open calls. We can try keeping the first N file handles always open (this will need a new IFile.Reader method that does not close the underlying input stream but does the rest of close(), like freeing up the decompressor and buffers). Let us say we keep the first 100 spill files always open: those 100 files are then opened once each instead of 1000 times each, which cuts the number of file open calls down to 8400100. For a parallelism of 1000 and 100 spills, it cuts file open calls down from 100000 to 100.
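
The counts above can be reproduced with a rough sketch like the one below. This is only the arithmetic, not code from Tez: the class name SpillOpenCalls, the method openCalls, and its parameters are made up for illustration. It assumes every spill file that is not kept open gets opened once per partition, and every kept-open file gets opened exactly once.

```java
// Illustrative arithmetic only; SpillOpenCalls and openCalls are hypothetical
// names and do not exist in Tez.
class SpillOpenCalls {

    /**
     * Estimates file open calls made while merging spill files.
     *
     * @param partitions number of partitions (the parallelism)
     * @param spills     number of spill files
     * @param keptOpen   how many of the first spill files stay open for the whole merge
     */
    static long openCalls(long partitions, long spills, long keptOpen) {
        if (partitions < 0 || spills < 0 || keptOpen < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        if (partitions == 0) {
            return 0; // nothing to merge, so no file is opened
        }
        // Cannot keep more files open than there are spills.
        long pinned = Math.min(keptOpen, spills);
        // Pinned files are opened once; every other file is reopened per partition.
        return pinned + (spills - pinned) * partitions;
    }

    public static void main(String[] args) {
        System.out.println(openCalls(1000, 8500, 0));   // 8500000
        System.out.println(openCalls(1000, 8500, 100)); // 8400100
        System.out.println(openCalls(1000, 100, 0));    // 100000
        System.out.println(openCalls(1000, 100, 100));  // 100
    }
}
```

The gain depends entirely on how N compares with the spill count: with 8500 spills, keeping 100 handles open saves only about 1% of the open calls, while with 100 spills it removes nearly all of them.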


was (Author: rohini):
Here is a simpler suggestion to try to speed it up a bit. It can probably be addressed in a separate jira as a short-term solution, leaving this one for the long-term solution.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L1010-L1022 - For each partition, each spill file is opened once. For a parallelism of 1000 and 8500 spills, that makes 8500000 file open calls. This can be cut down by batching the spill file reads and partition writes. Let's say that for a batch size of 10, 10 writers (partitions) and 10 spill file readers are kept open in parallel while the merge is done. That cuts file open calls down by 10x, to 850000.

> Poor performance of UnorderedPartitionedKVWriter
> ------------------------------------------------
>
>                 Key: TEZ-2950
>                 URL: https://issues.apache.org/jira/browse/TEZ-2950
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-2950.001_prelim.patch
>
>
> Came across a job which was taking a long time in UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data from spill files (8500 spills) and then writing the final compressed merge file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not just buffer and keep writing directly to the final file, which would save a lot of time?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)