Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2015/01/20 04:23:35 UTC

[jira] [Commented] (TEZ-1937) Reduce cost of merging ifiles in UnorderedPartitionedWriter

    [ https://issues.apache.org/jira/browse/TEZ-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283355#comment-14283355 ] 

Siddharth Seth commented on TEZ-1937:
-------------------------------------

The counter should account for compression, since it measures bytes read from disk. It would be better to update it in the IFile.appendIFile method, so that whenever that method is changed to handle compression, the counter fix will be obvious.
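
To illustrate what "bytes read from disk" means when compression is involved, here is a minimal sketch using only the JDK. DiskByteCounter is a hypothetical name, not a Tez or Hadoop class; the assumption is that the counter wraps the raw on-disk stream underneath any decompressor, so it sees compressed bytes rather than decompressed ones. Bytes passed over with skip() are not counted in this sketch.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Tez or Hadoop.
// Wrap the raw (on-disk) stream with this, then layer the decompressor on top,
// so the count reflects compressed bytes actually read from disk.
class DiskByteCounter extends FilterInputStream {
    private long bytesRead = 0;

    DiskByteCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    long getBytesRead() {
        return bytesRead;
    }
}
```

With a decompressing stream layered on top of DiskByteCounter, getBytesRead() reports the compressed size consumed, which is what a disk-read counter would want.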

{code}
+        } else {
+          LOG.warn("Could not obtain decompressor from CodecPool");
+          in = checksumIn;
+        }
{code}
This should throw an exception instead of logging a warning and falling back to checksumIn.
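
A rough sketch of the fail-fast alternative, assuming an IOException is the appropriate type here. PooledDecompressor and requireDecompressor are hypothetical stand-ins so the example compiles without Hadoop on the classpath; they are not the patch's actual code.

```java
import java.io.IOException;

// Illustration only: names below are hypothetical, not Tez or Hadoop APIs.
class DecompressorCheck {
    // Stand-in for whatever decompressor type the codec pool hands back.
    interface PooledDecompressor {}

    // Fail fast instead of silently reading compressed data as if it were raw.
    static PooledDecompressor requireDecompressor(PooledDecompressor d, String codecName)
            throws IOException {
        if (d == null) {
            throw new IOException(
                "Could not obtain decompressor from CodecPool for codec " + codecName);
        }
        return d;
    }
}
```

The point of throwing is that falling back to the undecompressed stream would hand compressed bytes to the reader, which is likely to fail later in a less obvious way.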

{code}
+        prevKey = null;
+        previous.reset();
{code}
Why is this required?

Doesn't each IFile stream (per partition in each spill file) also have a checksum associated with it? I believe using partLength will not copy the checksum, but is a new checksum being computed for the entire partition stream in the writer?
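
To make the checksum concern concrete, here is a minimal sketch of the behaviour being asked about: copy only the payload of each segment and compute one fresh checksum over the merged stream. It uses CRC32 and a 4-byte trailer purely as assumptions for the example; the actual IFile checksum type and size are not shown in this comment, and ChecksumMergeSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

// Illustration only; not the IFile implementation.
class ChecksumMergeSketch {
    // Assumed per-segment checksum trailer size for this sketch.
    static final int TRAILER_LEN = 4;

    // Appends each segment's payload (without its trailer) to 'merged' and
    // returns a checksum computed over the whole merged payload.
    static long mergeWithoutTrailers(List<byte[]> segments, ByteArrayOutputStream merged)
            throws IOException {
        CheckedOutputStream out = new CheckedOutputStream(merged, new CRC32());
        for (byte[] segment : segments) {
            if (segment.length < TRAILER_LEN) {
                throw new IOException("Segment shorter than checksum trailer");
            }
            // Skip the old per-segment checksum; it does not cover the merged stream.
            out.write(segment, 0, segment.length - TRAILER_LEN);
        }
        out.flush();
        return out.getChecksum().getValue();
    }
}
```

If the writer already wraps its output in a checksumming stream, the new checksum comes for free; the sketch only shows why the old per-segment checksums cannot simply be carried over.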

Are there any corner cases, such as the same record existing across two files, where RLE could break in any way? I don't think there should be.

> Reduce cost of merging ifiles in UnorderedPartitionedWriter
> -----------------------------------------------------------
>
>                 Key: TEZ-1937
>                 URL: https://issues.apache.org/jira/browse/TEZ-1937
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1937.1.patch, TEZ-1937.2.patch, TEZ-1937.WIP.patch
>
>
> Currently we iterate through all spilled files for merging.  This incurs additional deserialization cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)