Posted to dev@pig.apache.org by "Yan Zhou (JIRA)" <ji...@apache.org> on 2010/02/23 18:16:27 UTC

[jira] Updated: (PIG-1198) [zebra] performance improvements

     [ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1198:
--------------------------

    Attachment: PIG-1198.patch

This patch is based on the load-store-redesign branch, so it may differ in minor ways from the final patch to be applied to the trunk because the code bases differ. It is therefore for review purposes only; no submission is intended.

> [zebra] performance improvements
> --------------------------------
>
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1198.patch
>
>
> Input splits are currently generated row-based on individual TFiles. As a result, one split is still generated for each TFile even when the TFile is smaller than one block. Consequently, many mappers, and many waves, are needed to handle the many small TFiles generated by as many mappers/reducers that wrote the data. This issue can be addressed by generating input splits that can include multiple TFiles.
> For sorted tables, the table-level key distribution, which is used to generate proper input splits, includes key distributions from column groups even if they are not in the projection. This incurs the extra cost of unnecessary computation and, worse, produces unreasonable input splits.
> For unsorted tables, when row splits are generated on a union of tables, FileSplits are generated for each table and then lumped together to form the final list of splits passed to Map/Reduce. The undesirable consequence is that the number of splits depends on the number of tables in the union rather than being controlled solely by the number of splits used by the Map/Reduce framework.
> The input split's goal size is calculated over all column groups, even those that are not in the projection.
> For input splits of multiple files in one column group, all files are opened at startup. This is unnecessary and holds resources needlessly from start to end. Files should be opened only when needed and closed as soon as they are no longer needed.
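
The first and fourth points of the description can be illustrated with a minimal sketch. It is not taken from the attached patch and does not use the Zebra or Hadoop APIs; the class SplitPacker and its methods pack and goalSize are hypothetical names introduced only for illustration. It assumes the goal size is derived from the bytes of the projected column groups alone, and that small files are packed greedily, in order, into splits of at most that size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Zebra implementation.
public class SplitPacker {

    /**
     * Goal size in bytes per split, counting only column groups that are
     * in the projection (assumed policy: projected bytes / desired splits).
     */
    public static long goalSize(long[] columnGroupBytes, boolean[] projected, int desiredSplits) {
        long total = 0;
        for (int i = 0; i < columnGroupBytes.length; i++) {
            if (projected[i]) {
                total += columnGroupBytes[i];
            }
        }
        // Guard against a zero divisor and a zero goal size.
        return Math.max(1, total / Math.max(1, desiredSplits));
    }

    /**
     * Greedily packs file sizes (in bytes) into splits of at most goalSize
     * bytes. A file larger than goalSize ends up in a split of its own.
     */
    public static List<List<Long>> pack(List<Long> fileSizes, long goalSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the goal.
            if (!current.isEmpty() && currentBytes + size > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With this kind of packing, several TFiles that are each smaller than a block share one split, so the number of mappers follows the total projected data size rather than the file count.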

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.