Posted to dev@hive.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/05/13 05:05:14 UTC

[jira] [Updated] (HIVE-7052) Optimize split calculation time

     [ https://issues.apache.org/jira/browse/HIVE-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-7052:
-----------------------------------

    Attachment: HIVE-7052-profiler-2.png
                HIVE-7052-profiler-1.png

> Optimize split calculation time
> -------------------------------
>
>                 Key: HIVE-7052
>                 URL: https://issues.apache.org/jira/browse/HIVE-7052
>             Project: Hive
>          Issue Type: Bug
>         Environment: hive + tez
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: HIVE-7052-profiler-1.png, HIVE-7052-profiler-2.png
>
>
> When running a TPC-DS query (query_27), a significant amount of time was spent in split computation on a 200 GB dataset (ORC format).
> Profiling revealed that:
> 1. A lot of time was spent in Config's substitutevar (regex) in the HiveInputFormat.getSplits() method.
> 2. The FileSystem was created repeatedly in OrcInputFormat.generateSplitsInfo().
> I will attach the profiler snapshots soon.
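
The two hot spots in the description share a shape: an expensive, repeatable operation runs once per file or per path when its result could be reused. The following is a minimal Java sketch of that idea only. It is not the HIVE-7052 patch and it uses no Hive or Hadoop APIs; the class SplitPlanningSketch, its methods, the ${name} variable syntax, and the plain Object standing in for a file system handle are all hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Hive or Hadoop.
public class SplitPlanningSketch {
    // Assumed variable syntax for this sketch: ${name}
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Counters that make the avoided work visible.
    int substitutions = 0;
    int handlesCreated = 0;

    private final Map<String, String> vars;
    private final Map<String, String> resolved = new HashMap<>();
    private final Map<String, Object> handles = new HashMap<>();

    SplitPlanningSketch(Map<String, String> vars) {
        this.vars = vars;
    }

    // Expensive path: a regex scan over the raw value on every call.
    // Unknown variables are left in place unchanged.
    String substitute(String raw) {
        substitutions++;
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Cheap path: each distinct raw value is resolved once and then reused.
    String substituteCached(String raw) {
        return resolved.computeIfAbsent(raw, this::substitute);
    }

    // One stand-in handle per scheme and authority, instead of one per path.
    Object handleFor(URI path) {
        String key = path.getScheme() + "://" + path.getAuthority();
        return handles.computeIfAbsent(key, k -> {
            handlesCreated++;
            return new Object();
        });
    }

    public static void main(String[] args) {
        SplitPlanningSketch s = new SplitPlanningSketch(Map.of("warehouse", "/apps/hive"));
        for (int i = 0; i < 1000; i++) {
            s.substituteCached("${warehouse}/tpcds/store_sales");
            s.handleFor(URI.create("hdfs://nn:8020/apps/hive/part-" + i));
        }
        System.out.println(s.substitutions + " substitution(s), "
                + s.handlesCreated + " handle(s) for 1000 paths");
    }
}
```

The sketch is single-threaded and never invalidates its caches; a real split generator that runs on multiple threads or that sees configuration changes mid-run would need to account for both.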


