Posted to dev@hive.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/04/09 12:57:14 UTC

[jira] [Created] (HIVE-6872) Explore options of optimizing FileSinkOperator-->getDynOutPaths()

Rajesh Balamohan created HIVE-6872:
--------------------------------------

             Summary: Explore options of optimizing FileSinkOperator-->getDynOutPaths()
                 Key: HIVE-6872
                 URL: https://issues.apache.org/jira/browse/HIVE-6872
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Priority: Critical


1. Download hive-testbench from https://github.com/cartershanklin/hive-testbench
2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned" 
3. Data population for most tables with "partition + bucket + sorted data" runs very slowly, even at a scale factor of 10 on a 20-node cluster.

The bottleneck seems to be in FileSinkOperator-->getDynOutPaths(), where it closes the FSPath writers. Each call takes roughly 150-200 ms.

set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=4096;

With the above settings, one of the data loads (for the web_sales table) spent roughly 4096 * 150 ms = 614.4 seconds (about 10 minutes) closing the writers sequentially.

The purpose of this JIRA is to explore options for optimizing this code path.
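
As a starting point for discussion, here is a minimal Java sketch of one possible option: closing the writers on a small thread pool instead of one after another. It is not Hive code. ParallelCloseSketch, SlowWriter, sequentialSeconds and closeAll are hypothetical names, and SlowWriter only simulates a slow close() with a sleep. Whether the real FSPath writers can safely be closed concurrently is an assumption that would need to be verified first.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCloseSketch {

    /** Hypothetical stand-in for a writer whose close() is slow (not a Hive class). */
    static final class SlowWriter implements Closeable {
        private final long closeMillis;
        private volatile boolean closed;

        SlowWriter(long closeMillis) {
            this.closeMillis = closeMillis;
        }

        @Override
        public void close() throws IOException {
            try {
                Thread.sleep(closeMillis); // simulates the 150-200 ms close cost
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
            }
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Estimated wall-clock seconds when every close runs one after another. */
    static double sequentialSeconds(int writers, long millisPerClose) {
        return writers * millisPerClose / 1000.0;
    }

    /** Closes all writers on a fixed-size pool and returns the number of failed closes. */
    static int closeAll(List<? extends Closeable> writers, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Closeable w : writers) {
                // Callable form, so the checked IOException from close() can propagate.
                pending.add(pool.submit(() -> {
                    w.close();
                    return null;
                }));
            }
            int failures = 0;
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for this close to finish
                } catch (ExecutionException e) {
                    failures++; // a real implementation would have to surface this error
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Figures from the report: 4096 writers at about 150 ms each.
        System.out.println("sequential estimate (s): " + sequentialSeconds(4096, 150));

        List<SlowWriter> writers = new ArrayList<>();
        for (int i = 0; i < 64; i++) {
            writers.add(new SlowWriter(5)); // short delay keeps the demo fast
        }
        System.out.println("failed closes: " + closeAll(writers, 16));
    }
}
```

With N threads the close phase would ideally shrink by about a factor of N, but only if the underlying file system handles concurrent closes well. Error handling and ordering guarantees would also need care before anything like this could go into FileSinkOperator.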



--
This message was sent by Atlassian JIRA
(v6.2#6252)