Posted to commits@impala.apache.org by st...@apache.org on 2020/06/01 21:26:20 UTC

[impala] 03/03: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This is an automated email from the ASF dual-hosted git repository.

stakiar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 00ef25888080bb1ec792c01177ab6ebcff447c5d
Author: Sahil Takiar <ta...@gmail.com>
AuthorDate: Thu May 28 13:49:17 2020 -0700

    IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
    
    This sets hive.optimize.sort.dynamic.partition to true when loading
    tpcds.store_sales. This option takes effect during Hive dynamic partitioning
    inserts. It introduces a sort into the insert query so that all data is
    sorted on the partition key. This allows each reducer to keep only a
    single file open at a time while writing out the data.
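
    As a rough sketch (hypothetical table names, not taken from this change),
    the pattern the option targets looks like the following. Hive requires the
    dynamic partition column to come last in the select list; with the sort
    optimization enabled, Hive shuffles and sorts rows on that column before
    the write, so each reducer keeps at most one partition file open at a time:

        set hive.exec.dynamic.partition=true;
        set hive.exec.dynamic.partition.mode=nonstrict;
        set hive.optimize.sort.dynamic.partition=true;

        -- store_sales_unpartitioned / store_sales_partitioned are placeholder
        -- names. The target partition of each row is derived from
        -- ss_sold_date_sk, the last expression in the select list.
        insert overwrite table store_sales_partitioned partition (ss_sold_date_sk)
        select ss_sold_time_sk,
               ss_item_sk,
               ss_quantity,
               ss_sold_date_sk
        from store_sales_unpartitioned;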
    
    When this config is set to false, Hive writes to multiple partitions at
    the same time, so a single Hive container holds multiple file handles
    open at once. This can lead to OOM issues on the Hive side as well as
    disk space issues with HDFS. When a file is opened for writing on HDFS,
    the NameNode reserves an entire block for it, even if the resulting file
    is smaller than a block. If there is not enough disk space for all of
    these reservations, inserts start failing because HDFS reports that
    there is not enough capacity on the cluster.
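
    As a back-of-envelope illustration of the reservation effect described
    above (assumed figures, not measurements from this change): with the
    default 128 MB HDFS block size, a writer holding 1,800 partition files
    open at once has about 225 GB of block space reserved per replica, and
    roughly three times that against raw capacity with 3x replication, even
    if every finished file is only a few MB:

        -- Assumed figures for illustration only: 1,800 open partition files,
        -- 128 MB block size, replication factor 3.
        select 1800 * 128 / 1024.0     as reserved_gb_per_replica,
               1800 * 128 * 3 / 1024.0 as reserved_gb_raw;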
    
    The change is only necessary when loading tpcds.store_sales. Adding it
    to other dynamic partitioning inserts does not seem to be necessary. It
    is likely that the issue only shows up when reading from an
    unpartitioned table and inserting into a partitioned table. In this
    case, loading tpcds.store_sales requires reading from
    tpcds_unpartitioned.store_sales. The other dynamic partitioning inserts
    all read from a partitioned table and write to a partitioned table.
    
    This patch does not introduce a significant performance regression in
    the runtime of data-load generation.
    
    Testing:
    * Ran core tests
    * Ran core tests for Impala-EC
    
    Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
    Reviewed-on: http://gerrit.cloudera.org:8080/15998
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Sahil Takiar <st...@cloudera.com>
---
 testdata/datasets/tpcds/tpcds_schema_template.sql | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/testdata/datasets/tpcds/tpcds_schema_template.sql b/testdata/datasets/tpcds/tpcds_schema_template.sql
index e6686d3..91c6c29 100644
--- a/testdata/datasets/tpcds/tpcds_schema_template.sql
+++ b/testdata/datasets/tpcds/tpcds_schema_template.sql
@@ -772,6 +772,8 @@ set hive.exec.max.dynamic.partitions.pernode=10000;
 set hive.exec.max.dynamic.partitions=10000;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set hive.exec.dynamic.partition=true;
+set hive.optimize.sort.dynamic.partition=true;
+set hive.optimize.sort.dynamic.partition.threshold=1;
 
 insert overwrite table {table_name} partition(ss_sold_date_sk)
 select ss_sold_time_sk,