Posted to commits@impala.apache.org by st...@apache.org on 2020/06/01 21:26:20 UTC
[impala] 03/03: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
This is an automated email from the ASF dual-hosted git repository.
stakiar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 00ef25888080bb1ec792c01177ab6ebcff447c5d
Author: Sahil Takiar <ta...@gmail.com>
AuthorDate: Thu May 28 13:49:17 2020 -0700
IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
This sets hive.optimize.sort.dynamic.partition to true (and
hive.optimize.sort.dynamic.partition.threshold to 1) when loading
tpcds.store_sales. The option takes effect during Hive dynamic-partition
inserts. It adds a sort to the insert query so that all rows are sorted on
the partition key, which lets each reducer keep only one file open at a
time while writing.
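The effect of the sort on open files can be illustrated with a small simulation. This is only a sketch and not Hive's actual writer logic: the function name peak_open_files, the key count of 1800, and the row count are all hypothetical. It assumes an unsorted writer keeps every partition file open until the task ends, while a sorted writer closes the previous file as soon as the partition key changes.

```python
import random


def peak_open_files(rows, sorted_input):
    """Return the peak number of simultaneously open partition files.

    rows: iterable of partition keys, one per row, in arrival order.
    sorted_input: if True, the writer assumes rows arrive grouped by key
    and closes the previous file as soon as the key changes. If False,
    every file stays open until the end (assumed behaviour, simplified).
    """
    open_files = set()
    peak = 0
    previous = None
    for key in rows:
        if sorted_input and previous is not None and key != previous:
            # Sorted stream: the previous partition can never reappear.
            open_files.discard(previous)
        open_files.add(key)
        peak = max(peak, len(open_files))
        previous = key
    return peak


random.seed(0)
# Hypothetical workload: 50,000 rows spread over up to 1800 partition keys.
keys = [random.randrange(1800) for _ in range(50_000)]

unsorted_peak = peak_open_files(keys, sorted_input=False)
sorted_peak = peak_open_files(sorted(keys), sorted_input=True)
print("peak open files, unsorted input:", unsorted_peak)
print("peak open files, sorted input:  ", sorted_peak)
```

With sorted input the peak stays at one file, whereas with unsorted input it grows to the number of distinct partition keys the task sees.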
When this config is set to false, Hive writes to multiple partitions at
the same time, so a single Hive container has many file handles open at
once. This can lead to OOM issues on the Hive side as well as disk-space
issues with HDFS. When a file is opened on HDFS, the NameNode reserves an
entire block for it, even if the resulting file is smaller than one block.
If there isn't enough disk space for all of the reservations, inserts
start failing because HDFS reports that the cluster does not have enough
capacity.
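A rough back-of-the-envelope calculation shows how quickly these reservations add up. The numbers are assumptions for illustration, not measurements from this change: 4 containers, 1800 open partition files per container, the HDFS default block size of 128 MiB, and a replication factor of 3. Multiplying the reservation by the replication factor is itself an assumption of this sketch, and reserved_bytes is a hypothetical helper.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def reserved_bytes(open_files, block_size_bytes, replication=1):
    """Space reserved if every open file holds one full block.

    Assumes one reserved block per open file, counted once per replica.
    """
    return open_files * block_size_bytes * replication


BLOCK_SIZE = 128 * MIB   # HDFS default dfs.blocksize
REPLICATION = 3          # HDFS default dfs.replication
CONTAINERS = 4           # hypothetical number of concurrent writers
PARTITIONS = 1800        # hypothetical partitions touched per writer

# Unsorted: every writer holds one open file per partition it has seen.
unsorted_total = reserved_bytes(CONTAINERS * PARTITIONS, BLOCK_SIZE, REPLICATION)
# Sorted: every writer holds a single open file at a time.
sorted_total = reserved_bytes(CONTAINERS * 1, BLOCK_SIZE, REPLICATION)

print(f"unsorted: {unsorted_total / GIB:.1f} GiB reserved")
print(f"sorted:   {sorted_total / GIB:.1f} GiB reserved")
```

Under these assumptions the unsorted insert asks for space on the order of terabytes, far more than a small test cluster provides, while the sorted insert needs only a few blocks.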
The change is only necessary when loading tpcds.store_sales. Adding it
to other dynamic partitioning inserts does not seem to be necessary. It
is likely that the issue only shows up when reading from an
unpartitioned table and inserting into a partitioned table. In this
case, loading tpcds.store_sales requires reading from
tpcds_unpartitioned.store_sales. The other dynamic partitioning inserts
all read from a partitioned table and write to a partitioned table.
This patch does not significantly increase the runtime of data loading.
Testing:
* Ran core tests
* Ran core tests for Impala-EC
Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Reviewed-on: http://gerrit.cloudera.org:8080/15998
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Sahil Takiar <st...@cloudera.com>
---
testdata/datasets/tpcds/tpcds_schema_template.sql | 2 ++
1 file changed, 2 insertions(+)
diff --git a/testdata/datasets/tpcds/tpcds_schema_template.sql b/testdata/datasets/tpcds/tpcds_schema_template.sql
index e6686d3..91c6c29 100644
--- a/testdata/datasets/tpcds/tpcds_schema_template.sql
+++ b/testdata/datasets/tpcds/tpcds_schema_template.sql
@@ -772,6 +772,8 @@ set hive.exec.max.dynamic.partitions.pernode=10000;
 set hive.exec.max.dynamic.partitions=10000;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set hive.exec.dynamic.partition=true;
+set hive.optimize.sort.dynamic.partition=true;
+set hive.optimize.sort.dynamic.partition.threshold=1;
 insert overwrite table {table_name} partition(ss_sold_date_sk)
 select ss_sold_time_sk,