Posted to reviews@impala.apache.org by "Sahil Takiar (Code Review)" <ge...@cloudera.org> on 2020/05/28 20:52:41 UTC

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15998 )


Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................

IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This sets hive.optimize.sort.dynamic.partition to true by default during
data load. This option takes effect during Hive dynamic partitioning
inserts. It introduces a sort into the insert query so that all data is
sorted on the partition key. This allows the reducers to only open a single
file at a time when writing out files. When this config is set to false,
dynamic partitioning inserts will be run as a map-only job that
potentially opens hundreds of files per partition.

Testing:
* Ran core tests for Impala-EC

Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
---
M testdata/bin/generate-schema-statements.py
M testdata/datasets/tpcds/tpcds_schema_template.sql
2 files changed, 5 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/15998/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15998
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6154/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 May 2020 23:04:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5921/ DRY_RUN=false



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jun 2020 15:17:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 5: Code-Review+2

Carrying +2.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jun 2020 21:25:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 3: Code-Review+2

(1 comment)

Looks good. It's always cool when a one-liner fixes something big like this.

http://gerrit.cloudera.org:8080/#/c/15998/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/3//COMMIT_MSG@7
PS3, Line 7: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
           : 
           : This sets hive.optimize.sort.dynamic.partition to true by default during
           : data load. This option takes effect during Hive dynamic partitioning
           : inserts.
Since we removed the code setting it for most inserts, let's update the message to mention that this is only for text tpcds.store_sales.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Sun, 31 May 2020 17:42:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6175/ : Initial code review checks failed. See linked job for details on the failure.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Sun, 31 May 2020 00:02:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 1: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5905/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 May 2020 04:01:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15998/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/3//COMMIT_MSG@7
PS3, Line 7: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
           : 
           : This sets hive.optimize.sort.dynamic.partition to true by default during
           : data load. This option takes effect during Hive dynamic partitioning
           : inserts.
> Since we removed the code setting it for most inserts, let's update the mes
Done. Added some notes about why it is only necessary for tpcds.store_sales load as well.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jun 2020 15:16:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15998

to look at the new patch set (#2).

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................

IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This sets hive.optimize.sort.dynamic.partition to true by default during
data load. This option takes effect during Hive dynamic partitioning
inserts. It introduces a sort into the insert query so that all data is
sorted on the partition key. This allows the reducers to only open a single
file at a time when writing out files. When this config is set to false,
dynamic partitioning inserts will be run as a map-only job that
potentially opens hundreds of files per partition, resulting in lots of
small files. Creating all these small files potentially impacts the
health of the Namenode, and can cause data-load to fail altogether.

Testing:
* Ran core tests for Impala-EC

Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
---
M testdata/bin/generate-schema-statements.py
M testdata/datasets/tpcds/tpcds_schema_template.sql
2 files changed, 5 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/15998/2

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15998

to look at the new patch set (#4).

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................

IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This sets hive.optimize.sort.dynamic.partition to true when loading
tpcds.store_sales. This option takes effect during Hive dynamic partitioning
inserts. It introduces a sort into the insert query so that all data is
sorted on the partition key. This allows the reducers to only open a single
file at a time when writing out files.

When this config is set to false, Hive will write to multiple partitions
at the same time. So a single Hive container will have multiple file
handles open at once. This can lead to OOM issues on the Hive side as well
as diskspace issues with HDFS. When a file is opened on HDFS, the
Namenode reserves an entire block for each file, even if the resulting
file is less than a block size. If there isn't enough disk space for all
file reservations, inserts will start failing because HDFS says there is
not enough capacity on the cluster.
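
As a rough illustration of the two paragraphs above (not part of this patch), the sketch below models a single writer consuming rows tagged with a partition key and counts how many partition files it holds open at the peak, with and without sorting on that key. The function names, the 200-partition input, and the 128 MiB block size are assumptions made for the example. The "one reserved block per open file" estimate simply follows the description above and ignores replication and other HDFS details.

```python
import random

# Assumed block size for the estimate; the real value depends on the cluster.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def peak_open_files(partition_keys, sorted_input):
    """Return the peak number of partition files one writer holds open.

    Simplified model: with unsorted input the writer cannot tell whether a key
    will reappear, so it keeps every file open until the input ends. With
    sorted input it closes the current file as soon as the key changes.
    """
    keys = sorted(partition_keys) if sorted_input else list(partition_keys)
    open_files = set()
    peak = 0
    previous = None
    for key in keys:
        if sorted_input and previous is not None and key != previous:
            open_files.discard(previous)  # the previous partition is complete
        open_files.add(key)
        peak = max(peak, len(open_files))
        previous = key
    return peak


def reserved_bytes(open_file_count, block_size=BLOCK_SIZE_BYTES):
    """Estimate reserved space as one full block per open file."""
    return open_file_count * block_size


rng = random.Random(0)
rows = [rng.randrange(200) for _ in range(10_000)]  # 200 hypothetical partitions
distinct = len(set(rows))
unsorted_peak = peak_open_files(rows, sorted_input=False)
sorted_peak = peak_open_files(rows, sorted_input=True)
print("unsorted:", unsorted_peak, "files,", reserved_bytes(unsorted_peak), "bytes")
print("sorted:  ", sorted_peak, "file,", reserved_bytes(sorted_peak), "bytes")
```

In this model the unsorted writer ends up with one open file per distinct partition key, while the sorted writer never holds more than one, which is the difference the setting is meant to produce.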

The change is only necessary when loading tpcds.store_sales. Adding it
to other dynamic partitioning inserts does not seem to be necessary. It
is likely that the issue only shows up when reading from an
unpartitioned table and inserting into a partitioned table. In this
case, loading tpcds.store_sales requires reading from
tpcds_unpartitioned.store_sales. The other dynamic partitioning inserts
all read from a partitioned table and write to a partitioned table.

This patch does not introduce a significant performance regression to
the runtime of data-load generation.

Testing:
* Ran core tests
* Ran core tests for Impala-EC

Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
---
M testdata/datasets/tpcds/tpcds_schema_template.sql
1 file changed, 2 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/15998/4

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
Couple things here:
1. Let's mention the original diskspace accounting issue
2. I thought the problem was that multiple partitions are being written simultaneously with one file per partition. Were we creating more than one file per partition?

Also, can we include some output comparing the runtimes for this versus before this change? Just this part of the dataload output:
14:08:53   Loading workload 'tpch' using exploration strategy 'core' OK (Took: 7 min 12 sec)
14:13:21   Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 11 min 40 sec)
14:27:07   Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 25 min 26 sec)


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
The frontend tests can be sensitive to dataload changes, and we don't run frontend tests on EC, so we'll need a normal core job.


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
This applies to all the Hive inserts. To my knowledge, only the insert into the text version of tpcds.store_sales needs this setting. Does the setting cost us anything or change anything for other tables?
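
One way to picture the narrower scope suggested here is a generator that prepends the setting only for tables on an explicit list. This is a hypothetical sketch, not code from generate-schema-statements.py: the allow-list, the function name, and the example insert statements are all invented for illustration, and the later patch sets of this change touch only tpcds_schema_template.sql rather than the Python script.

```python
# Setting text taken from the snippet under review.
SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;"

# Hypothetical allow-list of (database, table) pairs whose insert needs the sort.
TABLES_NEEDING_SORT = {("tpcds", "store_sales")}


def build_hive_insert(db_name, table_name, insert_sql):
    """Return the Hive statements for one table, adding the SET only if listed."""
    statements = []
    if (db_name, table_name) in TABLES_NEEDING_SORT:
        statements.append(SORT_DYNAMIC_PARTITION)
    # Normalize the statement so it ends with exactly one semicolon.
    statements.append(insert_sql.strip().rstrip(";") + ";")
    return "\n".join(statements)


# Illustrative insert statements, not copied from the real templates.
store_sales = build_hive_insert(
    "tpcds", "store_sales",
    "INSERT OVERWRITE TABLE tpcds.store_sales PARTITION (ss_sold_date_sk) "
    "SELECT * FROM tpcds_unpartitioned.store_sales")
other = build_hive_insert(
    "example_db", "example_table",
    "INSERT OVERWRITE TABLE example_db.example_table PARTITION (p) "
    "SELECT * FROM example_db.example_source;")
print(store_sales)
print(other)
```

Only the first result starts with the SET line; every other insert is emitted unchanged.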




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 May 2020 22:37:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 1:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5905/ DRY_RUN=true



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 May 2020 21:29:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
> Couple things here:
Updated the commit message. Yeah, it looks like it's just one file per partition, not multiple.

After I removed the hive.optimize.sort.dynamic.partition setting in generate-schema-statements.py, the runtime of data load hasn't really changed at all.


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
> The frontend tests can be sensitive to dataload changes, and we don't run f
Done


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
> This applies to all the Hive inserts. To my knowledge, only the insert into
I removed this, and it looks like all the tests pass. Removing it improves performance as well. Technically, the optimization should apply to all dynamic partition inserts, but I guess it makes the biggest difference when generating tpcds.store_sales, probably because generating tpcds.store_sales goes from an unpartitioned table to a partitioned table, whereas all the other queries go from partitioned to partitioned tables.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Sat, 30 May 2020 23:56:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 5: Verified+1



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jun 2020 20:43:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................

IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This sets hive.optimize.sort.dynamic.partition to true when loading
tpcds.store_sales. This option takes effect during Hive dynamic partitioning
inserts. It introduces a sort into the insert query so that all data is
sorted on the partition key. This allows the reducers to only open a single
file at a time when writing out files.

When this config is set to false, Hive will write to multiple partitions
at the same time. So a single Hive container will have multiple file
handles open at once. This can lead to OOM issues on the Hive side as well
as diskspace issues with HDFS. When a file is opened on HDFS, the
Namenode reserves an entire block for each file, even if the resulting
file is less than a block size. If there isn't enough disk space for all
file reservations, inserts will start failing because HDFS says there is
not enough capacity on the cluster.

The change is only necessary when loading tpcds.store_sales. Adding it
to other dynamic partitioning inserts does not seem to be necessary. It
is likely that the issue only shows up when reading from an
unpartitioned table and inserting into a partitioned table. In this
case, loading tpcds.store_sales requires reading from
tpcds_unpartitioned.store_sales. The other dynamic partitioning inserts
all read from a partitioned table and write to a partitioned table.
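
The hypothesis in the paragraph above can be pictured with a toy model, offered only as a sketch under stated assumptions: each task reads one input split, and a task must open one output file per distinct partition key it sees. The split sizes, partition counts, and function name are invented for the example, and real Hive split planning is more involved than this.

```python
import random


def peak_partitions_per_task(splits):
    """Return the largest number of distinct partition keys any one task sees."""
    return max(len(set(split)) for split in splits)


NUM_PARTITIONS = 50      # hypothetical number of partition keys
ROWS_PER_PARTITION = 20  # hypothetical rows per key
ROWS_PER_SPLIT = 100     # hypothetical rows read by one task

# Partitioned source: each split comes from one partition directory, so every
# row in a split carries the same key.
partitioned_splits = [[key] * ROWS_PER_PARTITION for key in range(NUM_PARTITIONS)]

# Unpartitioned source: rows for all keys are interleaved across the input.
all_rows = [key for split in partitioned_splits for key in split]
random.Random(0).shuffle(all_rows)
unpartitioned_splits = [all_rows[i:i + ROWS_PER_SPLIT]
                        for i in range(0, len(all_rows), ROWS_PER_SPLIT)]

partitioned_peak = peak_partitions_per_task(partitioned_splits)
unpartitioned_peak = peak_partitions_per_task(unpartitioned_splits)
print("partitioned source, peak keys per task:  ", partitioned_peak)
print("unpartitioned source, peak keys per task:", unpartitioned_peak)
```

Under these assumptions a task reading a partitioned source sees a single key, while a task reading the interleaved source sees many, which is consistent with the issue showing up only for the tpcds_unpartitioned.store_sales to tpcds.store_sales load.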

This patch does not introduce a significant performance regression to
the runtime of data-load generation.

Testing:
* Ran core tests
* Ran core tests for Impala-EC

Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Reviewed-on: http://gerrit.cloudera.org:8080/15998
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Sahil Takiar <st...@cloudera.com>
---
M testdata/datasets/tpcds/tpcds_schema_template.sql
1 file changed, 2 insertions(+), 0 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Sahil Takiar: Looks good to me, approved


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15998

to look at the new patch set (#3).

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................

IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts

This sets hive.optimize.sort.dynamic.partition to true by default during
data load. This option takes effect during Hive dynamic partitioning
inserts. It introduces a sort into the insert query so that all data is
sorted on the partition key. This allows the reducers to only open a single
file at a time when writing out files.

When this config is set to false, Hive will write to multiple partitions
at the same time. So a single Hive container will have multiple file
handles open at once. This can lead to OOM issues on the Hive side as well
as diskspace issues with HDFS. When a file is opened on HDFS, the
Namenode reserves an entire block for each file, even if the resulting
file is less than a block size. If there isn't enough disk space for all
file reservations, inserts will start failing because HDFS says there is
not enough capacity on the cluster.

This patch does not introduce a significant performance regression to
the runtime of data-load generation.

Testing:
* Ran core tests
* Ran core tests for Impala-EC

Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
---
M testdata/datasets/tpcds/tpcds_schema_template.sql
1 file changed, 2 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/15998/3

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>