Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2020/05/22 22:45:53 UTC

[Impala-ASF-CR] IMPALA-9777: Use Impala to do text tpcds.store_sales load

Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15980

to look at the new patch set (#2).

Change subject: IMPALA-9777: Use Impala to do text tpcds.store_sales load
......................................................................

IMPALA-9777: Use Impala to do text tpcds.store_sales load

tpcds.store_sales is populated by selecting from
tpcds.store_sales_unpartitioned. Currently, the insert statement
runs via Hive. Because a large number of partitions are being
created, this holds a large number of files open for writing at
the same time. Analysis of the namenode log shows that this peaks
at over 450 open files. Each open file reserves disk space
corresponding to the HDFS block size, even though the resulting
file is significantly smaller. As a result, the load currently
requires dozens of GB of free disk space to run successfully.
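
As a rough back-of-the-envelope check of the "dozens of GB" figure, the
sketch below multiplies the peak number of open files by a block size.
The 128 MiB block size is an assumption (a common HDFS default); the
value used in the test environment is not stated here, and replication
is ignored. The function name is made up for illustration.

```python
# Rough estimate of the disk space reserved by HDFS files open for writing.
# ASSUMPTION: 128 MiB block size; the real value is not given in this change.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024
PEAK_OPEN_FILES = 450  # "over 450" per the namenode log analysis


def reserved_gib(open_files, block_size_bytes=BLOCK_SIZE_BYTES):
    """Return the space reserved by open files, in GiB (one block per file)."""
    return open_files * block_size_bytes / float(1024 ** 3)


print("%.2f GiB" % reserved_gib(PEAK_OPEN_FILES))
```

Under that assumption, 450 open files reserve a bit over 56 GiB, which
is in line with the requirement described above.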

Impala's inserts are clustered: the input is sorted by partition
and the partitions are written one by one, so Impala does not
keep a large number of files open. Using Impala for these
inserts would reduce the reserved disk space requirement.
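
The effect of clustering on the number of open files can be sketched
with a toy model. This is an illustration only, not Impala or Hive
internals: it assumes an unclustered writer keeps every partition file
open until the input ends, while a clustered writer sorts by partition
key and holds only the current partition's file open. The function
name and the row data are hypothetical.

```python
import random


def peak_open_files(partition_keys, clustered):
    """Return the peak number of partition files open at the same time.

    Toy model (an assumption, not real writer internals):
      - unclustered: a file is opened per new partition key and stays
        open until the end of the input.
      - clustered: rows are sorted by partition key, so only the file
        for the current key is open.
    """
    keys = sorted(partition_keys) if clustered else list(partition_keys)
    open_files = set()
    peak = 0
    for key in keys:
        if clustered:
            # The previous partition's file is closed when the key changes.
            open_files = {key}
        else:
            open_files.add(key)
        peak = max(peak, len(open_files))
    return peak


# Hypothetical input: 10000 rows spread randomly over 450 partitions.
random.seed(0)
rows = [random.randrange(450) for _ in range(10000)]
print("unclustered peak:", peak_open_files(rows, clustered=False))
print("clustered peak:  ", peak_open_files(rows, clustered=True))
```

In this model the unclustered peak grows with the number of distinct
partitions in the input, while the clustered peak stays at one.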

This switches the inserts into the text version of tpcds.store_sales
to use Impala. It introduces a "LOAD_IMPALA" section that is
executed immediately after the Hive "LOAD" section.

The non-text versions of store_sales are not impacted. Since
the non-text versions are being created by selecting from the
text version, Hive can process one partition at a time and avoid
keeping many files open.

Testing:
 - Ran a core job
 - Processed namenode logs and verified reduced number of
   outstanding files
 - Ran an erasure coding job

Change-Id: Idfdfedd38a8001bdffd971cabd7df95020c88159
---
M bin/load-data.py
M testdata/bin/generate-schema-statements.py
M testdata/datasets/tpcds/tpcds_schema_template.sql
3 files changed, 26 insertions(+), 129 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/15980/2
-- 
To view, visit http://gerrit.cloudera.org:8080/15980
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idfdfedd38a8001bdffd971cabd7df95020c88159
Gerrit-Change-Number: 15980
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>