Posted to reviews@impala.apache.org by "David Knupp (Code Review)" <ge...@cloudera.org> on 2017/01/28 19:52:55 UTC

[Impala-ASF-CR] IMPALA-4482, IMPALA-4838: RECOVER PARTITIONS with tpcds.store_sales

Hello Internal Jenkins, Dimitris Tsirogiannis,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5177

to look at the new patch set (#8).

Change subject: IMPALA-4482, IMPALA-4838: RECOVER PARTITIONS with tpcds.store_sales
......................................................................

IMPALA-4482, IMPALA-4838: RECOVER PARTITIONS with tpcds.store_sales

This patch changes the way we load tpcds.store_sales test data. Before
this, we were relying on a force_reload to build the table partitions
based upon the data that had been copied over to HDFS from the warehouse
snapshot. This worked on the local mini-cluster, but for some reason,
it was selectively duplicating data when run on a remote cluster.

This patch doesn't solve the mystery of why data duplication occurs on
remote clusters, but it does resolve the immediate concern of loading
test data by using Impala's recover partitions feature to automatically
recognize the partitions in the HDFS directories. We just needed to add
an ALTER TABLE store_sales RECOVER PARTITIONS to the tpcds schema
template file.

This patch also changes the way we handle the ALTER sections of our
testdata schema template files (IMPALA-4838). Before, Hive didn't support
fully qualified table names with ALTER, but this is no longer the case.
We should allow for fully qualified names in any subsequent schema
template ALTER statements, but still remain backwards compatible with
existing ALTER statements that do not use fully-qualified names.
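
The backwards-compatible handling described above can be sketched roughly
as follows. This is only an illustration, not the code in
generate-schema-statements.py: the helper name qualify_alter_statement,
its parameters, and the regular expression are all hypothetical. The idea
is to prefix the database name only when the ALTER TABLE statement names
an unqualified table, and to leave already-qualified names alone.

```python
import re

# Hypothetical pattern: captures the "ALTER TABLE " prefix and the table name.
# The real script may parse ALTER sections differently.
_ALTER_RE = re.compile(r"^(\s*ALTER\s+TABLE\s+)([\w.]+)", re.IGNORECASE)


def qualify_alter_statement(stmt, db_name):
    """Return stmt with its table name prefixed by db_name if unqualified."""
    match = _ALTER_RE.match(stmt)
    if match is None:
        return stmt  # not an ALTER TABLE statement; leave untouched
    prefix, table = match.groups()
    if "." in table:
        return stmt  # already fully qualified; keep as written
    # Unqualified name: insert "<db_name>." in front of the table name.
    return prefix + db_name + "." + table + stmt[match.end():]


if __name__ == "__main__":
    print(qualify_alter_statement(
        "ALTER TABLE store_sales RECOVER PARTITIONS", "tpcds"))
    print(qualify_alter_statement(
        "ALTER TABLE tpcds.store_sales RECOVER PARTITIONS", "tpcds"))
```

Under these assumptions, both calls yield the same fully qualified
statement, which is the behavior the template files rely on.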

Tested by dropping the tpcds table from a remote cluster setup,
reloading the table, and running the tests in test_tpcds_queries.py.
Tests that had been failing before are now passing.

Also loaded tpcds.store_sales with and without using a fully qualified
name in the ALTER TABLE statement, and checked table stats to confirm
the results were the same in either case.

As a final check, a pre-review test run was attempted on the upstream
Jenkins server.

Change-Id: Iaae97d1d44201aeeacacdd39adbae35753512950
---
M testdata/bin/generate-schema-statements.py
M testdata/datasets/tpcds/tpcds_schema_template.sql
2 files changed, 8 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/5177/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaae97d1d44201aeeacacdd39adbae35753512950
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Harrison Sheinblatt <hs...@hotmail.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>