You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Qifan Chen (Code Review)" <ge...@cloudera.org> on 2020/06/19 16:10:07 UTC
[Impala-ASF-CR] WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Qifan Chen has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16098
Change subject: WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table. The row count can be incorrectly computed as 0,
even though there exists data in some partitions of the Hive table.
CDPD-12560 documents a form of corruption in partition stats in Hive
tables that contributes to this limitation in Impala: the row count of
a partition is set to 0 even though the partition size is a positive
value. The corruption can only happen when hive.stats.autogather=true
during both table creation and table loading.
In the fix, as long as no partition in a Hive table exhibits any stats
corruptions including the type described above, the total row count for
the table is computed from the row counts in all partitions. Otherwise,
Impala estimates the total row count from the total size of the partitions
and the row width if feasible.
Testing:
1. Ran unit tests with queries documented in the case against Hive tables
with the following configrations:
a. No stats corruption in any partitions;
b. Stats corruption in some partitions;
c. Stats corruption in all partitions.
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
1 file changed, 10 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/1
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 1
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 25:
> Uploaded patch set 25.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Jul 2020 15:51:52 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 25:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/16098/25/tests/metadata/test_explain.py
File tests/metadata/test_explain.py:
http://gerrit.cloudera.org:8080/#/c/16098/25/tests/metadata/test_explain.py@132
PS25, Line 132: # Set the number of rows at the table level to -1.
: self.execute_query(
: "alter table %s set tblproperties('numRows'='-1')" % mixed_tbl)
just curious why this is necessary?
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jul 2020 17:38:12 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table. The row count can be incorrectly computed as 0,
even though there exists data in partitions of the Hive table.
CDPD-12560 documents a form of corruption in partition stats in Hive
tables that contributes to this limitation in Impala: the row count of
a partition is set to 0 even though the partition size is a positive
value. The corruption can only happen when hive.stats.autogather=true
during both table creation and loading.
In the fix, as long as no partition in a Hive table exhibits any stats
corruptions including the kind described above, the total row count for
the table is computed from the row counts in all partitions. Otherwise,
Impala estimates the total row count from the total size of the partitions
and the row width, if feasible.
Testing:
1. Ran unit tests with queries documented in the case against Hive tables
with the following configrations:
a. No stats corruption in any partitions;
b. Stats corruption in some partitions;
c. Stats corruption in all partitions.
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
1 file changed, 11 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/3
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table. The row count can be incorrectly computed as 0,
even though there exists data in partitions of the Hive table.
In the fix, as long as no partition in a Hive table exhibits any stats
corruptions including the kind described above, the total row count for
the table is computed from the row counts in all partitions. Otherwise,
Impala estimates the total row count from the total size of the partitions
and the row width, if feasible.
Testing:
1. Ran unit tests with queries documented in the case against Hive tables
with the following configrations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added a new test in test_compute_stats.py:
test_corrupted_stats_impala_9744
3. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A testdata/data/alltypes_parquet_year2009_month01.parquet
M tests/metadata/test_compute_stats.py
3 files changed, 72 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/6
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 6
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
[Impala-ASF-CR] WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 3:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6381/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jun 2020 16:48:55 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 23: Code-Review+2
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 27 Jul 2020 13:43:03 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#31). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0, even though there exists data in the Hive table. This
is the stats corruption at table level. Similar stats corruption
exists for a partition. The row count of a table or a partition
sometime can also be -1 which indicates a missing stats situation.
In the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts in all partitions. Otherwise, Impala looks at
the table level stats particularly the table row count.
In addition, if the table stats is missing or corrupted, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row count from the partitions with good stats, and
an estimation of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partition
has corrupt stats.
One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'. At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.
Testing:
1. Ran unit tests with queries documented in the case against Hive
tables with the following configrations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
a. test_corrupted_stats_in_partitioned_Hive_tables
b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test, stats-extrapolation.test and
test_compute_stats.py
4. Introduce a new filter
DO_NOT_VALIDATE_ROWCOUNT_ESTIMATION_FOR_PARTITIONS in Planner tests
5. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-planner/queries/PlannerTest/bloom-filter-assignment.test
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering-disabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/union.test
M testdata/workloads/functional-query/queries/QueryTest/corrupt-stats.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/metadata/test_compute_stats.py
M tests/metadata/test_explain.py
16 files changed, 283 insertions(+), 95 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/31
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 31
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 9:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@198
PS9, Line 198:
flake8: W291 trailing whitespace
http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@198
PS9, Line 198: # autogathering. The partition stats is corrupted: row count=0 and row
line has trailing whitespace
http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@240
PS9, Line 240: #
flake8: E303 too many blank lines (2)
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jun 2020 17:50:26 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0, even though there exists data in the Hive table. This
is the stats corruption at table level. Similar stats corruption
exists for a partition. The row count of a table or a partition
sometime can also be -1 which indicates a missing stats situation.
In the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts in all partitions. Otherwise, Impala looks at
the table level stats particularly the table row count.
In addition, if the table stats is missing or corrupted, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row count from the partitions with good stats, and
an estimation of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partition
has missing or corrupt stats.
One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'. At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.
Testing:
1. Ran unit tests with queries documented in the case against Hive
tables with the following configrations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
a. test_corrupted_stats_in_partitioned_Hive_tables
b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test
4. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/corrupt-stats.test
M tests/metadata/test_compute_stats.py
3 files changed, 185 insertions(+), 37 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/22
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 17:
(11 comments)
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1178
PS14, Line 1178: if (partNumRows > -1) {
: if (partitionNumRows_ == -1) partitionNumRows_ = 0;
:
> Done
ping - this is still an issue. it should be
} else if (partNumRows > -1) {
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1189
PS14, Line 1189:
> This is an extra protection for the #partitions in the loop above ever bein
Done
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1200
PS14, Line 1200: table size from those partitions with bad r
> Yes, it is needed for single-partition tables.
but non-partitioned tables (e.g. getNumClusteringCols == 0), have a single FeFsPartition that basically represents the entire table. the for loop about over the partitions_ table should still set hasCorruptTableStats_ even for non-partitioned tables.
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1201
PS14, Line 1201: n
> hasCorruptTableStats_ here actually means that some partition is of bad sta
sorry, I meant shouldn't it check of any partitions have either corrupt or missing stats? right now it only checks for corrupt stats, but shouldn't it check if there are missing stats for any partitions as well?
http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1187
PS17, Line 1187: if (partitionsWithCorruptOrMissingStats.size() == 0 && numPartitionsWithNumRows_ > 0)
: return partitionNumRows_;
needs curly braces around the body of the if statement
http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244
PS17, Line 1244: p:
nit should be "p :"
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@174
PS17, Line 174:
we need to add a test for the mixed case for partitioned tables - where some partitions have good stats and other have missing / corrupt stats
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@180
PS17, Line 180: Hive
entire test name should be all lower case
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@197
PS17, Line 197: CREATE TABLE {0}.{1} (
: id int COMMENT 'Add a comment',
: bool_col boolean,
: tinyint_col tinyint,
: smallint_col smallint,
: int_col int,
: bigint_col bigint,
: float_col float,
: double_col double,
: date_string_col string,
: string_col string,
: timestamp_col timestamp,
: year int,
: month int )
: PARTITIONED BY (decade string)
: STORED AS PARQUET;
I know we discussed this before, but I still don't understand why you can't create the table in Impala. it doesn't really matter, but I made the following change locally and the test still passes:
diff --git a/tests/metadata/test_compute_stats.py b/tests/metadata/test_compute_stats.py
index c477773..87ecc37 100644
--- a/tests/metadata/test_compute_stats.py
+++ b/tests/metadata/test_compute_stats.py
@@ -193,23 +193,11 @@ class TestComputeStats(ImpalaTestSuite):
# Setting hive.stats.autogather=true after CRTB DDL but before LOAD DML
# minimally reproduces the corrupt stats issue.
+ impala_create = "CREATE TABLE {0}.{1} like parquet '{2}' PARTITIONED BY (decade string)".format(
+ unique_database, table_name, "file://" + local_file)
+ self.execute_query(impala_create)
+
create_load_data_stmts = """
- CREATE TABLE {0}.{1} (
- id int COMMENT 'Add a comment',
- bool_col boolean,
- tinyint_col tinyint,
- smallint_col smallint,
- int_col int,
- bigint_col bigint,
- float_col float,
- double_col double,
- date_string_col string,
- string_col string,
- timestamp_col timestamp,
- year int,
- month int )
- PARTITIONED BY (decade string)
- STORED AS PARQUET;
set hive.stats.autogather=true;
load data local inpath '{2}' into table {0}.{1} partition (decade="2020s");
""".format(unique_database, table_name, local_file)
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@231
PS17, Line 231: assert("cardinality=[1-9][0-9]*" in explain_result.data[i + 2])
there should be some validation of the output for "partitions: %s/%s rows=%s" as well.
http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@247
PS17, Line 247: # Load from a local data file
most of the code in this test method and the above is the same, it should be re-factored into a private helper method to avoid code duplication between the two methods
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 17
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jul 2020 18:50:43 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 36:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6896/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 36
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 12 Aug 2020 13:46:29 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 28:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/16098/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1199
PS25, Line 1199: table size from thos
> right - I think there can be arguments made for one behavior vs. the other.
Yeah. I sensed the same. On the other hand, we could also think like this.
Tables and partitions form a hierarchy and if the table has a valid RC (as in this case through ALTER TABLE), then the table does not have missing stats, regardless of the situations at the partition level.
The current IF basically enforces the following:
1. If table has missing stats, estimate;
2. If table has good stats but some partition has corrupt stats, estimate.
Maybe this is a good comprise such that the intention to set RC at the table level is preserved?
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Aug 2020 16:47:12 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 22:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6639/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 18 Jul 2020 03:13:03 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 35:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/16098/35/fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
File fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java:
http://gerrit.cloudera.org:8080/#/c/16098/35/fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java@840
PS35, Line 840: _FOR_PARTITIONS
general question, how come the estimates are only updated for partitioned tables?
http://gerrit.cloudera.org:8080/#/c/16098/35/fe/src/test/java/org/apache/impala/testutil/TestUtils.java
File fe/src/test/java/org/apache/impala/testutil/TestUtils.java:
http://gerrit.cloudera.org:8080/#/c/16098/35/fe/src/test/java/org/apache/impala/testutil/TestUtils.java@153
PS35, Line 153: r
nit: should be upper case
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 35
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Aug 2020 21:06:07 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 10:
(4 comments)
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> Yes by a single partitioned table I mean a non-partitioned table. We use th
I need to spend some time understanding this code some more, I'll get back to you on this.
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed
: // by all of them, as the row count for the table.
> Yes, I agree that treating the missing and corrupt stats the same is a good
I agree that could be an issue, but I think that is currently by design. As per the code comments,
// Ignore partitions with missing stats in the hope they don't matter
// enough to change the planning outcome.
I think it might make sense to change that, but as Tim said I think we should keep the current patch simple and focused on the problem described in the JIRA. We can make this change in a follow up patch.
I also think that if we were to make this change, we should do something slightly different. The approach you have currently implemented will fallback to the data-size based estimate whenever a single partition has corrupt stats. The problem is if you have a table with 100 partitions. 99 of which of correct stats, and 1 of which has corrupt stats. the patch will then fallback to the data-size based estimate and ignore all stats from the 99 partitions with correct stats. a better approach would be to still continue to use the stats from the 99 partitions, and just do the data-size based estimate for the single partition with corrupt stats. I think this would be considered a follow up to IMPALA-7608, which adds the logic for handling unpartitioned tables or partitioned tables where all partitions have bad stats, but does not handle the case where a partitioned tables has some good stats, and some bad stats. however, again lets tackle this is a different patch. feel free to file a follow up JIRA.
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199: CREATE TABLE {0}.{1} (
> Thus, if the objective is to test against a Hive table, the current version does the job.
Is that the objective? I thought the objective is to test against corrupt stats created by Hive's load data local inpath query? I guess I don't understand what is different whether you use a table created via Hive vs. Impala.
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215: set hive.stats.autogather=true;
> Maybe we add a comment here? The rational is to call out the condition to r
so if hive.stats.autogather is false, does the behavior here change?
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jul 2020 18:59:02 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 32:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6865/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 32
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Aug 2020 04:54:17 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 10:
(4 comments)
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> Without removing it, we can not use the containing logic to detect corrupte
I thought that getNumClusteringCols.size() == number of partition columns for a table? do you mean non-partitioned tables?
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed
: // by all of them, as the row count for the table.
Responding to Tim's comments:
> My original suggestion was to only fall back when the total row count was 0.
So to summarize, the goal here (or at least original intention of this JIRA) is that when you have a partitioned table, and the total row count of the table is calculated to be 0, but some of the partitions are "corrupt", then treat the stats as "missing", in which case we fall back to the estimation based on the total table size + average row size?
Where "corrupt" stats are defined as stats with a value less than -1 or where the numRows == 0 but the partition size != 0.
Where "missing" stats are defined as stats with the value = -1.
> I think we could also just treat corrupt partition stats the same as missing partition stats - i.e. ignore them as long as we have some valid stats.
Yeah that seems reasonable. For missing stats, I think that is the current behavior. The code says "Ignore partitions with missing stats in the hope they don't matter enough to change the planning outcome."
My vote is to follow the "treat corrupt partition stats the same as missing partition stats" as that is more generic and should simpler and safer to implement.
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199: CREATE TABLE {0}.{1} (
> Yes, I would agree. The disadvantage though is that we may miss the code pa
Isn't it the load data local inpath query that is setting the stats to a corrupt value? not the create table statement itself?
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215: set hive.stats.autogather=true;
> Yes. This is pre-condition for Hive to create corrupted stats.
It's set to true by default in Hive already: https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2536 so I don't think you need to set it to true again here.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 00:33:28 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table. The row count can be incorrectly computed as 0,
even though there exists data in partitions of the Hive table.
In the fix, as long as no partition in a Hive table exhibits any stats
corruptions including the kind described above, the total row count for
the table is computed from the row counts in all partitions. Otherwise,
Impala estimates the total row count from the total size of the partitions
and the row width, if feasible.
Testing:
1. Ran unit tests with queries documented in the case against Hive tables
with the following configrations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added a new test in test_compute_stats.py:
test_corrupted_stats_impala_9744
3. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A testdata/data/alltypes_parquet_year2009_month01.parquet
M tests/metadata/test_compute_stats.py
3 files changed, 72 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/5
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 5
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 20:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/16098/20/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/20/tests/metadata/test_compute_stats.py@176
PS20, Line 176: n
flake8: E501 line too long (95 > 90 characters)
http://gerrit.cloudera.org:8080/#/c/16098/20/tests/metadata/test_compute_stats.py@250
PS20, Line 250: l
flake8: E501 line too long (98 > 90 characters)
http://gerrit.cloudera.org:8080/#/c/16098/20/tests/metadata/test_compute_stats.py@294
PS20, Line 294: l
flake8: E501 line too long (98 > 90 characters)
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 20
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jul 2020 22:33:55 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 24:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6185/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 27 Jul 2020 13:43:26 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 37:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6277/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 37
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 12 Aug 2020 15:23:33 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#27). ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
This work addresses the current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0, even though there exists data in the Hive table. This
is the stats corruption at table level. Similar stats corruption
exists for a partition. The row count of a table or a partition
sometime can also be -1 which indicates a missing stats situation.
In the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts in all partitions. Otherwise, Impala looks at
the table level stats particularly the table row count.
In addition, if the table stats is missing or corrupted, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row count from the partitions with good stats, and
an estimation of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partition
has missing or corrupt stats.
One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'. At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.
Testing:
1. Ran unit tests with queries documented in the case against Hive
tables with the following configrations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
a. test_corrupted_stats_in_partitioned_Hive_tables
b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test
4. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-planner/queries/PlannerTest/bloom-filter-assignment.test
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering-disabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/union.test
M testdata/workloads/functional-query/queries/QueryTest/corrupt-stats.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/metadata/test_compute_stats.py
M tests/metadata/test_explain.py
13 files changed, 236 insertions(+), 82 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/27
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 14:
> Uploaded patch set 14.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Jul 2020 00:01:49 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: WIP IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 1:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6379/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 1
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jun 2020 16:38:18 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 39:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6900/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 39
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 12 Aug 2020 22:19:40 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 25:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/6728/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Jul 2020 16:17:44 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 5:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/16098/5/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/5/tests/metadata/test_compute_stats.py@232
PS5, Line 232: +
flake8: E226 missing whitespace around arithmetic operator
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 5
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 20:22:35 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 11:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed
: // by all of them, as the row count for the table.
> I agree that could be an issue, but I think that is currently by design. As
Yeah, it is good to reuse the good RCs :-). I have also realize the possibility but did not materialize it.
IMHO, this new logic is very easy to add on top of the fix. So I feel maybe we should just tackle the improvement in the fix to minimize the turn-around time.
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:
http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215: # Make the table visible in Impala.
> so if hive.stats.autogather is false, does the behavior here change?
In my experiment, if the control is set to false, we will not see the problem.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jul 2020 20:42:44 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 40: Code-Review+2
Carrying +2.
--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 40
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 12 Aug 2020 22:02:09 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................
Patch Set 6:
(5 comments)
adding Tim to the review as well
http://gerrit.cloudera.org:8080/#/c/16098/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:
http://gerrit.cloudera.org:8080/#/c/16098/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1218
PS6, Line 1218: if (numRows < -1 || (numRows == 0 && tbl_.getTotalHdfsBytes() > 0)) {
: hasCorruptTableStats_ = true;
: }
: return numRows;
what about for unpartitioned tables? it looks like if the table stats are corrupted, and numRows == 0, we still return numRows = 0 from this method.
would be good to add a test case for unpartitioned tables as well.
http://gerrit.cloudera.org:8080/#/c/16098/6/testdata/data/alltypes_parquet_year2009_month01.parquet
File testdata/data/alltypes_parquet_year2009_month01.parquet:
http://gerrit.cloudera.org:8080/#/c/16098/6/testdata/data/alltypes_parquet_year2009_month01.parquet@1
PS6, Line 1: PAR1°ºLì Ø ô×a b c d e f g h i j k l m n o p q r s N O P Q R S T U V W X Y Z [ \ ] ^ _ ` t u v w x y z { | } ~
! "