Posted to issues@impala.apache.org by "bharath v (JIRA)" <ji...@apache.org> on 2017/07/05 22:29:00 UTC
[jira] [Created] (IMPALA-5615) Compute Incremental stats is broken for general partition expressions
bharath v created IMPALA-5615:
---------------------------------
Summary: Compute Incremental stats is broken for general partition expressions
Key: IMPALA-5615
URL: https://issues.apache.org/jira/browse/IMPALA-5615
Project: IMPALA
Issue Type: Project
Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
Reporter: bharath v
Assignee: bharath v
Priority: Critical
It turns out that the logic in ComputeStatsStmt#analyze() doesn't work well with general partition expressions. A simple repro is as follows:
{noformat}
1) Prepare test data:
create table pp(c int) partitioned by (p1 int, p2 int);
insert into pp partition (p1=10, p2) select 1, 1;
insert into pp partition (p1=10, p2) select 2,2;
2) Generate correct stats:
compute stats pp;
show table stats pp;
Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.02s
3) Reproduce the issue:
compute incremental stats pp partition (p1=10);
show table stats pp;
Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.01s
{noformat}
The bug is in the child queries generated by the incremental stats query.
{noformat}
SELECT NDV_NO_FINALIZE(c) AS c, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE), COUNT(c), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2
SELECT COUNT(*), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2
{noformat}
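Specifically, the problem is in the generated filter predicate, {{((p1=10 AND p2=1) AND (p1=10 AND p2=2))}}: the two per-partition conditions are joined with AND, so no row can satisfy it (p2 cannot be both 1 and 2) and the child queries match nothing. It turns out that ComputeStatsStmt#analyze() is broken due to IMPALA-1654, and we need to rewrite the logic to support general partition expressions based on {{PartitionSet}}.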
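To see why that predicate yields no rows, here is a minimal sketch that replays the second child query against an in-memory SQLite table. SQLite is only a stand-in for Impala, and the partition columns are modeled as ordinary columns. The OR variant is an assumption about what the predicate was presumably meant to express, not the confirmed fix.

```python
import sqlite3

# Stand-in for the Impala table: SQLite has no partitions, so p1 and p2
# are modeled as ordinary columns (an assumption made for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pp (c INT, p1 INT, p2 INT)")
conn.executemany("INSERT INTO pp VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 2)])

# Same shape as the COUNT(*) child query; ORDER BY added for stable output.
QUERY = "SELECT COUNT(*), p1, p2 FROM pp WHERE {} GROUP BY p1, p2 ORDER BY p2"

# Predicate as it appears in the generated child queries.
and_pred = "((p1=10 AND p2=1) AND (p1=10 AND p2=2))"
# Hypothetical alternative: the same per-partition conditions joined with OR.
or_pred = "((p1=10 AND p2=1) OR (p1=10 AND p2=2))"

and_rows = conn.execute(QUERY.format(and_pred)).fetchall()
or_rows = conn.execute(QUERY.format(or_pred)).fetchall()

print(and_rows)  # []
print(or_rows)   # [(1, 10, 1), (1, 10, 2)]
```

With AND, no group is returned at all, which is consistent with the per-partition row counts dropping to 0 in the repro above.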
Workaround: Avoid general partition expressions and use a full partition spec instead (for example, {{partition (p1=10, p2=1)}}), i.e., run compute incremental stats for one partition at a time.
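The workaround can be scripted. The sketch below builds one statement per full partition spec for the two partitions from the repro. The helper name {{per_partition_statements}} is hypothetical, and it assumes integer partition values (string values would need quoting).

```python
def per_partition_statements(table, partitions):
    """Build one COMPUTE INCREMENTAL STATS statement per full partition spec.

    Hypothetical helper: assumes every partition value is an integer,
    so no quoting or escaping is applied.
    """
    stmts = []
    for spec in partitions:
        spec_sql = ", ".join(f"{col}={val}" for col, val in spec.items())
        stmts.append(f"compute incremental stats {table} partition ({spec_sql});")
    return stmts


# The two partitions created in the repro.
statements = per_partition_statements("pp", [{"p1": 10, "p2": 1}, {"p1": 10, "p2": 2}])
for stmt in statements:
    print(stmt)
# compute incremental stats pp partition (p1=10, p2=1);
# compute incremental stats pp partition (p1=10, p2=2);
```

Each printed statement names every partition column, so it does not go through the general partition expression path described above.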