Posted to issues@impala.apache.org by "bharath v (JIRA)" <ji...@apache.org> on 2017/07/05 22:29:00 UTC

[jira] [Created] (IMPALA-5615) Compute Incremental stats is broken for general partition expressions

bharath v created IMPALA-5615:
---------------------------------

             Summary: Compute Incremental stats is broken for general partition expressions
                 Key: IMPALA-5615
                 URL: https://issues.apache.org/jira/browse/IMPALA-5615
             Project: IMPALA
          Issue Type: Bug
    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
            Reporter: bharath v
            Assignee: bharath v
            Priority: Critical


It turns out that the logic in ComputeStatsStmt#analyze() doesn't work well with general partition expressions. A simple repro is as follows:

{noformat}

1) Prepare test data:

create table pp(c int) partitioned by (p1 int, p2 int);
insert into pp partition (p1=10, p2) select 1, 1;
insert into pp partition (p1=10, p2) select 2,2;

2) Generate correct stats:
compute stats pp;
show table stats pp;

Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.02s

3) Reproduce the issue:
compute incremental stats pp partition (p1=10);
show table stats pp;

Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.01s
{noformat}

The bug is in the child queries generated by the incremental stats query.

{noformat}
SELECT NDV_NO_FINALIZE(c) AS c, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE), COUNT(c), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2

SELECT COUNT(*), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2
{noformat}

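Specifically, the problem is in the generated filter predicate, {{((p1=10 AND p2=1) AND (p1=10 AND p2=2))}}. The per-partition conjuncts are ANDed together, so the predicate requires p2=1 and p2=2 at the same time and can never match a row, which is consistent with the row counts dropping to 0 in the repro above. It turns out that ComputeStatsStmt#analyze() is broken due to IMPALA-1654, and we need to rewrite the logic to support general partition expressions based on {{PartitionSet}}.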
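
To see the effect of that predicate in isolation, here is a minimal sketch that replays the COUNT(*) child query against an in-memory SQLite table. This is only a stand-in, not Impala: SQLite has no partitions, so p1 and p2 are modeled as ordinary columns, and the OR-combined predicate is an assumption about what the child query was meant to select, not the actual fix.

```python
import sqlite3

# Stand-in for the Impala table pp (assumption: partition columns p1 and p2
# are modeled as plain columns because SQLite has no partitioning).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pp (c INT, p1 INT, p2 INT)")
# Same two rows as the repro: (c=1, p1=10, p2=1) and (c=2, p1=10, p2=2).
conn.executemany("INSERT INTO pp VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 2)])

COUNT_QUERY = (
    "SELECT COUNT(*), p1, p2 FROM pp WHERE {pred} "
    "GROUP BY p1, p2 ORDER BY p1, p2"
)

# Predicate as it appears in the generated child queries: it demands
# p2=1 and p2=2 simultaneously, so no row can satisfy it.
and_pred = "((p1=10 AND p2=1) AND (p1=10 AND p2=2))"
# Assumed intent (hypothetical): match either of the selected partitions.
or_pred = "((p1=10 AND p2=1) OR (p1=10 AND p2=2))"

and_rows = conn.execute(COUNT_QUERY.format(pred=and_pred)).fetchall()
or_rows = conn.execute(COUNT_QUERY.format(pred=or_pred)).fetchall()

print("AND-combined:", and_rows)
print("OR-combined: ", or_rows)
```

With the AND-combined predicate the query returns no groups at all, so there is no per-partition row count to record. The OR-combined form returns one row per partition.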

Workaround: Don't use general partition expressions; instead, use a full partition spec, i.e., run compute incremental stats for one partition at a time, e.g. {{compute incremental stats pp partition (p1=10, p2=1)}}.


