Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/09/28 07:03:00 UTC

[jira] [Created] (IMPALA-10932) Make sure all kinds of simple predicates on bool columns are pushed down

Quanlong Huang created IMPALA-10932:
---------------------------------------

             Summary: Make sure all kinds of simple predicates on bool columns are pushed down
                 Key: IMPALA-10932
                 URL: https://issues.apache.org/jira/browse/IMPALA-10932
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


When scanning Parquet/ORC tables, we push down binary predicates like "x < 1" to leverage the file-level statistics. However, predicates on bool columns may not be in this form. They could be "{{x}}", "{{NOT x}}", "{{x IS [NOT] TRUE}}", or "{{x IS [NOT] FALSE}}".
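
To make the relationship between these forms and the binary form concrete, here is a minimal Python sketch of SQL three-valued logic. It only checks that each bool-specific form keeps the same rows as a comparison against a literal (plus an {{IS NULL}} branch for the {{IS NOT}} forms) when used as a WHERE filter. The helper names ({{passes_where}}, {{sql_eq}}, {{equivalent_as_filter}}, and so on) are made up for illustration; this is not Impala's rewrite logic.

```python
from typing import Callable, Dict, Optional

Bool3 = Optional[bool]  # None stands for SQL NULL


def passes_where(result: Bool3) -> bool:
    """A WHERE clause keeps a row only when the predicate evaluates to TRUE."""
    return result is True


def sql_not(v: Bool3) -> Bool3:
    """NOT under three-valued logic: NOT NULL is NULL."""
    return None if v is None else (not v)


def sql_eq(v: Bool3, literal: bool) -> Bool3:
    """v = literal under three-valued logic: comparing NULL yields NULL."""
    return None if v is None else (v == literal)


def sql_or(a: Bool3, b: Bool3) -> Bool3:
    """OR under three-valued logic."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


# The bool-specific forms listed in the issue. IS [NOT] TRUE/FALSE never
# return NULL, so plain Python identity checks model them directly.
ORIGINAL: Dict[str, Callable[[Bool3], Bool3]] = {
    "x": lambda v: v,
    "NOT x": sql_not,
    "x IS TRUE": lambda v: v is True,
    "x IS NOT TRUE": lambda v: v is not True,
    "x IS FALSE": lambda v: v is False,
    "x IS NOT FALSE": lambda v: v is not False,
}

# Candidate equivalents built only from "x = <literal>" and "x IS NULL".
REWRITTEN: Dict[str, Callable[[Bool3], Bool3]] = {
    "x": lambda v: sql_eq(v, True),
    "NOT x": lambda v: sql_eq(v, False),
    "x IS TRUE": lambda v: sql_eq(v, True),
    "x IS NOT TRUE": lambda v: sql_or(sql_eq(v, False), v is None),
    "x IS FALSE": lambda v: sql_eq(v, False),
    "x IS NOT FALSE": lambda v: sql_or(sql_eq(v, True), v is None),
}


def equivalent_as_filter(form: str) -> bool:
    """True if both versions keep exactly the same rows for TRUE, FALSE and NULL."""
    return all(
        passes_where(ORIGINAL[form](v)) == passes_where(REWRITTEN[form](v))
        for v in (True, False, None)
    )


if __name__ == "__main__":
    for form in ORIGINAL:
        print(form, "->", equivalent_as_filter(form))
```

The point of the sketch is that the {{IS NOT TRUE}} and {{IS NOT FALSE}} forms also accept NULL rows, so they cannot be treated as a plain equality when deciding what to push down.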

Note that dictionary predicates cover some of these forms, but not all of them. For instance, here the predicate appears in the dictionary predicates:
{code:sql}
set explain_level=2;
explain select count(*) from functional_parquet.alltypessmall where bool_col;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
|    HDFS partitions=4/4 files=4 size=14.76KB                    |
|    predicates: bool_col                                        |
|    stored statistics:                                          |
|      table: rows=unavailable size=unavailable                  |
|      partitions: 0/4 rows=939                                  |
|      columns: unavailable                                      |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
|    parquet dictionary predicates: bool_col                     |
{code}
Here we still have the predicate in dictionary predicates:
{code:sql}
explain select count(*) from functional_parquet.alltypessmall where bool_col is true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
|    HDFS partitions=4/4 files=4 size=14.76KB                    |
|    predicates: istrue(bool_col)                                |
|    stored statistics:                                          |
|      table: rows=unavailable size=unavailable                  |
|      partitions: 0/4 rows=939                                  |
|      columns: unavailable                                      |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
|    parquet dictionary predicates: istrue(bool_col)             |
{code}
But here we don't have any predicates pushed down to stats or dictionary:
{code:sql}
explain select count(*) from functional_parquet.alltypessmall where bool_col is not true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]             |
|    HDFS partitions=4/4 files=4 size=14.76KB                         |
|    predicates: isnottrue(bool_col)                                  |
|    stored statistics:                                               |
|      table: rows=unavailable size=unavailable                       |
|      partitions: 0/4 rows=939                                       |
|      columns: unavailable                                           |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=1B cardinality=94                           |
|    in pipelines: 00(GETNEXT)                                        |
{code}
If we use the unusual form "x < TRUE", the predicate appears in both the statistics predicates and the dictionary predicates:
{code:sql}
explain select count(*) from functional_parquet.alltypessmall where bool_col < true;
| 00:SCAN HDFS [functional_parquet.alltypessmall]                     |
|    HDFS partitions=4/4 files=4 size=14.76KB                         |
|    predicates: bool_col < TRUE                                      |
|    stored statistics:                                               |
|      table: rows=unavailable size=unavailable                       |
|      partitions: 0/4 rows=939                                       |
|      columns: unavailable                                           |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
|    parquet statistics predicates: bool_col < TRUE                   |
|    parquet dictionary predicates: bool_col < TRUE                   |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
{code}
Usually, we don't use this form for bool columns, so we should handle the forms mentioned above as well.
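
As a rough illustration of what pushing these forms down to file-level statistics could buy, the Python sketch below decides whether a row group can be skipped given min/max values and a null count for a bool column. Everything here is an assumption made for the example: the {{BoolColumnStats}} class, the {{can_skip}} function, and the convention that a missing min/max means every value is NULL. It is not Impala's scanner code, and real statistics can also be absent entirely, in which case nothing may be skipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoolColumnStats:
    """Hypothetical per-row-group statistics for a bool column.

    Assumption: min_value/max_value are None only when every value is NULL.
    """
    min_value: Optional[bool]
    max_value: Optional[bool]
    null_count: int


# Forms satisfied by TRUE rows, by FALSE rows, and by NULL rows respectively.
NEEDS_TRUE = {"x", "x IS TRUE", "x IS NOT FALSE"}
NEEDS_FALSE = {"NOT x", "x IS FALSE", "x IS NOT TRUE"}
NULLS_QUALIFY = {"x IS NOT TRUE", "x IS NOT FALSE"}


def may_contain(stats: BoolColumnStats, value: bool) -> bool:
    """True if the min/max range does not rule out `value` (False < True)."""
    if stats.min_value is None or stats.max_value is None:
        return False  # all-NULL row group under the stated assumption
    return stats.min_value <= value <= stats.max_value


def can_skip(stats: BoolColumnStats, form: str) -> bool:
    """True if no row in the row group can satisfy the given predicate form."""
    if form in NEEDS_TRUE:
        wanted = True
    elif form in NEEDS_FALSE:
        wanted = False
    else:
        raise ValueError(f"unsupported predicate form: {form!r}")

    if may_contain(stats, wanted):
        return False
    if form in NULLS_QUALIFY and stats.null_count > 0:
        return False  # NULL rows satisfy IS NOT TRUE / IS NOT FALSE
    return True


if __name__ == "__main__":
    all_false = BoolColumnStats(min_value=False, max_value=False, null_count=0)
    print(can_skip(all_false, "x"))              # no TRUE values present
    print(can_skip(all_false, "x IS NOT TRUE"))  # FALSE rows qualify
```

Under these assumptions, a row group whose values are all FALSE with no NULLs can be skipped for {{x}}, {{x IS TRUE}} and {{x IS NOT FALSE}}, while a non-zero null count keeps it alive for the {{IS NOT}} forms.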



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
