Posted to issues-all@impala.apache.org by "Qifan Chen (Jira)" <ji...@apache.org> on 2021/03/23 19:20:00 UTC
[jira] [Updated] (IMPALA-10602) Intersection of multiple min/max
filters when applying to common equi-join columns
[ https://issues.apache.org/jira/browse/IMPALA-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qifan Chen updated IMPALA-10602:
--------------------------------
Description:
Currently, Impala generates two min/max filters from the two joins, for a test query as follows.
{quote}select straight_join count(*)
from store_sales ss, date_dim d1, date_dim d2
where
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}
{quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
| HDFS partitions=1824/1824 files=1824 size=200.94MB |
| runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
| stored statistics: |
| table: rows=2.88M size=200.94MB |
| partitions: 1824/1824 rows=2.88M |
| columns: all |
| extrapolated-rows=disabled max-scan-range-rows=130.09K |
| file formats: [PARQUET] |
| mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
| tuple-ids=0 row-size=4B cardinality=2.88M{quote}
Since both min/max filters are applied to the same equi-join column ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003 into a single filter and apply that combined filter instead. For example, if the range of RF001 is [10, 50] and the range of RF003 is [20, 60], the combined filter is [20, 50].
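The interval arithmetic in the example can be sketched in a few lines of Python. This is only an illustration of intersecting two closed ranges, not Impala's implementation: the MinMaxRange class and the intersect function are hypothetical names, and integer bounds are assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinMaxRange:
    """Hypothetical closed interval [lo, hi] carried by one min/max filter."""
    lo: int
    hi: int


def intersect(a: MinMaxRange, b: MinMaxRange) -> Optional[MinMaxRange]:
    """Return the overlap of two closed ranges, or None if they are disjoint."""
    lo = max(a.lo, b.lo)  # the tighter lower bound wins
    hi = min(a.hi, b.hi)  # the tighter upper bound wins
    return MinMaxRange(lo, hi) if lo <= hi else None


# Ranges taken from the example in the description.
rf001 = MinMaxRange(10, 50)
rf003 = MinMaxRange(20, 60)
print(intersect(rf001, rf003))  # MinMaxRange(lo=20, hi=50)
```

If the two ranges do not overlap, this sketch returns None, which would correspond to a combined filter that no value of ss.ss_sold_time_sk can satisfy.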
was:
Currently, Impala actually generates two min/max filters from the two joins, for a test query as follows.
{quote}select straight_join count(*)
from store_sales ss, date_dim d1, date_dim d2
where
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}
{quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
| HDFS partitions=1824/1824 files=1824 size=200.94MB |
| runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
| stored statistics: |
| table: rows=2.88M size=200.94MB |
| partitions: 1824/1824 rows=2.88M |
| columns: all |
| extrapolated-rows=disabled max-scan-range-rows=130.09K |
| file formats: [PARQUET] |
| mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
| tuple-ids=0 row-size=4B cardinality=2.88M{quote}
Since both min/max filters are applied to the same equi-join column ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003 into a single filter and apply that combined filter instead. For example, if the range of RF001 is [10, 50] and the range of RF003 is [20, 60], the combined filter is [20, 50].
Summary: Intersection of multiple min/max filters when applying to common equi-join columns (was: Intersection of multiple min/max filters when applying onto common equi-join column)
> Intersection of multiple min/max filters when applying to common equi-join columns
> ----------------------------------------------------------------------------------
>
> Key: IMPALA-10602
> URL: https://issues.apache.org/jira/browse/IMPALA-10602
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Qifan Chen
> Priority: Major
>
> Currently, Impala generates two min/max filters from the two joins, for a test query as follows.
> {quote}select straight_join count(*)
> from store_sales ss, date_dim d1, date_dim d2
> where
> ss.ss_sold_time_sk = d1.d_date_sk and
> ss.ss_sold_time_sk = d2.d_date_sk;{quote}
> {quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
> | HDFS partitions=1824/1824 files=1824 size=200.94MB |
> | runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
> | stored statistics: |
> | table: rows=2.88M size=200.94MB |
> | partitions: 1824/1824 rows=2.88M |
> | columns: all |
> | extrapolated-rows=disabled max-scan-range-rows=130.09K |
> | file formats: [PARQUET] |
> | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
> | tuple-ids=0 row-size=4B cardinality=2.88M{quote}
> Since both min/max filters are applied to the same equi-join column ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003 into a single filter and apply that combined filter instead. For example, if the range of RF001 is [10, 50] and the range of RF003 is [20, 60], the combined filter is [20, 50].