You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Qifan Chen (Code Review)" <ge...@cloudera.org> on 2021/02/18 14:14:21 UTC

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Qifan Chen has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17075


Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

[WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column during column statistics computation
and to store the stats in HMS. These two new stats are used by the
back end to discard min/max filters whose coverage are too close to
the actual range.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
10 files changed, 391 insertions(+), 20 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 2
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 23: Code-Review+1

(4 comments)

This mostly looks good to me, so doing a +1. One pending question is about the DATE type..can you clarify?

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG@12
PS21, Line 12: both hash join builders and Parque
> Currently at the scanner level, the overlap is checked against the column i
Ok, so the worse that would happen if the stats are stale is that instead of marking it as alwaysTrue, we would keep the min-max filter and since the min-max filter is created based on the actual data values, it should be ok.
Agree about the TPC-DS use case for the scans since most of the columns have either randomly or uniformly distributed values across all row groups.


http://gerrit.cloudera.org:8080/#/c/17075/23/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/23/be/src/exec/filter-context.cc@477
PS23, Line 477:     case PrimitiveType::TYPE_DATE:
I thought you added DATE in patch set 23 ?


http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10
PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey bigint)
> Sounds like a good idea. Due to the complexity of bin/generate-schema-state
I am ok with deferring this. Pls create a JIRA ticket.


http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@221
PS21, Line 221: ---- QUERY
> Another approach would be to join D1 and D2 first which will produce one filter.
The join order decisions are made during the logical planning phase based on costing, so we would not change that for the min-max filters.  But yeah, we should create an enhancement JIRA for creating the intersection of the 2 or more min-max intervals. It is also easier to do compared to the bloom filter aggregation that is supported today that happens at the coordinator node.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Mar 2021 02:30:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 30: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:58:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 22:

Aman asked whether min/max filters are applied in the context of out-joins. The answer is no.  Please refer to TPCDS q49.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Mar 2021 20:27:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8279/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 12
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Mar 2021 03:58:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

[WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
and date during column statistics computation and to store the stats
in HMS. These two new stats are used by the back end to discard
min/max filters whose coverage are too close to the actual range.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/Column.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
11 files changed, 430 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 15:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8310/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 15
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Mar 2021 21:12:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,145 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/22
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 24:

(9 comments)

lgtm, mostly found nits.

http://gerrit.cloudera.org:8080/#/c/17075/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/24//COMMIT_MSG@74
PS24, Line 74:  1. Enable the feature for Iceberg tables with Parquet data files.
Do you plan to resolve this TODO in this patch, or in a follow-up Jira? If the latter, please create a new Jira ticket and refer to the ticket id here.


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc@432
PS24, Line 432: // in the filter is too close to the column min/max.
nit: please extend comment with in what cases we return true/false.


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc@433
PS24, Line 433: OverlapWithColumnLowAndHighValue
nit: this name doesn't really tell what this function does. Maybe 'ShouldRejectFilterBasedOnColumnStats()'?


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@122
PS24, Line 122: NumColumnStatsRejectedRowGroups
nit: I think it's easy to misinterpret the name of this counter. It sounds like it counts the number of row groups rejected by column stats. Maybe a better name would be "NumUnusefulFiltersForRowGroups"?


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc@631
PS24, Line 631: l
nit: typo


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc@632
PS24, Line 632: must
nit: must be?


http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@4
PS24, Line 4: set compute_column_minmax_stats = true;
            : compute stats functional_parquet.alltypes;
I'm not sure if it's a good idea to modify these tables in tests, even if it's just column stats. Other tests might depend on the stats being set/unset. Probably we should just copy these tables into a private unique database. Also, we could use alltypestiny instead of alltypes.


http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@8
PS24, Line 8: create database if not exists unique_database;
You can change TestOverlapMinMaxFilters:: test_all_runtime_filters to use a unique database by just adding a parameter to your test method then pass it to run_test_case(). Like you already do it in this patch set for test_hbase_compute_stats (though the method name shouldn't have hbase in it).


http://gerrit.cloudera.org:8080/#/c/17075/24/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/17075/24/tests/metadata/test_compute_stats.py@446
PS24, Line 446: _hbase_
this test is not related to HBase.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Mar 2021 15:04:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/13/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/17075/13/tests/metadata/test_compute_stats.py@430
PS13, Line 430: class TestParquetComputeColumnMinMax(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:43:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 17:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8318/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 17
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Mar 2021 16:55:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
or date for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed when each
partition is processed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for float, timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
    the column stats;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/scan-node.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
21 files changed, 816 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/9
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 16:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8311/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 16
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Mar 2021 21:20:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17075/2/be/src/exec/incr-stats-util.cc
File be/src/exec/incr-stats-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/2/be/src/exec/incr-stats-util.cc@123
PS2, Line 123:       low_value.bool_val &= value.bool_val; 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17075/2/be/src/exec/incr-stats-util.cc@150
PS2, Line 150:       high_value.bool_val ^= value.bool_val; 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@293
PS2, Line 293:           Preconditions.checkState(false, "Unsupported type encountered in setLowValue()");
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@318
PS2, Line 318:           Preconditions.checkState(false, "Unsupported type encountered in setLowValue()");
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@350
PS2, Line 350:     LOG.error("ColumnStats::update() with Hive ColumnStatisticsData!" + ", coltype=" + colType.toSql());
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@354
PS2, Line 354:     /// called. 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17075/2/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@355
PS2, Line 355:     /// 
line has trailing whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 2
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Feb 2021 14:15:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 27: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Mar 2021 12:52:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 29: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:56:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#26). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
36 files changed, 1,415 insertions(+), 129 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/26
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 26
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8175/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 7
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Sat, 20 Feb 2021 00:37:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8296/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Mar 2021 01:10:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/25/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/17075/25/tests/query_test/test_runtime_filters.py@287
PS25, Line 287: u
flake8: F821 undefined name 'unique_database'



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Mar 2021 19:26:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 30:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7044/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:58:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

[WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
and date during column statistics computation and to store the stats
in HMS. These two new stats are used by the back end to discard
min/max filters whose coverage are too close to the actual range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed when each
partition is processed in coordinator.

TODO:
  Compute the min/max for partition columns of type timestamp and
  date.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
13 files changed, 550 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 6
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 25:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8425/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Mar 2021 19:46:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter is no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,084 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/16
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 16
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 28:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8460/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Mar 2021 18:42:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

[WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
and date during column statistics computation and to store the stats
in HMS. These two new stats are used by the back end to discard
min/max filters whose coverage are too close to the actual range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed when each
partition is processed in coordinator.

Testing:
 - Add TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
  Compute the min/max for partition columns of type timestamp and
  date.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
16 files changed, 624 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/7
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 7
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 27:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> I guess we have quite a bunch of .test files that are executed against all 
Ok, yeah it makes sense that the same .test file is used for different formats..so will involve updating several of them (just for reference, I checked under testdata:  
   testdata/workloads$ grep -r -i "show column stats" | cut -d ":" -f 1 | uniq | wc -l
15
It shows 15 files). 

On extending the min/max to other formats, we could probably consider in the future .. any such stats should ideally be leveraged by either the optimizer (for selectivity estimate) or runtime rather than be only for informational purpose.

My other thought on this is that showing min/max stats should ideally be controlled through a syntax improvement.. e.g SHOW COLUMN STATS WITH MINMAX <table name>  rather than a query option. 
But in the interest of moving this patch forward, I am ok with deferring such work.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Mar 2021 17:37:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 28: Code-Review+1

(1 comment)

LGTM, but had a comment about testing Hudi tables

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: # Create a new hudi parquet table.
Please note that it creates a plain Parquet table from the data in the Hudi table. Unfortunately Impala cannot write Hudi tables.

I think we have the following options here:
* Add a hudi table to the repo only for this test. But it needs some work to figure out how to do that
* Use functional_parquet.hudi_non_partitioned and hope it doesn't break other tests. But we cannot guarantee it won't break future tests.
* Create a Jira ticket to add Hudi testing later



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 11:57:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
37 files changed, 1,434 insertions(+), 141 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 31
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8270/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Mar 2021 21:33:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17075/3/fe/src/main/java/org/apache/impala/catalog/Column.java
File fe/src/main/java/org/apache/impala/catalog/Column.java:

http://gerrit.cloudera.org:8080/#/c/17075/3/fe/src/main/java/org/apache/impala/catalog/Column.java@73
PS3, Line 73:    LOG.error("Column::updateStats(ColumnStatisticsData): " + name_ + " #distinct=" + stats_.getNumDistinctValues());
line too long (116 > 90)


http://gerrit.cloudera.org:8080/#/c/17075/3/fe/src/main/java/org/apache/impala/catalog/Column.java@82
PS3, Line 82:    LOG.error("Column::updateStats(TColumnStats): " + name_ + " #distinct=" + stats_.getNumDistinctValues());
line too long (108 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Feb 2021 21:59:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 27:

(3 comments)

Was getting ready to +2 this but just had couple of additional questions and a suggestion.

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> On paper, yes we should be able to show the min/max. On the other hand, the
Did we arrive at a consensus on this ?  Adding new query options means we have to document them.  It seems option 2 could be deprecated in the future as soon as tests are updated so I would vote for not documenting it for now. I am also not clear why many tests have to be updated.. if non-parquet formats don't have this stats stored in HMS, why would it affect the display ? We could just skip the last 2 columns.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:       for (FeFsPartition partition : allPartitions) {
it's unfortunate that we have to iterate through all partitions (could be tens of thousands) to check this..and I am not sure if at this point the how many rpcs the loadAllPartitions() will make to the catalog server. In a vast majority of cases all partitions will be of the same format.  My question is ..suppose you find at least 1 parquet format and return true for computeMinMax ..doesn't it just mean that whatever min/max stats is computed will be off (similar to stale stats that we discussed earlier) .. so in theory it will reduce the effectiveness of the filter but not affect correctness.
I am thinking of use cases with lots of partitions where users have COMPUTE STATS in their ETL scripts and suppose they enable compute_min_max_stats, this will slow down the compute stats noticeably.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279
PS27, Line 279:   public boolean isParquetBased() {
nit: There's an identical method in HdfsScanNode.java which is currently private (see below) but it would be good to either have that method call this one or have a public static isParquetBased(HdfsFileFormat) defined somewhere.

https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L387

Also, I suppose HUDI_PARQUET min/max stats should work with this patch ?  I don't think a test has been added for that. The fact that the other method in HdfsScanNode treats both somewhat equivalent leads me to think it should work.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Mar 2021 06:14:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest;
 - Add new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Tested compute/show stats for integers, float and double column data
   types;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Add logic to disable min/max filters inside HJ builder via
    the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
32 files changed, 1,007 insertions(+), 81 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/14
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 24:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8419/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Mar 2021 21:55:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8335/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 01:20:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
or date for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for float, timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
    the column stats;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/scan-node.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
22 files changed, 826 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/10
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range in both hash join builders and Parquet scanners.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner.;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,107 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/18
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 18
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8158/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 2
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Feb 2021 14:25:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#17). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range in both hash join builders and Parquet scanners.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,102 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/17
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 17
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 29: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: 
> Good catch. Did a unit test with the hudi table. All seems good. I think it
Thanks for trying it out. I'm OK with the current state, HUDI_PARQUET is just PARQUET for Impala.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Apr 2021 08:06:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 19:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> Did we arrive at a consensus on this ?  Adding new query options means we h
I guess we have quite a bunch of .test files that are executed against all kinds of file formats, and the tests require to have the same result data in each case.

The other option is to calculate min/max for every file format. Since we only do it for numeric values I don't think it would add too much overhead. We couldn't use them during query execution (only for Parquet and later ORC?), but might be still useful for a user if they want to see some basic stats of a table.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279
PS27, Line 279:   public boolean isParquetBased() {
> nit: There's an identical method in HdfsScanNode.java which is currently pr
Yeah, HUDI_PARQUET means Hudi table format + Parquet data files, therefore the min/max filters will work fine on them.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Mar 2021 16:26:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8295/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Mar 2021 01:02:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 21:

Thinking about min-max filters in the context of outer joins, I would expect they are _not_ pushed to the non-null producing side of an outer join (we have similar restriction for the runtime bloom filters). Is that handled somewhere (maybe in another patch) ?


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Mar 2021 18:40:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 23:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8411/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Mar 2021 23:04:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 4:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8166/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 4
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Feb 2021 23:15:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,155 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/21
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

[WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
and date during column statistics computation and to store the stats
in HMS. These two new stats are used by the back end to discard
min/max filters whose coverage are too close to the actual range.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
10 files changed, 453 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 4
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 21:

(5 comments)

I haven't analyzed how exactly the overlap of min-max filters works..so will go through the design doc and maybe the original commit.  In the meantime, sending some initial comments.

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc
File be/src/exec/catalog-op-executor.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc@316
PS19, Line 316: or non-string
              :   // columns.
Update this since min/max stats are not getting computed for Decimal types.  Also, are we sure that allowing float/double min-max filters is ok ? These are not precise data types.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@474
PS19, Line 474:     case PrimitiveType::TYPE_TIMESTAMP:
The commit message says the min-max column stats is used for integer, float and double so I was not expecting to see timestamp and date types also considered here.  Shouldn't they just to the default case and return false ?


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc
File be/src/exec/incr-stats-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc@260
PS19, Line 260: apache::hive::service::cli::thrift::TColumnValue
It would be good to create a typedef for the hive version of TColumnValue.  Also, the package name looks different from what I see in the source code here: https://github.com/apache/hive/blob/master/service-rpc/src/gen/thrift/gen-javabean/org/apache/hive/service/rpc/thrift/TColumnValue.java#L7


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@957
PS19, Line 957:         VLOG(3) << "The filter is set to always true.";
nit: Would be useful to print that this was a min-max filter and probably the filter id.


http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10
PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey bigint)
Creating large tables during a test run adds unacceptable latency to the  testing time.   Can the db and table creation steps not be moved into testdata/datasets/tpch/  as a schema template for min-max filters ?



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Mar 2021 18:14:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 13:

Added the following:
 1. New test overlap_min_max_filters.test
 2. Query option show_column_minmax_stats to facilitate shield existing column stats from format change
 3. A new counter NumColumnStatsRejectedRowGroups.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:45:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,137 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/20
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 20
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 29:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8493/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 17:46:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 27:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8432/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Mar 2021 14:34:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer and double
for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
    the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
25 files changed, 827 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/11
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#24). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, or date for parquet
tables, and to make use of the new stats to discard min/max filters, in
both hash join builders and Parquet scanners, whose coverage are too
close to the actual range defined by the column min and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,401 insertions(+), 116 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/24
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 28: Code-Review+1

(2 comments)

Just a nit and a comment. Should be able to +2 after that.

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:       }
> Looks like mixture of files of different format (like Parquet and ORC at the same time) is not allowed. 
This may not be accurate.  In the HdfsScanNode's computeScanRangeLocation(), we examine the file format for a partition and just record it but just set a flag if they are not all parquet format. It does not assert or return error. 
Here's the snippet in HdfsScanNode.java:
      fileFormats_.add(partition.getFileFormat());
      if (!isParquetBased(partition.getFileFormat())) {
        allParquet = false;
      }
However, for statistics, as I mentioned before (and you  seem in agreement) that having at least one parquet partition is sufficient to trigger the min-max filter checks since it does not affect correctness of results.


http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@387
PS28, Line 387:   /*
Can you remove this method.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 31 Mar 2021 21:39:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8243/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Feb 2021 22:27:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 20:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8369/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 20
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Mar 2021 15:04:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,112 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/19
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tests with TPCDS 3TB which demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
36 files changed, 1,414 insertions(+), 128 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/25
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#23). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, or date for parquet
tables, and to make use of the new stats to discard min/max filters, in
both hash join builders and Parquet scanners, whose coverage are too
close to the actual range defined by the column min and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for decimal columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,280 insertions(+), 113 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/23
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 11:

Also add the query option compute_column_min_max_stats. Default to 'false'.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Mar 2021 21:14:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 21:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8371/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Mar 2021 17:24:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest;
 - Add new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Tested compute/show stats for integers, float and double column data
   types;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Add logic to disable min/max filters inside HJ builder via
    the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
32 files changed, 1,006 insertions(+), 81 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/13
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 21:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG@12
PS21, Line 12: join builders and Parquet scanners
Is it fair to say that the main value proposition of the column  min/max statistics is in the join builder ?  The Parquet scanners already have access to the row group's min/max stats per column, so it seems to me the stats coming from HMS will not add additional value there. But for the HJ builder, yes it helps by figuring out whether after the build phase let's say you have the range [10, 50] and the min/max stats fetched from HMS are [60, 100] then we can quickly say that the runtime min/max filter will exclude all row groups.

But what happens if the stats are out-of-date ? Since these stats are getting uses not just for ACID tables but for external tables as well, the stats are not bound to the table's valid write id (applicable only for ACID tables). This can lead to incorrect overlap calculation. Any thoughts ?


http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@221
PS21, Line 221: ---- QUERY
For the tests, it would be good to add a 3 table join case as well. My understanding is that if you have a fact table F and 2 dimension tables D1, D2 and suppose F is joined to D1, D2 as follows:  
 F. a1 = D1.a1 AND F.a1 = D2.a1  then suppose the first join's  range is  [10, 50] and the second join's range is [20, 60]  then the filter seen at scan of F  will be [20, 50].  Are such scenario already tested elsewhere ? If so, feel free to point me to those and mark this resolved.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Mar 2021 03:07:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 22:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8394/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Mar 2021 20:17:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 14:

Fix incorrectly spelled query option name in overlap_min_max_filters.test.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:51:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 18:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8322/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 18
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Mar 2021 19:09:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 30: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Apr 2021 21:50:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 28:

(3 comments)

Rework.

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:       }
> > Looks like mixture of files of different format (like Parquet and ORC at 
Reworked method hasAtLeastOneParquetPartition() to iterate until a Parquet partition is met.

DONE.


http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@387
PS28, Line 387:   /*
> Can you remove this method.
Done


http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: # Create a new hudi parquet table.
> Please note that it creates a plain Parquet table from the data in the Hudi
Good catch. Did a unit test with the hudi table. All seems good. I think it probably is okay to leave out the hudi test for now. 

Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build e45818b616ab4ff81fa512c8257bc8fac594094a)
COMPUTE_COLUMN_MINMAX_STATS set to true
Query: compute stats functional_parquet.hudi_non_partitioned
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 15 column(s). |
+------------------------------------------+
Fetched 1 row(s) in 0.53s
[12:57:17 qchen@qifan-10229: IMPALA-10494_making_use_of_minmax_column_stats] sql dml.showstats.hudi
Starting Impala Shell with no authentication using Python 2.7.16
Warning: live_progress only applies to interactive shell sessions, and is being skipped for now.
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build e45818b616ab4ff81fa512c8257bc8fac594094a)
SHOW_COLUMN_MINMAX_STATS set to true
Query: show column stats functional_parquet.hudi_non_partitioned
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
| Column                 | Type                                  | #Distinct Values | #Nulls | Max Size | Avg Size          | #Trues | #Falses | Min                  | Max                |
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
| _hoodie_commit_time    | STRING                                | 2                | 0      | 14       | 14                | -1     | -1      | -1                   | -1                 |
| _hoodie_commit_seqno   | STRING                                | 97               | 0      | 20       | 20                | -1     | -1      | -1                   | -1                 |
| _hoodie_record_key     | STRING                                | 99               | 0      | 36       | 36                | -1     | -1      | -1                   | -1                 |
| _hoodie_partition_path | STRING                                | 3                | 0      | 25       | 25                | -1     | -1      | -1                   | -1                 |
| _hoodie_file_name      | STRING                                | 6                | 0      | 71       | 70.68000030517578 | -1     | -1      | -1                   | -1                 |
| _hoodie_is_deleted     | BOOLEAN                               | 2                | 0      | 1        | 1                 | 0      | 100     | -1                   | -1                 |
| _row_key               | STRING                                | 99               | 0      | 36       | 36                | -1     | -1      | -1                   | -1                 |
| begin_lat              | DOUBLE                                | 100              | 0      | 8        | 8                 | -1     | -1      | 0.013803214965246391 | 0.9973157077943435 |
| begin_lon              | DOUBLE                                | 99               | 0      | 8        | 8                 | -1     | -1      | 0.014143391676368022 | 0.991562254763212  |
| driver                 | STRING                                | 2                | 0      | 10       | 10                | -1     | -1      | -1                   | -1                 |
| end_lat                | DOUBLE                                | 100              | 0      | 8        | 8                 | -1     | -1      | 7.903052288528167E-4 | 0.9877514097604384 |
| end_lon                | DOUBLE                                | 99               | 0      | 8        | 8                 | -1     | -1      | 0.029829569706356973 | 0.9978872086544781 |
| fare                   | STRUCT<amount:DOUBLE,currency:STRING> | -1               | -1     | -1       | -1                | -1     | -1      | -1                   | -1                 |
| partition              | STRING                                | 3                | 0      | 25       | 25                | -1     | -1      | -1                   | -1                 |
| rider                  | STRING                                | 2                | 0      | 9        | 9                 | -1     | -1      | -1                   | -1                 |
| timestamp              | DOUBLE                                | 1                | 0      | 8        | 8                 | -1     | -1      | 0.0                  | 0.0                |
+------------------------+---------------------------------------+------------------+--------+----------+-------------------+--------+---------+----------------------+--------------------+
Fetched 16 row(s) in 0.02s


| Max Per-Host Resource Reservation: Memory=6.97MB Threads=5                                                         |
| Per-Host Resource Estimates: Memory=55MB                                                                           |
| Codegen disabled by planner                                                                                        |
| Analyzed query: SELECT /* +straight_join */ a.`_hoodie_record_key` FROM                                            |
| functional_parquet.hudi_non_partitioned a,                                                                         |
| functional_parquet.hudi_non_partitioned b WHERE a.begin_lat = b.end_lat                                            |
|                                                                                                                    |
| F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                                              |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1                                |
|   PLAN-ROOT SINK                                                                                                   |
|   |  output exprs: a.`_hoodie_record_key`                                                                          |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0                           |
|   |                                                                                                                |
|   04:EXCHANGE [UNPARTITIONED]                                                                                      |
|      mem-estimate=23.18KB mem-reservation=0B thread-reservation=0                                                  |
|      tuple-ids=0,1 row-size=64B cardinality=100                                                                    |
|      in pipelines: 00(GETNEXT)                                                                                     |
|                                                                                                                    |
| F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                                                     |
| Per-Host Resources: mem-estimate=34.95MB mem-reservation=2.95MB thread-reservation=2 runtime-filters-memory=1.00MB |
|   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, UNPARTITIONED]                                                       |
|   |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                                       |
|   02:HASH JOIN [INNER JOIN, BROADCAST]                                                                             |
|   |  hash predicates: a.begin_lat = b.end_lat                                                                      |
|   |  fk/pk conjuncts: a.begin_lat = b.end_lat                                                                      |
|   |  runtime filters: RF000[bloom] <- b.end_lat, RF001[min_max] <- b.end_lat                                       |
|   |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0                          |
|   |  tuple-ids=0,1 row-size=64B cardinality=100                                                                    |
|   |  in pipelines: 00(GETNEXT), 01(OPEN)                                                                           |
|   |                                                                                                                |
|   |--03:EXCHANGE [BROADCAST]                                                                                       |
|   |     mem-estimate=16.00KB mem-reservation=0B thread-reservation=0                                               |
|   |     tuple-ids=1 row-size=8B cardinality=100                                                                    |
|   |     in pipelines: 01(GETNEXT)                                                                                  |
|   |                                                                                                                |
|   00:SCAN HDFS [functional_parquet.hudi_non_partitioned a, RANDOM]                                                 |
|      HDFS partitions=1/1 files=3 size=28.45KB                                                                      |
|      runtime filters: RF001[min_max] -> a.begin_lat, RF000[bloom] -> a.begin_lat                                   |
|      stored statistics:                                                                                            |
|        table: rows=100 size=28.45KB                                                                                |
|        columns: all                                                                                                |
|      extrapolated-rows=disabled max-scan-range-rows=34                                                             |
|      file formats: [PARQUET]                                                                                       |
|      mem-estimate=32.00MB mem-reservation=16.00KB thread-reservation=1                                             |
|      tuple-ids=0 row-size=56B cardinality=100                                                                      |
|      in pipelines: 00(GETNEXT)                                                                                     |
|                                                                                                                    |
| F01:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                                                     |
| Per-Host Resources: mem-estimate=16.00MB mem-reservation=16.00KB thread-reservation=2                              |
|   DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]                                                           |
|   |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                                       |
|   01:SCAN HDFS [functional_parquet.hudi_non_partitioned b, RANDOM]                                                 |
|      HDFS partitions=1/1 files=3 size=28.45KB                                                                      |
|      stored statistics:                                                                                            |
|        table: rows=100 size=28.45KB                                                                                |
|        columns: all                                                                                                |
|      extrapolated-rows=disabled max-scan-range-rows=34                                                             |
|      file formats: [PARQUET]                                                                                       |
|      mem-estimate=16.00MB mem-reservation=16.00KB thread-reservation=1                                             |
|      tuple-ids=1 row-size=8B cardinality=100                                                                       |
|      in pipelines: 01(GETNEXT)                                                                                     |
+--------------------------------------------------------------------------------------------------------------------+
Fetched 61 row(s) in 0.02s

Query: select straight_join  a._hoodie_record_key from 
hudi_non_partitioned a, hudi_non_partitioned b 
where a.begin_lat = b.end_lat
Query submitted at: 2021-04-01 13:06:55 (Coordinator: http://qifan-10229:25000)
Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=5d4afb9f44b47432:2a456b9d00000000
Fetched 0 row(s) in 0.11s



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 17:20:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#29). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
37 files changed, 1,434 insertions(+), 141 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/29
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#28). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files.
 4. IMPALA-10617: Compute min/max column stats beyond the parquet tables

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
37 files changed, 1,471 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/28
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 19:

(8 comments)

Did a quick walkthrough, will look into it in detail next week.

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
Do we need this query option? I mean if we have min/max stats then we'd probably want to show them.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@447
PS19, Line 447: (
nit: parentheses are not needed as the '.' member access takes precedence over the '&' adress-of operator,


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@492
PS19, Line 492: (
nit: parentheses not needed


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/hdfs-scanner.h@348
PS19, Line 348:     uint8_t enabled_for_rowgroup;
Why do we need this flag? If enabled_for_rowgroup is false, then the min/max filter is completely turned off, right? In that case we shouldn't even send them to the scanner, or am I missing something?


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/parquet/hdfs-parquet-scanner.cc@662
PS19, Line 662:     }
nit: EvaluateOverlapForRowGroup() is already quite long, maybe this code could go into a separate function.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@950
PS19, Line 950:     
nit: indentation


http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@50
PS19, Line 50: LOG
seems unused


http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@278
PS19, Line 278:    */
It would be good to handle Iceberg tables that use Parquet data files.



-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 12 Mar 2021 17:02:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#15). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter is no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
36 files changed, 1,084 insertions(+), 110 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/15
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 15
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8165/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 3
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Feb 2021 22:09:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
    the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
26 files changed, 847 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/12
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 12
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 26:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8428/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 26
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Mar 2021 02:44:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has uploaded a new patch set (#27). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
M tests/query_test/test_runtime_filters.py
36 files changed, 1,415 insertions(+), 129 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/27
-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )

Change subject: [WIP] IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8173/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 6
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Feb 2021 19:35:54 +0000
Gerrit-HasComments: No