Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/04/04 00:14:00 UTC

[jira] [Commented] (IMPALA-10603) Enable min/max overlap filter feature for Iceberg tables with Parquet data files

    [ https://issues.apache.org/jira/browse/IMPALA-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314373#comment-17314373 ] 

ASF subversion and git services commented on IMPALA-10603:
----------------------------------------------------------

Commit 1231208da7104c832c13f272d1e5b8f554d29337 in impala's branch refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1231208 ]

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the ability to compute the minimum and maximum values
of integer, float/double, date, and decimal columns in Parquet tables,
and uses the new stats to discard min/max filters, in both hash join
builders and Parquet scanners, when their coverage is too close to the
actual range defined by the column min and max.
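
To illustrate the idea of "coverage too close to the actual range", here
is a minimal Python sketch. It is not the patch's C++ implementation: the
names Range, overlap_ratio and is_filter_useful, and the 0.5 threshold,
are assumptions made up for this example. It only shows how a filter's
[min, max] can be compared against a column's [min, max] stats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed numeric interval [low, high] (hypothetical helper type)."""
    low: float
    high: float


def overlap_ratio(filter_range: Range, column_range: Range) -> float:
    """Fraction of the column's [min, max] range covered by the filter."""
    lo = max(filter_range.low, column_range.low)
    hi = min(filter_range.high, column_range.high)
    if hi < lo:
        # Disjoint ranges: the filter rejects every value in the column.
        return 0.0
    width = column_range.high - column_range.low
    if width == 0:
        # Single-valued column whose only value passes the filter.
        return 1.0
    return (hi - lo) / width


def is_filter_useful(filter_range: Range, column_range: Range,
                     threshold: float = 0.5) -> bool:
    """Keep a filter only if it covers less than `threshold` of the column
    range. The threshold value is an assumption for illustration only."""
    return overlap_ratio(filter_range, column_range) < threshold
```

With the ss_store_sk stats shown further down (min 1, max 10), a filter
covering [1, 10] has an overlap ratio of 1.0 and would be discarded under
this sketch, while a filter covering [1, 2] would be kept.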

The computation and display of the new column min/max stats are
controlled by two new Boolean query options (both default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples:

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
handle the always-true state better. Once always-true is set, the
actual min and max values in the filter are no longer populated.
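
The always-true behavior can be sketched as follows. This is a
simplified Python model, not Impala's C++ MinMaxFilter class or its
protobuf message: the class name MinMaxFilterSketch and the methods
insert, or_with, to_dict and may_contain are hypothetical stand-ins
(or_with and to_dict loosely mirror the Or() and ToProtobuf() methods
named in the testing notes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinMaxFilterSketch:
    """Toy min/max filter with an always-true state (illustrative only)."""
    always_true: bool = False
    min_val: Optional[int] = None
    max_val: Optional[int] = None

    def set_always_true(self) -> None:
        # Once always-true is set, min and max are no longer populated.
        self.always_true = True
        self.min_val = None
        self.max_val = None

    def insert(self, value: int) -> None:
        if self.always_true:
            return  # nothing to track: every value already passes
        self.min_val = value if self.min_val is None else min(self.min_val, value)
        self.max_val = value if self.max_val is None else max(self.max_val, value)

    def or_with(self, other: "MinMaxFilterSketch") -> None:
        """Merge `other` into this filter (union of accepted values)."""
        if self.always_true:
            return
        if other.always_true:
            self.set_always_true()
            return
        if other.min_val is None:
            return  # `other` is empty and contributes nothing
        self.insert(other.min_val)
        self.insert(other.max_val)

    def to_dict(self) -> dict:
        """Serialized form; min and max are omitted when always-true."""
        if self.always_true:
            return {"always_true": True}
        return {"always_true": False, "min": self.min_val, "max": self.max_val}

    def may_contain(self, value: int) -> bool:
        if self.always_true:
            return True
        if self.min_val is None:
            return False  # empty filter accepts nothing
        return self.min_val <= value <= self.max_val
```

In this model, merging any filter with an always-true filter yields an
always-true filter whose serialized form carries no min or max.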

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate that the Or()
   and ToProtobuf() methods and the constructor handle the always-true
   flag correctly;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful;
 - Ran core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Enable min/max overlap filter feature for Iceberg tables with Parquet data files
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-10603
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10603
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Qifan Chen
>            Priority: Major
>
> The resolution to IMPALA-10494 "Making use of the min/max column stats to improve min/max filters" can be applied to Iceberg tables with Parquet data files. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
