Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/06/16 08:42:00 UTC

[jira] [Commented] (IMPALA-10709) Min/max filters should be enabled for joins on sorted columns in Parquet tables

    [ https://issues.apache.org/jira/browse/IMPALA-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364140#comment-17364140 ] 

ASF subversion and git services commented on IMPALA-10709:
----------------------------------------------------------

Commit 40c3074e79f4e35ef8af9bfe1f73aa34511425cf in impala's branch refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=40c3074 ]

IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables

This patch enables, by default, min/max filters for equi-joins on
lexical sort-by columns in Parquet tables created by Impala. This
takes advantage of Impala sorting the min/max values in the column
index of each data file for the table. The control knob is the query
option minmax_filter_sorted_columns, which defaults to true.

When minmax_filter_sorted_columns is true, the patch will generate
min/max filters only for the leading sort columns. The normal control
knobs minmax_filter_threshold (for threshold) and
minmax_filtering_level (for filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value
for the threshold, and selects PAGE as the filtering level.

In the backend, the skipped pages are quickly found by taking a
fast code path to identify the corresponding lower and the upper
bounds in the sorted min and max value arrays, given a range in the
filter.  The skipped pages are expressed as page ranges which are
translated into row ranges later on.
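
As an illustration of the lookup described above, here is a minimal
Python sketch. It is not Impala's actual C++ implementation. It assumes
that each page's min and max values are kept in two arrays that are
both sorted ascending, and that the filter carries an inclusive range
[lo, hi]. The function names and the first_row_of_page layout are
hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidate_pages(page_mins, page_maxs, lo, hi):
    """Return (first, last) page indexes that may overlap [lo, hi], or None.

    Assumes page_mins and page_maxs are both sorted ascending.
    """
    # First page whose max is >= lo; earlier pages lie entirely below the filter.
    first = bisect_left(page_maxs, lo)
    # Last page whose min is <= hi; later pages lie entirely above the filter.
    last = bisect_right(page_mins, hi) - 1
    if first > last:
        return None
    return first, last


def skipped_page_ranges(num_pages, candidates):
    """Express the skipped pages as inclusive (start, end) page ranges."""
    if candidates is None:
        return [(0, num_pages - 1)] if num_pages else []
    first, last = candidates
    ranges = []
    if first > 0:
        ranges.append((0, first - 1))
    if last < num_pages - 1:
        ranges.append((last + 1, num_pages - 1))
    return ranges


def to_row_ranges(page_ranges, first_row_of_page, total_rows):
    """Translate inclusive page ranges into inclusive row ranges."""
    row_ranges = []
    for start, end in page_ranges:
        first_row = first_row_of_page[start]
        # A page ends one row before the next page starts.
        if end + 1 < len(first_row_of_page):
            last_row = first_row_of_page[end + 1] - 1
        else:
            last_row = total_rows - 1
        row_ranges.append((first_row, last_row))
    return row_ranges
```

For pages with min values [1, 11, 21, 31], max values [10, 20, 30, 40]
and a filter range [12, 25], only pages 1 and 2 can contain matches,
so pages 0 and 3 are skipped.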

A new query option minmax_filter_fast_code_path is added to control
the fast code path. It takes one of three values: ON (default), OFF,
or VERIFICATION. The last one helps verify that the results from the
fast and the regular code paths are the same.
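
The idea behind the VERIFICATION value can be sketched as follows in
Python. This is a simplified illustration under stated assumptions,
not the backend code: the regular path is modeled as a per-page
overlap check, the fast path as a binary search over sorted min and
max arrays, and all function names are hypothetical. Raising an error
on a mismatch is this sketch's own choice; the commit message does not
say how a mismatch is reported.

```python
from bisect import bisect_left, bisect_right


def regular_path(page_mins, page_maxs, lo, hi):
    """Check every page: keep those whose [min, max] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zip(page_mins, page_maxs))
            if mx >= lo and mn <= hi]


def fast_path(page_mins, page_maxs, lo, hi):
    """Binary search; valid only when both arrays are sorted ascending."""
    first = bisect_left(page_maxs, lo)
    last = bisect_right(page_mins, hi) - 1
    return list(range(first, last + 1))


def pages_to_read(page_mins, page_maxs, lo, hi, mode="ON"):
    """Pick a code path by mode: "ON", "OFF", or "VERIFICATION"."""
    if mode == "OFF":
        return regular_path(page_mins, page_maxs, lo, hi)
    fast = fast_path(page_mins, page_maxs, lo, hi)
    if mode == "VERIFICATION":
        regular = regular_path(page_mins, page_maxs, lo, hi)
        # Raising on a mismatch is this sketch's choice, not documented behavior.
        if fast != regular:
            raise AssertionError(f"fast {fast} != regular {regular}")
    elif mode != "ON":
        raise ValueError(f"unknown mode: {mode}")
    return fast
```

On sorted input all three modes return the same pages. On input that
violates the sortedness assumption, the VERIFICATION mode is the one
that exposes the disagreement between the two paths.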

Preliminary performance testing (joining into a simplified TPCH
lineitem table with 2 sorted BIGINT columns and a total of 6001215
rows) confirms that min/max filtering on leading sort-by columns
greatly improves the performance of scan operators. The best result
is seen with pages containing no more than 24000 rows: 84.62ms
(page level filtering) vs. 115.27ms (row group level filtering)
vs. 137.14ms (no filtering). The query used is as follows.

  select straight_join a.l_orderkey from
  simpflified_lineitem a join [SHUFFLE] tpch_parquet.lineitem b
  where a.l_orderkey = b.l_orderkey and b.l_receiptdate = "1998-12-31"
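
To put the three timings quoted above in relative terms, the short
Python sketch below computes speedup factors and percentage
reductions. Only the three millisecond values come from the
measurements; the helper names are hypothetical.

```python
# Scan timings quoted above, in milliseconds.
timings_ms = {
    "page": 84.62,        # page level filtering
    "row_group": 115.27,  # row group level filtering
    "none": 137.14,       # no filtering
}


def speedup(baseline_ms, improved_ms):
    """How many times faster the improved run is than the baseline."""
    return baseline_ms / improved_ms


def reduction_pct(baseline_ms, improved_ms):
    """Percentage of the baseline time that the improved run saves."""
    return 100.0 * (baseline_ms - improved_ms) / baseline_ms


page_vs_none = speedup(timings_ms["none"], timings_ms["page"])
page_vs_row_group = speedup(timings_ms["row_group"], timings_ms["page"])
saved_vs_none = reduction_pct(timings_ms["none"], timings_ms["page"])

print(f"page vs none:      {page_vs_none:.2f}x")
print(f"page vs row group: {page_vs_row_group:.2f}x")
print(f"time saved vs none: {saved_vs_none:.1f}%")
```

With these numbers, page level filtering is roughly 1.6x faster than
no filtering and roughly 1.4x faster than row group level filtering.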

The patch also fixes the abnormal min/max display in the "Final
filter table" section of a profile for the DECIMAL, TIMESTAMP and
DATE data types, and the reading of the DATE column index in batch
without validation.

Testing:
  1). Added a new test overlap_min_max_filters_on_sorted_columns.test
      to verify
      a) Min/max filters are only created for the leading sort-by column;
      b) Query option minmax_filter_sorted_columns works;
      c) Query option minmax_filter_fast_code_path works.
  2). Added new tests in parquet-page-index-test.cc to test fast
      code path under various conditions;
  3). Ran core tests successfully.

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Reviewed-on: http://gerrit.cloudera.org:8080/17478
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Min/max filters should be enabled for joins on sorted columns in Parquet tables 
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-10709
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10709
>             Project: IMPALA
>          Issue Type: Test
>            Reporter: Qifan Chen
>            Assignee: Qifan Chen
>            Priority: Major
>
> Currently, the min/max filter feature is turned off by default (MINMAX_FILTER_THRESHOLD=0). 
> When joining into sorted columns in a fact Parquet table created by Impala, the feature can be turned on by default. This is because Impala sorts the data in sort-by columns in each data file during population. A min/max filter can be used to easily reject pages not overlapping with the search region specified in the filter. 


