Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/05/26 06:41:00 UTC

[jira] [Commented] (IMPALA-10650) Bail out min/max filters in hash join builder early

    [ https://issues.apache.org/jira/browse/IMPALA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351550#comment-17351550 ] 

ASF subversion and git services commented on IMPALA-10650:
----------------------------------------------------------

Commit b50d60a6c5b6fdd182dfc851841edae5cd1b3943 in impala's branch refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b50d60a ]

IMPALA-10650: Bailout min/max filters in hash join builder early

This change set addresses a weakness in populating min/max filters
in the hash join builder by periodically measuring the usefulness of
each filter and setting the 'always_true_' flag accordingly. Once the
flag is set to true, insertion into such a filter skips every step,
from evaluating the value from a row to checking that value against
the min/max range. This optimization is LLVM-enabled.
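
The periodic usefulness check can be illustrated with a minimal C++
sketch. It is not the code from this commit: the class name
MinMaxFilterSketch, the 'threshold' and 'check_interval' parameters,
and the range-coverage test are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified min/max filter; not Impala's actual class.
class MinMaxFilterSketch {
 public:
  // col_min/col_max: column stats of the target column (assumed known).
  // threshold: fraction of the column range at which the filter is
  //            considered useless (assumed tuning knob).
  // check_interval: number of inserts between usefulness checks.
  MinMaxFilterSketch(int64_t col_min, int64_t col_max, double threshold,
                     int check_interval)
      : col_min_(col_min), col_max_(col_max), threshold_(threshold),
        check_interval_(check_interval) {
    assert(check_interval > 0);
  }

  void Insert(int64_t v) {
    if (always_true_) return;  // early bail-out: no further work per row
    if (v < min_) min_ = v;
    if (v > max_) max_ = v;
    if (++inserted_since_check_ >= check_interval_) {
      inserted_since_check_ = 0;
      MaybeBailOut();
    }
  }

  bool always_true() const { return always_true_; }

  // A filter that is always true rejects nothing.
  bool MayContain(int64_t v) const {
    return always_true_ || (min_ <= v && v <= max_);
  }

 private:
  // Marks the filter as always true once it covers too much of the column range.
  void MaybeBailOut() {
    const double col_range =
        static_cast<double>(col_max_) - static_cast<double>(col_min_);
    if (col_range <= 0) {
      always_true_ = true;
      return;
    }
    const double covered =
        static_cast<double>(max_) - static_cast<double>(min_);
    if (covered / col_range >= threshold_) always_true_ = true;
  }

  const int64_t col_min_;
  const int64_t col_max_;
  const double threshold_;
  const int check_interval_;
  int inserted_since_check_ = 0;
  int64_t min_ = std::numeric_limits<int64_t>::max();
  int64_t max_ = std::numeric_limits<int64_t>::lowest();
  bool always_true_ = false;
};
```

In this sketch the check runs every 'check_interval' inserts rather
than once after the whole build side has been consumed, so a filter
that already spans most of the column range stops costing anything
per row.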

In addition, a new flag 'is_min_max_value_present' is added to
TRuntimeFilterTargetDesc to indicate whether the min/max column stats
are present in the query plan. The flag eliminates the need to check
for the presence of min/max stats for every row in the back end.

Early bail-out improves the HJ builder step in general. For example,
the step for join node #11 in TPCDS Q8 improves by 13%, and the step
for join node #8 in TPCDS Q16 improves by 3.2%.

The Insert() methods are optimized with branch prediction compiler
hints, which yield the following improvements when tested by
inserting 10000 randomly generated items.

  Small Integers: 7.0%
  Integers:       4.1%
  Big Integers:   4.3%
  Strings:        5.6%
  Dates:          4.4%
  Timestamps:    10.7%
  Decimals(4):   10.4%
  Decimals(8):    9.1%
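
As an illustration of what branch prediction hints look like, here is
a minimal sketch using the GCC/Clang builtin __builtin_expect. The
macro names, the IntRangeSketch struct, and the choice of which
branches are marked unlikely are assumptions for this example, not
the code from this commit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical hint macros; __builtin_expect is a GCC/Clang builtin,
// so other compilers fall back to the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_LIKELY(x) (x)
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical integer min/max range, for illustration only.
struct IntRangeSketch {
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::lowest();
  bool always_true = false;
};

// Inserts *v into the range. The hints tell the compiler to lay out
// the common path (non-null value, range unchanged) as the fall-through.
inline void InsertWithHints(IntRangeSketch* r, const int64_t* v) {
  assert(r != nullptr);
  if (SKETCH_UNLIKELY(v == nullptr)) return;     // NULLs assumed rare
  if (SKETCH_UNLIKELY(r->always_true)) return;   // filter already disabled
  if (SKETCH_UNLIKELY(*v < r->min)) r->min = *v; // new minimum assumed rare
  if (SKETCH_UNLIKELY(*v > r->max)) r->max = *v; // new maximum assumed rare
}
```

The hints do not change the result of Insert; they only affect code
layout, so any gain depends on how well the assumed likelihoods match
the data.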

In addition, the min/max stats for pages are read in batches, with a
fast-track version for the column types int32_t, int64_t, float,
double and date, whose storage format is identical to Parquet's. For a
row group, the page locations are read only once, instead of once for
every page skipped, resulting in a 100x speedup when a subset of 199
pages is skipped.
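
The fast-track idea can be sketched as follows, assuming the per-page
stats arrive as a list of plain-encoded byte strings and that the
in-memory type has the same byte layout as the stored value. The
function name DecodeStatsBatch and its signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Decodes a whole batch of per-page stat values in one pass.
// Assumption: each encoded value is exactly sizeof(T) bytes in the same
// layout as T in memory, so a byte copy is a valid decode. Returns false
// if any value has an unexpected size.
template <typename T>
bool DecodeStatsBatch(const std::vector<std::string>& encoded,
                      std::vector<T>* out) {
  static_assert(std::is_trivially_copyable<T>::value,
                "fast track only applies to trivially copyable types");
  out->resize(encoded.size());
  for (std::size_t i = 0; i < encoded.size(); ++i) {
    if (encoded[i].size() != sizeof(T)) return false;
    std::memcpy(&(*out)[i], encoded[i].data(), sizeof(T));
  }
  return true;
}
```

Types whose in-memory form differs from the stored form would need a
per-value conversion instead of the byte copy, which is why the fast
track is limited to the listed types.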

Testing:
  1. Ran core test successfully;
  2. Ran TPCDS performance tests.

Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
Reviewed-on: http://gerrit.cloudera.org:8080/17295
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Bail out min/max filters in hash join builder early 
> ----------------------------------------------------
>
>                 Key: IMPALA-10650
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10650
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Qifan Chen
>            Assignee: Qifan Chen
>            Priority: Major
>
> Currently, a mechanism is in place to set a min/max filter to always true (not useful) after all batches of rows are inserted into the hash table, utilizing the column stats.  While quite helpful, the mechanism does not exploit the property that the same not useful state can be reached as soon as several batches are inserted. 


