Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/09/17 05:57:00 UTC

[jira] [Commented] (IMPALA-10916) Pushdown Stats predicates with different argument types to the ORC reader

    [ https://issues.apache.org/jira/browse/IMPALA-10916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416471#comment-17416471 ] 

ASF subversion and git services commented on IMPALA-10916:
----------------------------------------------------------

Commit 35b21083b1866b7056e3810ae5a8daf7bc77ddda in impala's branch refs/heads/master from norbert.luksa
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=35b2108 ]

IMPALA-6505: Min-Max predicate push down in ORC scanner

In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. We can
easily extend this to ORC tables.

This commit implements min/max predicate pushdown for the ORC scanner
by leveraging the external ORC library's search arguments. We build
the search arguments when we open the scanner, as we do not need to
modify them later.

Also added a new query option orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (the default),
predicate pushdown takes effect; otherwise it is skipped. The
predicates are evaluated at ORC row group level, i.e. by default for
every 10,000 rows.
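
As a rough illustration of what row-group-level evaluation buys, the
sketch below keeps min/max statistics per row group and skips any group
whose value range cannot satisfy a predicate. It is plain Python, not
Impala or ORC library code; RowGroupStats, may_match, build_stats and
groups_to_read are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics: smallest and largest value."""
    min_value: int
    max_value: int


def may_match(stats, op, literal):
    """Return False only when no row in the group can satisfy `col <op> literal`."""
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    # Unknown operator: stay conservative and read the group.
    return True


def build_stats(values, rows_per_group=10_000):
    """Split a column into row groups and record min/max for each group."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroupStats(min(chunk), max(chunk)))
    return groups


def groups_to_read(groups, op, literal):
    """Indexes of the row groups that still have to be scanned."""
    return [i for i, g in enumerate(groups) if may_match(g, op, literal)]
```

For a sorted column such as an order key, an equality predicate touches
a single row group in this model, which is consistent with the speed-up
reported for selective queries in the Tests section.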

Limitations:
 - Min-max predicates on CHAR/VARCHAR types are not pushed down due to
   inconsistent behaviors on padding/truncating between Hive and Impala.
   (IMPALA-10882)
 - Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
 - Min-max predicates having different arg types are not pushed down
   (IMPALA-10916).
 - Min-max predicates with non-literal const exprs are not pushed down
   since SearchArgument interfaces only accept literals. This only
   happens when expr rewrites, and therefore constant folding, are
   disabled.

Tests:
 - Add e2e tests similar to test_parquet_stats to verify that
   predicates are pushed down.
 - Run CORE tests
 - Run the TPCH benchmark: there is neither improvement nor regression.
   However, certain selective queries gained a significant speed-up,
   e.g. select count(*) from lineitem where l_orderkey = 1.

Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Reviewed-by: Qifan Chen <qc...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Pushdown Stats predicates with different argument types to the ORC reader
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-10916
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10916
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>
> IMPALA-6505 introduces min-max predicate (stats predicate) pushdown to the ORC reader. However, stats predicates with different argument types may not be pushed down. For example, for the query
> {code:sql}
> select count(*) from functional_orc_def.decimal_tbl where d1 > 132842;
> {code}
> the slot type of "d1" is DECIMAL(9,0), while the type of literal "132842" is INT.
>  FE generates a stats predicate "d1 > 132842". After analysis, it becomes "CAST(d1 as INT) > CAST(132842 as INT)". The rhs is Literal(value=132842 type=INT) in the BE, which is good. But the lhs is no longer a single slot ref; it is an expression. So we cannot simply push it down.
> The following queries have the same issue:
> {code:sql}
> select count(*) from functional_orc_def.decimal_tbl where d1 > cast(132842.0 as float);
> select count(*) from functional_orc_def.decimal_tbl where d1 > cast(132842.0 as double);
> {code}
> Currently, a workaround is to avoid comparing a slot ref with a value of a different type. For decimal, this query works:
> {code:sql}
> select count(*) from functional_orc_def.decimal_tbl where d1 > 132842.0;
> {code}
> FE will analyze "132842.0" to be DECIMAL(10,1), so we are good.
> A possible solution is detecting such simple CAST expressions and generating the orc::PredicateDataType from the slot type instead of the literal type. We will need handcrafted casting logic when generating the orc::Literal, e.g. converting an impala::Literal(value=132842.0 type=FLOAT) to an orc::Literal of DECIMAL type.
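
The literal conversion described above could look roughly like the
following sketch. It is plain Python rather than Impala backend code,
and to_unscaled_decimal is a hypothetical helper introduced here: it
converts an INT, FLOAT or DOUBLE literal to the unscaled integer of a
DECIMAL(precision, scale) slot type. It assumes that a literal which is
not exactly representable in the slot type is simply not pushed down;
the issue does not say how such cases should be handled.

```python
import math
from fractions import Fraction


def to_unscaled_decimal(value, precision, scale):
    """Convert an int/float literal to the unscaled integer of
    DECIMAL(precision, scale), so that value == unscaled / 10**scale.

    Returns None when the literal cannot be represented exactly, which
    in this sketch means "do not push the predicate down".
    """
    if isinstance(value, float) and not math.isfinite(value):
        return None  # NaN and infinities have no decimal equivalent
    # Fraction(value) is exact for both ints and binary floats.
    scaled = Fraction(value) * 10 ** scale
    if scaled.denominator != 1:
        return None  # more fractional digits than the scale allows
    unscaled = scaled.numerator
    if abs(unscaled) >= 10 ** precision:
        return None  # more digits than the precision allows
    return unscaled


# d1 is DECIMAL(9,0) in the example query; both literal forms map to
# the same unscaled value, so either could drive the same predicate.
int_literal = to_unscaled_decimal(132842, 9, 0)
float_literal = to_unscaled_decimal(132842.0, 9, 0)
```

Using exact rational arithmetic avoids silently rounding a FLOAT or
DOUBLE literal, which could otherwise change which row groups a pushed
down predicate appears to match.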


