Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/03/29 09:16:00 UTC

[jira] [Commented] (IMPALA-10116) Builtin cast function's selectivity is different from that of explicit cast

    [ https://issues.apache.org/jira/browse/IMPALA-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310516#comment-17310516 ] 

ASF subversion and git services commented on IMPALA-10116:
----------------------------------------------------------

Commit 2e5589d85fa5b6be1ec0378d31d4040547f6ef71 in impala's branch refs/heads/master from Aman Sinha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2e5589d ]

IMPALA-10116: Allow unwrapping a builtin cast function similar to CastExpr

This change allows unwrapping a builtin cast function such as
casttobigint(col) in the same way as CAST(col AS BIGINT). Unwrapping
exposes the SlotRef of the underlying column, which is needed to
compute predicate selectivity correctly. Without unwrapping, a
predicate such as 'casttobigint(l_quantity) IS NOT NULL' falls back
to the default 10% selectivity, which is not accurate.
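
The unwrapping described above can be sketched roughly as follows. This is
a minimal, self-contained illustration and not the code from the patch: the
Expr, SlotRef, CastExpr and FunctionCallExpr classes are toy stand-ins, and
recognizing a builtin cast by a "castto" name prefix is an assumption made
only for this example.

```java
import java.util.Collections;
import java.util.List;

// Toy expression tree. Every class and method here is a hypothetical stand-in
// for illustration; none of this is Impala's actual frontend code.
public class UnwrapCastSketch {
  static class Expr {
    final List<Expr> children;
    Expr(List<Expr> children) { this.children = children; }
  }

  // Leaf node that refers to a column.
  static final class SlotRef extends Expr {
    final String column;
    SlotRef(String column) {
      super(Collections.<Expr>emptyList());
      this.column = column;
    }
  }

  // Explicit CAST(child AS targetType).
  static final class CastExpr extends Expr {
    final String targetType;
    CastExpr(Expr child, String targetType) {
      super(Collections.singletonList(child));
      this.targetType = targetType;
    }
  }

  // Function call identified by name, e.g. casttobigint(arg).
  static final class FunctionCallExpr extends Expr {
    final String fnName;
    FunctionCallExpr(String fnName, Expr arg) {
      super(Collections.singletonList(arg));
      this.fnName = fnName;
    }

    // Assumption: a builtin cast is a one-argument function named "castto...".
    boolean isBuiltinCastFunction() {
      return fnName.startsWith("castto") && children.size() == 1;
    }
  }

  // Peels explicit casts and builtin cast calls off the expression.
  // Returns the underlying SlotRef, or null if anything else is in the way.
  static SlotRef unwrapSlotRef(Expr e) {
    Expr cur = e;
    while (cur != null) {
      if (cur instanceof SlotRef) return (SlotRef) cur;
      boolean isCast = cur instanceof CastExpr
          || (cur instanceof FunctionCallExpr
              && ((FunctionCallExpr) cur).isBuiltinCastFunction());
      if (!isCast) return null;
      cur = cur.children.get(0);
    }
    return null;
  }

  public static void main(String[] args) {
    Expr viaFunction = new FunctionCallExpr("casttobigint", new SlotRef("d_week_seq"));
    Expr viaCast = new CastExpr(new SlotRef("d_week_seq"), "BIGINT");
    // Both forms should resolve to the same column.
    System.out.println(unwrapSlotRef(viaFunction).column);
    System.out.println(unwrapSlotRef(viaCast).column);
  }
}
```

In this sketch both forms resolve to the same column, so whatever
statistics exist for that column can be used for either predicate.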

Note that Impala does not allow a user query to call the builtin
cast function directly; users must use the explicit CAST syntax
instead. However, because the frontend jar can be used as a library
by an external frontend module, such a module can call the builtin
function, and this patch makes its behavior consistent with CAST.

Testing:
 - Ran PlannerTest
 - Tested manually by commenting out the code in
   FunctionCallExpr.analyzeImpl() that throws an AnalysisException
   when a builtin cast function is called. Because a regular query
   cannot reach this path, no new test was added.

Cardinality before this change:
explain select * from date_dim d1, date_dim d2
   where d1.d_week_seq = d2.d_week_seq - 52
    and casttobigint(d1.d_week_seq) is not null
    and casttobigint(d2.d_week_seq) is not null

  SCAN HDFS [tpcds.date_dim d1]
    HDFS partitions=1/1 files=1 size=9.84MB
    predicates: casttobigint(d1.d_week_seq) IS NOT NULL
    runtime filters: RF000 -> d1.d_week_seq
    row-size=255B cardinality=7.30K

Cardinality after this change:
  SCAN HDFS [tpcds.date_dim d1]
    HDFS partitions=1/1 files=1 size=9.84MB
    predicates: casttobigint(d1.d_week_seq) IS NOT NULL
    runtime filters: RF000 -> d1.d_week_seq
    row-size=255B cardinality=73.05K
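
The two cardinalities differ by the default 10% selectivity factor. The
sketch below reproduces that arithmetic under stated assumptions: the scan
input is taken to be 73,049 rows (consistent with the 73.05K shown above),
d_week_seq is assumed to have no NULLs, and IS NOT NULL selectivity is
assumed to be (numRows - numNulls) / numRows when column stats are
reachable. The class and method names are hypothetical and are not
Impala's.

```java
// Illustrative arithmetic only; names and the selectivity formula are
// assumptions for this example, not Impala's planner code.
public class SelectivitySketch {
  static final double DEFAULT_SELECTIVITY = 0.1;

  // Selectivity of "<expr> IS NOT NULL". Falls back to the default when the
  // column behind the expression is unknown, so no stats are available.
  static double isNotNullSelectivity(long numRows, long numNulls, boolean columnKnown) {
    if (!columnKnown || numRows <= 0 || numNulls < 0) return DEFAULT_SELECTIVITY;
    return (double) (numRows - numNulls) / numRows;
  }

  static long estimateCardinality(long numRows, double selectivity) {
    return Math.round(numRows * selectivity);
  }

  public static void main(String[] args) {
    long numRows = 73049L;  // assumed row count, consistent with 73.05K
    long numNulls = 0L;     // assumed: no NULLs in d_week_seq

    // Before: the cast function hides the column, so the default applies.
    long before = estimateCardinality(
        numRows, isNotNullSelectivity(numRows, numNulls, false));
    // After: the column is reachable, so its stats drive the estimate.
    long after = estimateCardinality(
        numRows, isNotNullSelectivity(numRows, numNulls, true));

    System.out.println("before=" + before + " after=" + after);
  }
}
```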

Change-Id: Idf82b2de78c6a7051ea036062f177d69e2558940
Reviewed-on: http://gerrit.cloudera.org:8080/16407
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Builtin cast function's selectivity is different from that of explicit cast
> ---------------------------------------------------------------------------
>
>                 Key: IMPALA-10116
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10116
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 3.4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>
> Query 1 below uses 'casttobigint()' in the IS NOT NULL predicate, and its selectivity is computed as the default 10% of the input rows, resulting in a cardinality of 7.30K. Query 2 uses an explicit CAST expression in the same predicate and gets the correct cardinality of 73.05K.
> Query 1:
> {noformat}
> Query: explain select * from date_dim d1, date_dim d2 where d1.d_week_seq = d2.d_week_seq - 52 and casttobigint(d1.d_week_seq) is not null and casttobigint(d2.d_week_seq) is not null
>                                                        |
> | 00:SCAN HDFS [tpcds.date_dim d1]                            |
> |    HDFS partitions=1/1 files=1 size=9.84MB                  |
> |    predicates: casttobigint(d1.d_week_seq) IS NOT NULL      |
> |    runtime filters: RF000 -> d1.d_week_seq                  |
> |    row-size=255B cardinality=7.30K                          |
> +-------------------------------------------------------------+
> {noformat}
> Query 2:
> {noformat}
> Query: explain select * from date_dim d1, date_dim d2 where d1.d_week_seq = d2.d_week_seq - 52 and cast(d1.d_week_seq as bigint) is not null and cast(d2.d_week_seq as bigint) is not null 
> | 00:SCAN HDFS [tpcds.date_dim d1]                            |
> |    HDFS partitions=1/1 files=1 size=9.84MB                  |
> |    predicates: CAST(d1.d_week_seq AS BIGINT) IS NOT NULL    |
> |    runtime filters: RF000 -> d1.d_week_seq                  |
> |    row-size=255B cardinality=73.05K                         |
> +-------------------------------------------------------------+
> {noformat}
> Query 1 should ideally produce the same cardinality as Query 2. Note that I had to comment out the following lines in FunctionCallExpr.java to run Query 1, because a user query is not supposed to call the builtin cast function directly. However, an external frontend module that calls functions in impala-frontend.jar can do so, and we should make the behavior consistent.
> {noformat}
> +//    if (isBuiltinCastFunction()) {
> +//      throw new AnalysisException(toSql() +
> +//          " is reserved for internal use only. Use 'cast(expr AS type)' instead.");
> +//    }
> {noformat}


