You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/07/17 03:37:00 UTC

[jira] [Reopened] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

     [ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-24402:
----------------------------------

This was reverted.

> Optimize `In` expression when only one element in the collection or collection is empty 
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-24402
>                 URL: https://issues.apache.org/jira/browse/SPARK-24402
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct<profileID:int>
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with numbers less than certain thresholds,
> we should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the
> numbers of the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we
> should do benchmark for using different set implementation for faster
> query.
>  # `filter(if (condition) null else false)` can be optimized to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org