Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/01/15 14:28:00 UTC

[jira] [Resolved] (SPARK-26203) Benchmark performance of In and InSet expressions

     [ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-26203.
-----------------------------------
       Resolution: Fixed
         Assignee: Anton Okolnychyi
    Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23291

> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>
>                 Key: SPARK-26203
>                 URL: https://issues.apache.org/jira/browse/SPARK-26203
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Assignee: Anton Okolnychyi
>            Priority: Major
>             Fix For: 3.0.0
>
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons, to avoid the O(n) time complexity of {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then (e.g., generation of Java code to evaluate expressions), so it is worth measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions.
> Based on my preliminary investigation, there is room for a number of optimizations, many of which depend on the specific data type.
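
The conversion described in the issue can be illustrated with a minimal sketch. This is not Spark's implementation: it is plain Python that models {{In}} as a linear scan and {{InSet}} as a hash-set lookup, and switches between them when the value list exceeds a threshold. The names make_in, make_inset, optimize_in, time_probe and ASSUMED_THRESHOLD are hypothetical, the threshold value of 10 is an illustrative assumption, and SQL null semantics are ignored.

```python
import timeit
from typing import Callable, Hashable, Sequence, Tuple

# Illustrative value only (an assumption); in Spark the threshold is read from
# spark.sql.optimizer.inSetConversionThreshold.
ASSUMED_THRESHOLD = 10


def make_in(values: Sequence[Hashable]) -> Callable[[Hashable], bool]:
    """Model of In: compare the probe against each value, O(n) per probe."""
    items = list(values)

    def probe(x: Hashable) -> bool:
        for v in items:
            if x == v:
                return True
        return False

    return probe


def make_inset(values: Sequence[Hashable]) -> Callable[[Hashable], bool]:
    """Model of InSet: hash-set membership, O(1) expected per probe."""
    items = frozenset(values)
    return lambda x: x in items


def optimize_in(
    values: Sequence[Hashable], threshold: int = ASSUMED_THRESHOLD
) -> Tuple[str, Callable[[Hashable], bool]]:
    """Choose the set-based form only when the list exceeds the threshold.

    Every element here is a plain Python value, which stands in for the
    "all values are literals" condition of the rule.
    """
    if len(values) > threshold:
        return "InSet", make_inset(values)
    return "In", make_in(values)


def time_probe(probe: Callable[[Hashable], bool], x: Hashable, number: int) -> float:
    """Return the total seconds spent calling probe(x) `number` times."""
    return timeit.timeit(lambda: probe(x), number=number)
```

Calling optimize_in(list(range(5))) yields the linear-scan form, while optimize_in(list(range(20))) yields the set-based form. Timings from time_probe only show the asymptotic difference between the two strategies; they say nothing about JVM-specific costs such as autoboxing, which is what the benchmarks proposed in this issue are meant to measure.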



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
