You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Anton Okolnychyi (JIRA)" <ji...@apache.org> on 2018/11/28 16:21:00 UTC

[jira] [Updated] (SPARK-26203) Benchmark performance of In and InSet expressions

     [ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Okolnychyi updated SPARK-26203:
-------------------------------------
    Description: 
{{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type.


  was:
{{OptimizeIn}} rule that replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type.



> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>
>                 Key: SPARK-26203
>                 URL: https://issues.apache.org/jira/browse/SPARK-26203
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. 
> Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org