You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/11 19:21:00 UTC

[jira] [Commented] (SPARK-26205) Optimize In expression for bytes, shorts, ints

    [ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717812#comment-16717812 ] 

ASF GitHub Bot commented on SPARK-26205:
----------------------------------------

aokolnychyi commented on issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
URL: https://github.com/apache/spark/pull/23171#issuecomment-446327322
 
 
   @dbtsai @cloud-fan @mgaido91 @rxin @dongjoon-hyun @viirya @gatorsmile 
   
   PR #23291 contains benchmarks for different data types. 
   
   @rxin was your latest suggestion to convert `In` to `InSet` if all elements are literals independently of data types and the number of elements?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Optimize In expression for bytes, shorts, ints
> ----------------------------------------------
>
>                 Key: SPARK-26205
>                 URL: https://issues.apache.org/jira/browse/SPARK-26205
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. {{InSet}} is an optimized version of {{In}}, which is supposed to improve the performance if the number of elements is big enough. However, {{InSet}} actually degrades the performance in many cases due to various reasons (benchmarks will be available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and {{lookupswitch}} bytecode instructions. In short, we can improve our time complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java {{switch}} statements. We will have O\(1\) time complexity if our case values are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can substantially outperform {{InSet}} even on 250+ elements.
> See [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] and [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] for more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org