Posted to issues@spark.apache.org by "Wei Lu (Jira)" <ji...@apache.org> on 2019/11/27 03:47:00 UTC

[jira] [Commented] (SPARK-30045) Difference in subquery performance

    [ https://issues.apache.org/jira/browse/SPARK-30045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983143#comment-16983143 ] 

Wei Lu commented on SPARK-30045:
--------------------------------

The slow form in the reproduction code is actually the NOT IN predicate:
{code:java}
WHERE field NOT IN (SELECT s FROM table)
{code}
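
NOT IN over a subquery is null-aware: if the subquery returns even one NULL, the predicate can never be true for any row, so the planner cannot use a plain anti join the way it can for NOT EXISTS. A minimal sketch of the semantic difference (table names here are illustrative, not from the ticket):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("not-in-null-semantics").getOrCreate()

# Two tiny tables; t2 contains a NULL id.
spark.createDataFrame([(1,), (2,), (3,)], ["id"]).createOrReplaceTempView("t1")
spark.createDataFrame([(1,), (None,)], ["id"]).createOrReplaceTempView("t2")

# Null-aware: because t2 contains a NULL, no row of t1 can be proven
# "not in" t2, so this returns 0 rows.
spark.sql("SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)").count()  # 0

# Plain equality: the NULL in t2 matches nothing, so ids 2 and 3 survive.
spark.sql("""
    SELECT * FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id)
""").count()  # 2
{code}
When both id columns are provably non-null the two forms return the same rows, which is why rewriting NOT IN as NOT EXISTS (or filtering NULLs out of the subquery) is the usual workaround.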

> Difference in subquery performance
> ----------------------------------
>
>                 Key: SPARK-30045
>                 URL: https://issues.apache.org/jira/browse/SPARK-30045
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment:
>            Reporter: Max Lapan
>            Priority: Major
>
> Hi!
> There is a weird performance difference between
> {code:java}
> WHERE field NOT IN (SELECT s FROM table)
> {code}
> and 
> {code:java}
> WHERE NOT EXISTS (SELECT s FROM table WHERE s = field)
> {code}
> In my toy benchmark code (attached to the ticket), which has 1M rows in the outer select and 20k rows in the inner, the first query runs in 4.5 minutes while the second completes in 14 seconds.
> With larger datasets the difference is even more dramatic: the first query form never finishes and eats all the memory.
> We've tested this on Spark 2.4.0 (AWS EMR).
> Reproduction code:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
>
> spark = SparkSession \
>         .builder \
>         .appName("SQL bug reproduce") \
>         .getOrCreate()
>
> def do_q1():
>     # The slow NOT IN form.
>     df = spark.sql("""
>     select * from df1
>     where df1.id not in (select id from df2)
>     """)
>     return df.count()
>
> def do_q2():
>     # The fast NOT EXISTS form.
>     dff = spark.sql("""
>     select * from df1
>     where not exists (select id from df2 where df2.id = df1.id)
>     """)
>     return dff.count()
>
> def build_data(n1, n2, p1=2, p2=2):
>     # n1 rows in the outer table, n2 in the inner,
>     # repartitioned into p1 and p2 partitions.
>     d1 = (Row(id=v) for v in range(n1))
>     d2 = (Row(id=v) for v in range(n2))
>     df1 = spark.createDataFrame(d1).repartition(p1)
>     df2 = spark.createDataFrame(d2).repartition(p2)
>     df1.createOrReplaceTempView("df1")
>     df2.createOrReplaceTempView("df2")
>
> build_data(1000000, 20000, 10, 5)
> do_q2()     # 14 seconds
> do_q1()     # 4.5 minutes
> {code}
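>
> To see where the time goes, compare the physical plans of the two queries. In Spark 2.4, NOT IN over a subquery is rewritten into a null-aware anti join, which Spark executes with BroadcastNestedLoopJoin, while NOT EXISTS becomes an ordinary left anti join. A quick check (run after build_data above):
> {code:python}
> # The NOT IN plan should show BroadcastNestedLoopJoin with a null-aware
> # condition like "(id = id) OR isnull(id = id)"; the NOT EXISTS plan
> # should show a plain broadcast-hash or sort-merge left anti join.
> spark.sql("""
>     select * from df1
>     where df1.id not in (select id from df2)
> """).explain()
>
> spark.sql("""
>     select * from df1
>     where not exists (select id from df2 where df2.id = df1.id)
> """).explain()
> {code}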


