You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2018/11/15 12:12:00 UTC
[jira] [Resolved] (SPARK-26057) Table joining is broken in Spark 2.4

     [ https://issues.apache.org/jira/browse/SPARK-26057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-26057.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.1
                   3.0.0

Issue resolved by pull request 23035
[https://github.com/apache/spark/pull/23035]

> Table joining is broken in Spark 2.4
> ------------------------------------
>
>                 Key: SPARK-26057
>                 URL: https://issues.apache.org/jira/browse/SPARK-26057
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Pavel Parkhomenko
>            Priority: Major
>             Fix For: 3.0.0, 2.4.1
>
>
> This sample works in spark-shell 2.3.1 and throws an exception in 2.4.0
> {code:java}
> import java.util.Arrays.asList
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.createDataFrame(
>   asList(
>     Row("1-1", "sp", 6),
>     Row("1-1", "pc", 5),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 3),
>     Row("2-2", "pc", 2),
>     Row("2-2", "sp", 1)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("n", IntegerType)))
> ).createOrReplaceTempView("cc")
> spark.createDataFrame(
>   asList(
>     Row("sp", 1),
>     Row("sp", 1),
>     Row("sp", 2),
>     Row("sp", 3),
>     Row("sp", 3),
>     Row("sp", 4),
>     Row("sp", 5),
>     Row("sp", 5),
>     Row("pc", 1),
>     Row("pc", 2),
>     Row("pc", 2),
>     Row("pc", 3),
>     Row("pc", 4),
>     Row("pc", 4),
>     Row("pc", 5)
>   ),
>   StructType(List(StructField("layout", StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("p")
> spark.createDataFrame(
>  asList(
>     Row("1-1", "sp", 1),
>     Row("1-1", "sp", 2),
>     Row("1-1", "pc", 3),
>     Row("1-2", "pc", 3),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 4),
>     Row("2-1", "sp", 5),
>     Row("2-2", "pc", 6),
>     Row("2-2", "sp", 6)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("c")
> spark.sql("""
> SELECT cc.id, cc.layout, count(*) as m
>   FROM cc
>   JOIN p USING(layout)
>   WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout AND c.ts > p.ts)
>   GROUP BY cc.id, cc.layout
> """).createOrReplaceTempView("pcc")
> spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
> spark.sql("""
> SELECT cc.id, cc.layout, n, m
>   FROM cc
>   LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
> """).createOrReplaceTempView("k")
> spark.sql("SELECT * FROM k ORDER BY id, layout").show
> {code}
> Actually I tried to catch another bug: similar calculations with joins and nested queries have different results in Spark 2.3.1 and 2.4.0, but when I tried to create a minimal example I received exception
> {code:java}
> java.lang.RuntimeException: Couldn't find id#0 in [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org