Posted to issues@spark.apache.org by "Martin Junghanns (JIRA)" <ji...@apache.org> on 2018/04/03 09:15:00 UTC

[jira] [Created] (SPARK-23855) Performing a Join after a CrossJoin can lead to data corruption

Martin Junghanns created SPARK-23855:
----------------------------------------

             Summary: Performing a Join after a CrossJoin can lead to data corruption
                 Key: SPARK-23855
                 URL: https://issues.apache.org/jira/browse/SPARK-23855
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1, 2.2.0
            Reporter: Martin Junghanns


The following test produces a wrong result for the join operation. The error only occurs when joining on the first column of the cross-joined DataFrame. A subsequent select restores the correct data (which is of course a workaround, not a solution).

The test passes on 2.3.0, though. It would be nice to get this fixed in the 2.2.x releases, too. Could someone point me to the issue where this was fixed? It would be nice to see the solution in code.
{code}
it("should correctly perform a join after a cross") {
    val df1 = sparkSession.createDataFrame(Seq(Tuple1(0L)))
      .toDF("a")

    val df2 = sparkSession.createDataFrame(Seq(Tuple1(1L)))
      .toDF("b")

    val df3 = sparkSession.createDataFrame(Seq(Tuple1(0L)))
      .toDF("c")

    val cross = df1.crossJoin(df2)
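    // sanity check: prints the expected cross product (a=0, b=1)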
    cross.show()

    val joined = cross
      .join(df3, cross.col("a") === df3.col("c"))

    joined.show()
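    // on 2.2.x, the show() above prints (a=0, b=0, c=1) instead of the expected (a=0, b=1, c=0)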

    val selected = joined.select("*")
    selected.show()
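    // the select("*") above restores the correct row: (a=0, b=1, c=0)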
  }
{code}
prints the following output. The second table (the join result) is corrupted: it shows (a=0, b=0, c=1) instead of the expected (a=0, b=1, c=0). The select("*") in the third table restores the correct values:
{code:java}
+---+---+
|  a|  b|
+---+---+
|  0|  1|
+---+---+

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  0|  1|
+---+---+---+

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  1|  0|
+---+---+---+
{code}
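For anyone stuck on 2.2.x in the meantime, the observation above suggests a stopgap: force an explicit projection right after the join. This is only a sketch based on the behaviour seen in this test, not a verified fix, and the helper name joinAfterCross is hypothetical:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark): joins `right` onto a cross-joined
// DataFrame and immediately projects all columns, which in the test above
// restored the correct values on 2.2.x.
def joinAfterCross(cross: DataFrame, right: DataFrame,
                   leftCol: String, rightCol: String): DataFrame = {
  cross
    .join(right, cross.col(leftCol) === right.col(rightCol))
    .select("*")
}

// usage, mirroring the test:
// val joined = joinAfterCross(cross, df3, "a", "c")
// joined.show()  // (a=0, b=1, c=0)
{code}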


