Posted to issues@spark.apache.org by "Nattavut Sutyanyong (JIRA)" <ji...@apache.org> on 2016/09/08 16:04:21 UTC

[jira] [Comment Edited] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

    [ https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474239#comment-15474239 ] 

Nattavut Sutyanyong edited comment on SPARK-14040 at 9/8/16 4:03 PM:
---------------------------------------------------------------------

The root cause of this problem is the way Spark generates a unique identifier for each column: when one DataFrame is derived from another, both sides of the join carry the same identifiers, so there is no obvious way to distinguish multiple references to the same column. This problem has been discovered in different contexts, and different approaches to fixing it have been discussed in various places:

SPARK-13801
SPARK-17337

A partial fix was implemented in the {{dedupRight()}} method for the {{Join}} operator.

The latest attempt to fix this is SPARK-17154. We should solve this problem at its root cause; I will post my idea in SPARK-17154, and we should close this JIRA as a duplicate.
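To make the ambiguity concrete, here is a minimal sketch based on the reproduction in this JIRA (assuming a spark-shell session with the {{toDF}} implicits in scope):

{code}
// a is built from b, so a("c") and b("c") refer to the same underlying column
// identifier, and Spark cannot tell which side of the join is meant. The report
// below shows this surfacing as:
//   WARN Column: Constructing trivially true equals predicate, 'c#... = c#...'
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")

a.join(b, a("c") === b("c"), "left_outer").show
{code}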




was (Author: nsyca):
The root cause of this problem is the way Spark generates a unique identifier for each column: when one DataFrame is derived from another, both sides of the join carry the same identifiers, so there is no obvious way to distinguish multiple references to the same column. This problem has been discovered in different contexts, and different approaches to fixing it have been discussed in various places:

SPARK-14040
SPARK-17337

A partial fix was implemented in the {{dedupRight()}} method for the {{Join}} operator.

The latest attempt to fix this is SPARK-17154. We should solve this problem at its root cause; I will post my idea in SPARK-17154, and we should close this JIRA as a duplicate.



> Null-safe and equality join produces incorrect result with filtered dataframe
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-14040
>                 URL: https://issues.apache.org/jira/browse/SPARK-14040
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: Ubuntu Linux 15.10
>            Reporter: Denton Cockburn
>
> Initial issue reported here: http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>       val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>       val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
>       a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>       a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>       a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
> But that produced a warning:
>       WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c and the evaluated condition is actually a.newc = a.c"


