Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/10/10 20:08:00 UTC
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

    [ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645484#comment-16645484 ] 

Sean Owen commented on SPARK-25150:
-----------------------------------

The second join joins "states-joined-with-humans" with "zombies", but the join condition references a column of the dataframe "states", which isn't one of the two dataframes being joined. All of these tables have a "State" column, but that's not what this code specifies. I had thought that wasn't allowed or didn't work. Can you try breaking the join down into two statements and making sure the column references in each join refer only to the dataframes being joined?

If that condition is being ignored, then you end up with a full cross join, right? Spark seems to think so, because it asks whether that's what you intend. And its answer is correct for a cross join; it looks correct to me. You do see the NH and RI zombie stats mixed with each other, though the count is 1 in every case. So you get twice as many rows as in the result of the first join, with 1 zombie each.
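
To make the row-count argument concrete, here is a minimal plain-Python sketch (not Spark code). The rows are made up for illustration, since the attached CSVs aren't reproduced in this message, and the helper names cross_join and inner_join_on_state are hypothetical. It only shows that dropping the join condition pairs every left row with every right row.

```python
from itertools import product

# Hypothetical rows: the real persons.csv / states.csv contents are not shown
# in this message, so these values are assumptions for illustration only.
states_with_humans = [
    {"State": "NH", "humans": 1},
    {"State": "RI", "humans": 1},
]
zombies = [
    {"State": "NH", "zombies": 1},
    {"State": "RI", "zombies": 1},
]


def cross_join(left, right):
    """Pair every left row with every right row (no join condition)."""
    return list(product(left, right))


def inner_join_on_state(left, right):
    """Keep only the pairs whose State values match."""
    return [(l, r) for l, r in product(left, right) if l["State"] == r["State"]]


crossed = cross_join(states_with_humans, zombies)
matched = inner_join_on_state(states_with_humans, zombies)

# With two zombie rows, the cross join has twice as many rows as the left side.
print(len(states_with_humans), len(matched), len(crossed))
```

With these assumed inputs, the keyed join keeps one pair per state, while the cross join pairs NH with RI's zombie row and vice versa, which would look like the "mixed" stats in the attached output.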

 

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.


