You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/10/10 11:46:05 UTC
[jira] [Resolved] (SPARK-10968) Incorrect Join behavior in filter
conditions
[ https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-10968.
-------------------------------
Resolution: Not A Problem
FWIW I also agree that this does not look like a problem. The values being compared have different types. You're suggesting a higher level semantic interpretation that I don't think is what is expected of the analyzer.
> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
> Key: SPARK-10968
> URL: https://issues.apache.org/jira/browse/SPARK-10968
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.4.1, 1.5.1
> Environment: RHEL, spark-shell
> Reporter: RaviShankar KS
> Labels: DataFramejoin, sql,
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of nested columns being compared.
> As long as leaf columns have the same name under a nested column, should order matter ??
> Consider below example for two data frames d5 and d5_opp :
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do not have the same ordering.
> -- d5.printSchema
> root
> |-- key: integer (nullable = false)
> |-- value: array (nullable = true)
> | |-- element: struct (containsNull = true)
> | | |-- col1: string (nullable = true)
> | | |-- col2: string (nullable = true)
> |-- value1: struct (nullable = false)
> | |-- col1: string (nullable = false)
> | |-- col2: string (nullable = false)
> -- d5_opp.printSchema
> root
> |-- key: integer (nullable = false)
> |-- value: array (nullable = true)
> | |-- element: struct (containsNull = true)
> | | |-- col2: string (nullable = true)
> | | |-- col1: string (nullable = true)
> |-- value1: struct (nullable = false)
> | |-- col2: string (nullable = false)
> | |-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In spark 1.4, no exception is raised, but join result is incorrect :
> -- d5.as("d5").join( d5_opp.as("d5_opp"), $"d5.value" === $"d5_opp.value", "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to data type mismatch: differing types in '(value = value)' (array<struct<col1:string,col2:string>> and array<struct<col2:string,col1:string>>).;
> -- d5.as("d5").join( d5_opp.as("d5_opp"), $"d5.value1" === $"d5_opp.value1", "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due to data type mismatch: differing types in '(value1 = value1)' (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -------------------------
> The only work-around is to explode the conditions for every leaf field.
> In our case, we are generating the conditions and dataframes programmatically, and exploding the conditions for every leaf field is additional overhead, and may not be always possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org