You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Anton Okolnychyi (JIRA)" <ji...@apache.org> on 2017/07/08 22:08:02 UTC
[jira] [Commented] (SPARK-20660) Not able to merge Dataframes with different column orders

    [ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079340#comment-16079340 ] 

Anton Okolnychyi commented on SPARK-20660:
------------------------------------------

I used the following code to investigate this problem:


{code}
    val sc = spark.sparkContext

    val inputSchema1 = StructType(
      StructField("key", StringType) ::
      StructField("value", IntegerType) ::
      Nil)
    val rdd1 = sc.parallelize(1 to 2).map(x => Row(x.toString, 555))
    val df1 = spark.createDataFrame(rdd1, inputSchema1)

    val inputSchema2 = StructType(
      StructField("value", IntegerType) ::
      StructField("key", StringType) ::
      Nil)
    val rdd2 = sc.parallelize(1 to 2).map(x => Row(555, x.toString))
    val df2 = spark.createDataFrame(rdd2, inputSchema2)

    val result = df1.union(df2)
    result.explain(true)
    result.show()
{code}

and it gives the following output:


{noformat}
== Parsed Logical Plan ==
'Union
:- LogicalRDD [key#2, value#3]
+- LogicalRDD [value#9, key#10]

== Analyzed Logical Plan ==
key: string, value: string
Union
:- Project [key#2, cast(value#3 as string) AS value#20]
:  +- LogicalRDD [key#2, value#3]
+- Project [cast(value#9 as string) AS value#21, key#10]
   +- LogicalRDD [value#9, key#10]

== Optimized Logical Plan ==
Union
:- Project [key#2, cast(value#3 as string) AS value#20]
:  +- LogicalRDD [key#2, value#3]
+- Project [cast(value#9 as string) AS value#21, key#10]
   +- LogicalRDD [value#9, key#10]

== Physical Plan ==
Union
:- *Project [key#2, cast(value#3 as string) AS value#20]
:  +- Scan ExistingRDD[key#2,value#3]
+- *Project [cast(value#9 as string) AS value#21, key#10]
   +- Scan ExistingRDD[value#9,key#10]
{noformat}


{noformat}
+---+-----+
|key|value|
+---+-----+
|  1|  555|
|  2|  555|
|555|    1|
|555|    2|
+---+-----+
{noformat}

It is important to notice the result schema that consists of two strings even though the original schemes were different. This happens because of the {{WidenSetOperationTypes}} rule, which introduces casts to strings in the analyzed logical plan (since one column has type String the other is promoted to String as well). According to the Scala doc, this cast is done on purpose but leads to a confusing result in this particular scenario.

I am wondering what would be a better outcome in this case.

> Not able to merge Dataframes with different column orders
> ---------------------------------------------------------
>
>                 Key: SPARK-20660
>                 URL: https://issues.apache.org/jira/browse/SPARK-20660
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Michel Lemay
>            Priority: Minor
>
> Union on two dataframes with different column orders is not supported and lead to hard to find issues.
> Here is an example showing the issue.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil)
> var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema)
> var b = a.select($"value" * 2 alias "value", $"key")  // any transformation changing column order will show the problem.
> a.union(b).show
> // in order to make it work, we need to reorder columns
> val bCols = a.columns.map(aCol => b(aCol))
> a.union(b.select(bCols:_*)).show
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org