Posted to issues@spark.apache.org by "Munesh Bandaru (JIRA)" <ji...@apache.org> on 2017/09/01 21:31:00 UTC
[jira] [Commented] (SPARK-20761) Union uses column order rather than schema

    [ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151181#comment-16151181 ] 

Munesh Bandaru commented on SPARK-20761:
----------------------------------------

Since the ticket was closed as 'Not a Problem', the workaround is to use 'select' to reorder the columns of one of the dataframes so that both sides of the union line up.
With a large number of columns, though, listing every column by hand is tedious and error-prone.

Instead, we can use the column list of one dataframe to put the other dataframe's columns in the same order:

comb_df = df1.unionAll(df2.select(df1.columns))
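
The one-liner above assumes both dataframes have exactly the same column names; if they do not, the select fails or the union quietly pairs up the wrong columns. A minimal sketch of a guarded version follows, assuming the snippet above is PySpark. The helper name union_in_left_order is hypothetical and not part of Spark; it relies only on the columns, select and union members that PySpark DataFrames provide.

```python
def union_in_left_order(left, right):
    """Union `right` onto `left` after reordering `right`'s columns by name.

    Hypothetical helper, not part of Spark. It only uses the `columns`,
    `select` and `union` members of the two DataFrame arguments.
    """
    left_cols = list(left.columns)
    right_cols = list(right.columns)

    # Reordering by name only makes sense when both sides carry the
    # same set of column names, so fail loudly otherwise.
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError(
            "column names differ: %s vs %s" % (left_cols, right_cols)
        )

    # select() puts right's columns into left's order; union() then
    # matches the two sides position by position, as the ticket describes.
    return left.union(right.select(*left_cols))
```

With the dataframes from the comment this would be called as comb_df = union_in_left_order(df1, df2). Spark 2.3.0 and later also provide unionByName, which resolves columns by name instead of by position.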

> Union uses column order rather than schema
> ------------------------------------------
>
>                 Key: SPARK-20761
>                 URL: https://issues.apache.org/jira/browse/SPARK-20761
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Nakul Jeirath
>            Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when the order of columns differs between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> +---+--------+--------+----------+
> {noformat}
> This is the data we'd expect, but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}


