Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2019/05/27 15:18:00 UTC

[jira] [Commented] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes

    [ https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849016#comment-16849016 ] 

Liang-Chi Hsieh commented on SPARK-27855:
-----------------------------------------

Note that the printed schemas of the two Datasets are different: the columns are in a different order. Dataset.union resolves columns by position; this is documented in the API doc.

If you want to resolve columns by name, please use the Dataset.unionByName API.
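A minimal sketch of the suggested fix, adapted from the reproduction code below (not from the original thread); it assumes a SparkSession named `spark` is in scope with its implicits imported:

{code:java}
import spark.implicits._

case class Entity(key: Int, a: Int, b: String)

// Same setup as in the bug report: two Datasets of the same type,
// built from DataFrames whose columns are in a different order.
val ds1 = Seq((2, 2, "2")).toDF("key", "a", "b").as[Entity]
val ds2 = Seq((1, "1", 1)).toDF("key", "b", "a").as[Entity]

// ds1 union ds2 fails: union matches columns by position, and the
// underlying schemas differ. unionByName matches columns by name.
val combined = ds1.unionByName(ds2)
combined.show()
{code}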

> Union failed between 2 datasets of the same type converted from different dataframes
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-27855
>                 URL: https://issues.apache.org/jira/browse/SPARK-27855
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3
>            Reporter: Hao Ren
>            Priority: Major
>
> Two Datasets of the same type, converted from different DataFrames, cannot be unioned.
> Here is the code to reproduce the problem. It seems `union` just checks the schema of the original DataFrame, even though the two Datasets have already been converted to the same Dataset type.
> {code:java}
> case class Entity(key: Int, a: Int, b: String)
> val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
> val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
> df1.printSchema
> df2.printSchema
> df1 union df2
> {code}
> Result
> {code:java}
> defined class Entity
> df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
> df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field]
> converted
> root
> |-- key: integer (nullable = false)
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> root
> |-- key: integer (nullable = false)
> |-- b: string (nullable = true)
> |-- a: integer (nullable = false)
> org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "a")
> - root class: "Entity"{code}
> The problem is that the two Datasets of the same type have different schemas.
> The schema of a Dataset does not preserve the field order of the case class definition, but keeps that of the original DataFrame.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org