Posted to issues@spark.apache.org by "Michael Armbrust (Jira)" <ji...@apache.org> on 2020/03/31 16:52:00 UTC

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

    [ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071968#comment-17071968 ] 

Michael Armbrust commented on SPARK-29358:
------------------------------------------

I think we should reconsider closing this as won't fix:
 - I think the semantics of this operation make sense. We already get the same behavior today by writing the data out as JSON or Parquet and reading it back with a merged schema; going through files is just a really inefficient way to accomplish the same end goal.
 - I don't think it is a problem to move "away from SQL union". This is a clearly named, different operation. IMO this one makes *more* sense than SQL union. It is much more likely that columns with the same name are semantically equivalent than columns at the same ordinal with different names.
 - We are not breaking the behavior of unionByName. Currently it throws an exception in these cases. We are making more data transformations possible, but anything that was working before will continue to work. You could add a boolean flag (sketched below) if you were really concerned, but I think I would skip that.
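
For illustration, a minimal sketch of what the opt-in flag could look like from the caller's side; the parameter name allowMissingColumns is an assumption here, not an existing API:

{code:java}
// Hypothetical opt-in flag on unionByName: pad columns that are
// missing on either side with nulls instead of throwing.
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")

// The default call keeps today's behavior (AnalysisException on
// missing columns); the flag fills the gaps with nulls.
df1.unionByName(df2, allowMissingColumns = true)
{code}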

> Make unionByName optionally fill missing columns with nulls
> -----------------------------------------------------------
>
>                 Key: SPARK-29358
>                 URL: https://issues.apache.org/jira/browse/SPARK-29358
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls. 
> {code:java}
> import spark.implicits._ // for toDF; already in scope in spark-shell
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently, the workaround is to add the missing columns as null literals by hand and then call unionByName, but this is clunky:
> {code:java}
> import org.apache.spark.sql.functions.lit // already in scope in spark-shell
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}
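
A minimal Scala sketch of a reusable helper that generalizes the workaround above: it pads each side with typed nulls and then delegates to the existing unionByName. The helper name unionByNameFillMissing and the cast-to-the-other-side's-type choice are assumptions, not part of Spark's API:

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: union two DataFrames by column name,
// filling columns missing on either side with typed nulls.
def unionByNameFillMissing(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCols  = left.columns.toSet
  val rightCols = right.columns.toSet

  // Add each missing column as a null literal cast to the type it has
  // on the other side, so the union's type resolution is well defined.
  def pad(df: DataFrame, present: Set[String], other: DataFrame): DataFrame =
    other.columns.filterNot(present.contains).foldLeft(df) { (acc, c) =>
      acc.withColumn(c, lit(null).cast(other.schema(c).dataType))
    }

  pad(left, leftCols, right).unionByName(pad(right, rightCols, left))
}
{code}

With df1 and df2 from the description above, unionByNameFillMissing(df1, df2) would produce the x/y table shown there.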



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org