Posted to issues@spark.apache.org by "Syedhamjath (Jira)" <ji...@apache.org> on 2020/07/04 16:09:00 UTC

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

    [ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151349#comment-17151349 ] 

Syedhamjath commented on SPARK-29358:
-------------------------------------

I also ran into this issue when one of the data frames is missing a column. It would make sense either to add the missing column filled with nulls, or to treat the calling data frame's schema as fixed and ignore the additional columns from the argument data frame, with the behavior chosen by an additional parameter.

I find it confusing that it throws an error; it should not throw an error on missing or additional columns.

I vote for this issue. If Spark doesn't solve this, I don't think anything else can.
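
For illustration only, the two behaviors described above (fill missing columns with nulls, or keep the calling data frame's schema and ignore extra columns) could be sketched on top of the existing unionByName as follows. The object and helper names (UnionSketch, withMissingAsNull, unionFillingNulls, unionKeepingLeftSchema) are hypothetical and not part of Spark. The sketch assumes simple top-level column names without dots and compares names case-sensitively, which is stricter than Spark's default resolution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object UnionSketch {
  // Hypothetical helper: add every column that `other` has and `df` lacks,
  // as a null column cast to the type it has in `other`.
  // Assumption: names are compared case-sensitively.
  def withMissingAsNull(df: DataFrame, other: DataFrame): DataFrame =
    other.schema.fields
      .filterNot(f => df.columns.contains(f.name))
      .foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))

  // Behavior 1: union over all columns of both sides, with nulls
  // wherever one side does not have a column.
  def unionFillingNulls(left: DataFrame, right: DataFrame): DataFrame =
    withMissingAsNull(left, right).unionByName(withMissingAsNull(right, left))

  // Behavior 2: keep the left schema fixed. Extra columns on the right
  // are dropped; columns the right lacks are filled with nulls.
  def unionKeepingLeftSchema(left: DataFrame, right: DataFrame): DataFrame = {
    val aligned = withMissingAsNull(right, left).select(left.columns.map(col): _*)
    left.unionByName(aligned)
  }

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch outside spark-shell.
    val spark = SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")

    unionFillingNulls(df1, df2).show()      // columns x and y
    unionKeepingLeftSchema(df1, df2).show() // column x only

    spark.stop()
  }
}
```

With the df1 and df2 from the issue description, the first call yields six rows over the columns x and y, and the second yields six rows over x alone, the last three of them null.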

 

> Make unionByName optionally fill missing columns with nulls
> -----------------------------------------------------------
>
>                 Key: SPARK-29358
>                 URL: https://issues.apache.org/jira/browse/SPARK-29358
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently, the workaround is to add the missing columns to each DataFrame by hand before calling unionByName, but this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}


