
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/02/13 00:40:00 UTC

[jira] [Updated] (SPARK-38193) [Spark Core] [Feature] change of unionByName parameter

     [ https://issues.apache.org/jira/browse/SPARK-38193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38193:
---------------------------------
    Affects Version/s: 3.3.0
                           (was: 3.2.1)

> [Spark Core] [Feature] change of unionByName parameter
> ------------------------------------------------------
>
>                 Key: SPARK-38193
>                 URL: https://issues.apache.org/jira/browse/SPARK-38193
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Daniel Davies
>            Priority: Minor
>
> Hello,
> I have a quick question about the unionByName function. It currently accepts a boolean parameter, "allowMissingColumns", that gives some tolerance when merging datasets with different schemas [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2170]. Because that parameter is a boolean, the implementation is a bit restrictive: the only alternative to requiring matching columns is to make unionByName add all columns from both dataframes. We have other use cases in our workflows, for example taking only the column names that are in both dataframes, and I assume other users will have different merge strategies in mind as well.
> Does it seem reasonable to extend the parameter from "allowMissingColumns" to a string-typed "mode" parameter natively in Spark? If so, I'm happy to open a PR. The change would involve amending the [ResolveUnion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala] utility to make it more flexible in merging columns; to a user it would look much more like the 'join' operator, where a join strategy is selected.
> I've posted this question on the dev mailing list also; happy to continue the conversation there if that is preferable.
>  
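For illustration only, here is a minimal pure-Python sketch of the output columns the proposed modes might produce. It is not Spark code and not the reporter's design: the mode names "strict", "outer" and "inner" and the helper resolve_union_columns are hypothetical, since the issue only asks for a string-typed "mode" parameter without naming its values. The sketch also assumes case-sensitive column names with no duplicates, which is a simplification.

```python
from typing import List

# Hypothetical mode names (assumptions, not part of any Spark API).
STRICT = "strict"  # models the current default: column names must match
OUTER = "outer"    # models allowMissingColumns=true: keep columns from both sides
INNER = "inner"    # models the requested use case: keep only shared columns


def resolve_union_columns(left: List[str], right: List[str], mode: str = STRICT) -> List[str]:
    """Return the output column names of a by-name union under the given mode."""
    left_set, right_set = set(left), set(right)
    if mode == STRICT:
        # Both sides must expose the same set of names; order follows the left side.
        if left_set != right_set:
            raise ValueError("column names differ between the two sides")
        return list(left)
    if mode == OUTER:
        # Left columns first, then right-only columns appended at the end.
        return list(left) + [c for c in right if c not in left_set]
    if mode == INNER:
        # Only columns present on both sides, in left-side order.
        return [c for c in left if c in right_set]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(resolve_union_columns(["id", "name"], ["id", "age"], OUTER))
    print(resolve_union_columns(["id", "name"], ["id", "age"], INNER))
```

In this sketch the "outer" branch mirrors the documented behaviour of allowMissingColumns=true, where columns missing from the left side are added at the end of the result schema, while the "inner" branch is the intersection strategy described in the issue.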


