Posted to issues@spark.apache.org by "Kaya Kupferschmidt (Jira)" <ji...@apache.org> on 2022/07/22 06:54:00 UTC

[jira] [Commented] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

    [ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569842#comment-17569842 ] 

Kaya Kupferschmidt commented on SPARK-39838:
--------------------------------------------

Please find a PR at https://github.com/apache/spark/pull/37251

> Passing an empty Metadata object to Column.as() should clear the metadata
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39838
>                 URL: https://issues.apache.org/jira/browse/SPARK-39838
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Kaya Kupferschmidt
>            Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to individual columns as key/value pairs. The metadata is attached via the method "Column.as(name, metadata)". This works as expected as long as the metadata object is not empty. When an empty metadata object is passed, however, the column in the resulting DataFrame still holds the metadata of the original incoming column, i.e. the method cannot be used to reset the metadata of a column.
> This is not the expected behaviour; it changed in Spark 3.3.0. In Spark 3.2.1 and earlier, passing an empty metadata object to the method "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet reproduces the issue in the Spark shell:
> {code:scala}
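> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
> // Create a DataFrame with one column that has metadata attached
> // (trailing dot keeps the chained call part of the same statement when typed line by line)
> val df1 = spark.range(1, 10).
>     withColumn("col_with_metadata", col("id").as("col_with_metadata", new MetadataBuilder().putString("metadata", "value").build()))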
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new MetadataBuilder().build()))
> // Display the metadata of the columns in both DataFrames
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code}
> This code prints the following lines to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not meet my expectations: df1 should have non-empty metadata, while df2 should have empty metadata. Instead, df2 still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from the changes in the method "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
>     val res = e match {
>       case a: Alias =>
>         val metadata = if (a.metadata == Metadata.empty) {
>           None
>         } else {
>           Some(a.metadata)
>         }
>         a.copy(child = trimAliases(a.child))(
>           exprId = a.exprId,
>           qualifier = a.qualifier,
>           explicitMetadata = metadata,
>           nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>       case a: MultiAlias =>
>         a.copy(child = trimAliases(a.child))
>       case other => trimAliases(other)
>     }
>     res.asInstanceOf[T]
>   }
> {code}
> The method replaces an empty metadata object on an Alias with None, which in turn means that the Alias inherits its child's metadata.
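
The mechanism described in the analysis can be illustrated with a small stand-alone model. This is only a sketch and does not use Spark's classes: `Meta`, `Attr`, `Alias`, and `trim` below are hypothetical stand-ins, and the fallback to the child's metadata when no explicit metadata is set is an assumption about how Alias resolves its metadata, based on the analysis above.

```scala
// Toy model of the behaviour described above; NOT Spark's actual classes.
object AliasMetadataSketch {
  // Hypothetical stand-in for Spark's Metadata: a plain key/value map.
  type Meta = Map[String, String]

  sealed trait Expr { def metadata: Meta }

  // A column that carries its own metadata.
  final case class Attr(name: String, metadata: Meta) extends Expr

  // An alias either carries explicit metadata or (assumed) inherits the child's.
  final case class Alias(child: Expr, name: String, explicitMetadata: Option[Meta]) extends Expr {
    def metadata: Meta = explicitMetadata.getOrElse(child.metadata)
  }

  // Mirrors the quoted logic: empty metadata is turned into None.
  def trim(a: Alias): Alias = {
    val md = if (a.metadata.isEmpty) None else Some(a.metadata)
    a.copy(explicitMetadata = md)
  }

  val source: Attr = Attr("col_with_metadata", Map("metadata" -> "value"))

  // The user explicitly asks for empty metadata.
  val cleared: Alias = Alias(source, "col_without_metadata", Some(Map.empty[String, String]))

  def main(args: Array[String]): Unit = {
    println(cleared.metadata)       // prints: Map()
    println(trim(cleared).metadata) // prints: Map(metadata -> value)
  }
}
```

In this model the explicit "empty" request is indistinguishable from "no metadata given" once `trim` has run, which is why the child's metadata reappears.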


