Posted to issues@spark.apache.org by "Kaya Kupferschmidt (Jira)" <ji...@apache.org> on 2022/07/22 05:55:00 UTC

[jira] [Created] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

Kaya Kupferschmidt created SPARK-39838:
------------------------------------------

             Summary: Passing an empty Metadata object to Column.as() should clear the metadata
                 Key: SPARK-39838
                 URL: https://issues.apache.org/jira/browse/SPARK-39838
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kaya Kupferschmidt


h2. Description

The Spark DataFrame API allows developers to attach arbitrary metadata to individual columns as key/value pairs. The metadata is attached via the method "Column.as(name, metadata)". This works as expected as long as the metadata object is not empty. But when an empty metadata object is passed, the corresponding column in the resulting DataFrame still holds the metadata of the original incoming column, i.e. the method cannot be used to reset the metadata of a column.

This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 3.2.1 and earlier, passing an empty metadata object to "Column.as(name, metadata)" resets the column's metadata as expected.

h2. Steps to Reproduce

The following code snippet will show the issue in Spark shell:
{code:scala}
import org.apache.spark.sql.types.MetadataBuilder

// Create a DataFrame with one column with Metadata attached
val df1 = spark.range(1,10)
    .withColumn("col_with_metadata", col("id").as("col_with_metadata", new MetadataBuilder().putString("metadata", "value").build()))

// Create a derived DataFrame which should reset the metadata of the column
val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new MetadataBuilder().build()))

// Display metadata of both DataFrames columns
println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
{code} 

The expected result is that df1 has non-empty metadata while df2 has empty metadata. However, this is not the case: df2 still holds the same metadata as df1.
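For reference, this is the output I would expect from the snippet above versus what Spark 3.3.0 actually prints (the exact JSON rendering of the metadata may differ slightly):
{noformat}
// expected
df1 metadata: {"metadata":"value"}
df2 metadata: {}

// actual on Spark 3.3.0
df1 metadata: {"metadata":"value"}
df2 metadata: {"metadata":"value"}
{noformat}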

h2. Analysis

I think the problem stems from the changes to the method "trimNonTopLevelAliases" in the class AliasHelper:
{code:scala}
  protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
    val res = e match {
      case a: Alias =>
        val metadata = if (a.metadata == Metadata.empty) {
          None
        } else {
          Some(a.metadata)
        }
        a.copy(child = trimAliases(a.child))(
          exprId = a.exprId,
          qualifier = a.qualifier,
          explicitMetadata = metadata,
          nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
      case a: MultiAlias =>
        a.copy(child = trimAliases(a.child))
      case other => trimAliases(other)
    }

    res.asInstanceOf[T]
  }
{code}

The method removes any empty metadata object from an Alias, which in turn means that the Alias will inherit its child's metadata.
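As a user-side workaround (not a fix for the underlying issue), the metadata can be cleared outside of the Alias resolution path by rebuilding the schema directly. A minimal sketch, continuing from the df2 above; dropColumnMetadata is a hypothetical helper, and I have not verified that the round trip through the RDD API has no other side effects:
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{Metadata, StructType}

// Hypothetical helper: rebuild the schema with empty metadata for the given column
// and reapply it via createDataFrame, which does not go through trimNonTopLevelAliases.
def dropColumnMetadata(df: DataFrame, column: String): DataFrame = {
  val cleanedSchema = StructType(df.schema.map { field =>
    if (field.name == column) field.copy(metadata = Metadata.empty) else field
  })
  df.sparkSession.createDataFrame(df.rdd, cleanedSchema)
}

val df3 = dropColumnMetadata(df2, "col_without_metadata")
println(s"df3 metadata: ${df3.schema("col_without_metadata").metadata}")  // expected: {}
{code}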


