Posted to issues@spark.apache.org by "Michel Trottier-McDonald (Jira)" <ji...@apache.org> on 2021/04/30 16:24:00 UTC

[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

    [ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337491#comment-17337491 ] 

Michel Trottier-McDonald commented on SPARK-34805:
--------------------------------------------------

I believe this is not a PySpark-specific issue. We have a unit test in [transmogrif.ai|https://transmogrif.ai/] where we specify [column metadata manually|https://github.com/salesforce/TransmogrifAI/blob/90a0f298f14506a27c84a71de414d53a30cf687f/core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala#L137] and check that the metadata is properly passed on to a model that consumes this column. The metadata is correctly attached to the column using {{.as(columnName, metadata)}}, but it is lost as soon as the select is executed. I've traced the issue to changes in {{ExpressionEncoder}}:
 * In Spark 2.4, [it takes a schema argument|https://github.com/apache/spark/blob/e89526d2401b3a04719721c923a6f630e555e286/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L222] through which the column metadata is passed along.
 * In Spark 3.0, [it no longer takes|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L232] this schema parameter, and the column metadata appears to be lost as a result.

I can't tell whether this change was intentional, but it renders the metadata argument of the {{.as}} [method|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1133] on {{Column}} mostly useless.
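
For illustration, here is roughly the usage pattern in question (a minimal sketch; the DataFrame, column names, and metadata key are made up and are not the actual TransmogrifAI code):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 0.5), (2, 1.5)).toDF("id", "feature")

// Attach custom metadata to the column via Column.as(alias, metadata)
val meta = new MetadataBuilder().putString("origin", "sanity-check").build()
val selected = df.select(col("feature").as("feature", meta))

// Per the analysis above, whether this metadata survives the select is what
// changed between Spark 2.4 and 3.0 in the encoder code path
println(selected.schema("feature").metadata)
{code}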

> PySpark loses metadata in DataFrame fields when selecting nested columns
> ------------------------------------------------------------------------
>
>                 Key: SPARK-34805
>                 URL: https://issues.apache.org/jira/browse/SPARK-34805
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1, 3.1.1
>            Reporter: Mark Ressler
>            Priority: Major
>         Attachments: jsonMetadataTest.py
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for fields in the schema, that metadata is lost when a DataFrame selects nested fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata
> {code}
> returns an empty dictionary, where "Field0" is the name of the first field in the DataFrame and "SubField0" is the name of the first nested field under "Field0".
>  
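
For reference, a rough Scala sketch of the nested-column scenario described above (the schema, field names, and metadata are made up for illustration); if the issue is indeed not PySpark-specific, the same empty metadata should show up here on the affected versions:

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Nested schema with metadata on the inner field, mirroring Field0.SubField0
val subFieldMeta = new MetadataBuilder().putString("description", "example").build()
val schema = StructType(Seq(
  StructField("Field0", StructType(Seq(
    StructField("SubField0", StringType, nullable = true, metadata = subFieldMeta)
  )))
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row("value")))),
  schema
)

// The metadata is present on the nested field in the original schema...
println(df.schema("Field0").dataType.asInstanceOf[StructType]("SubField0").metadata)

// ...but reportedly comes back empty on 3.0.1 / 3.1.1 after selecting the nested column
println(df.select("Field0.SubField0").schema("SubField0").metadata)
{code}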


