Posted to issues@spark.apache.org by "Michel Trottier-McDonald (Jira)" <ji...@apache.org> on 2021/04/30 16:24:00 UTC
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337491#comment-17337491 ]
Michel Trottier-McDonald commented on SPARK-34805:
--------------------------------------------------
I believe this is not a PySpark-specific issue. We have a unit test in [transmogrif.ai|https://transmogrif.ai/] where we specify [column metadata manually|https://github.com/salesforce/TransmogrifAI/blob/90a0f298f14506a27c84a71de414d53a30cf687f/core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala#L137] and check whether the metadata is properly passed on to a model that consumes the column. The metadata is correctly attached to the column using {{.as(columnName, metadata)}}, but it is lost as soon as the select is executed. I've traced the issue to changes in {{ExpressionEncoder}}:
* In Spark 2.4, [it takes a schema argument|https://github.com/apache/spark/blob/e89526d2401b3a04719721c923a6f630e555e286/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L222] through which the column metadata is passed along
* In Spark 3.0, [it no longer takes|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L232] this schema parameter, and the column metadata appears to be lost as a result
I can't tell whether this change was intentional, but it renders the metadata argument of the {{.as}} [method|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1133] in {{Column}} mostly useless.
> PySpark loses metadata in DataFrame fields when selecting nested columns
> ------------------------------------------------------------------------
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1, 3.1.1
> Reporter: Mark Ressler
> Priority: Major
> Attachments: jsonMetadataTest.py
>
>
> For a DataFrame schema with nested StructTypes where metadata is set on fields in the schema, that metadata is lost when the DataFrame selects nested fields. For example, suppose
> {code:python}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:python}
> df.select('Field0.SubField0').schema.fields[0].metadata
> {code}
> returns an empty dictionary, where "Field0" is the name of the first field in the DataFrame and "SubField0" is the name of the first nested field under "Field0".
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)