Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2020/02/26 08:34:00 UTC
[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]
[ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enrico Minack updated SPARK-30319:
----------------------------------
Description:
The behaviour of {{as[T]}} is not intuitive when you read code like {{df.as[T].write.csv("data.csv")}}. The result depends on the actual schema of {{df}}, whereas the signature {{def as[T](): Dataset[T]}} suggests that the result should be agnostic to the schema of {{df}}. The expected behaviour is not provided elsewhere:
* Extra columns that are not part of the type {{T}} are not dropped.
* Order of columns is not aligned with schema of {{T}}.
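The two points above can be reproduced with a small sketch. The case class {{Person}}, the column names and the local Spark session are assumptions made up for illustration; they do not come from the issue.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical type, used only for this illustration.
case class Person(id: Long, name: String)

object AsExample {
  // Returns the column names of a Dataset[Person] obtained through as[Person].
  def columnsAfterAs(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Column order differs from Person, and "age" is not a field of Person.
    val df = Seq(("Alice", 1L, 30)).toDF("name", "id", "age")

    // as[Person] only attaches an encoder; the underlying schema is still that of df.
    df.as[Person].columns.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("as-example").getOrCreate()
    // Prints the columns of df, not those of Person: name, id, age
    println(columnsAfterAs(spark).mkString(", "))
    spark.stop()
  }
}
```

Writing such a Dataset with {{write.csv}} would therefore emit all three columns in the order of {{df}}.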
A method that enforces the schema of {{T}} on a given Dataset would be very convenient: it lets you articulate and guarantee the above assumptions about your data with the native Spark Dataset API. Such a method plays a more explicit and enforcing role than {{as[T]}} with respect to columns, column order and column type.
Possible naming of a stricter version of {{as[T]}}:
* {{as[T](strict = true)}}
* {{toDS[T]}} (as in {{toDF}})
* {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
The naming {{toDS[T]}} is chosen in the linked PR.
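As a rough sketch of what "merely selecting the columns of schema {{T}}" could look like in user code: the helper {{selectAs}} below, the object {{StrictAs}} and the case class {{Item}} are hypothetical names for illustration and are not the implementation from the linked PR. The sketch only handles top-level columns and does not cast column types.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical type, used only for this illustration.
case class Item(id: Long, name: String)

object StrictAs {
  // Hypothetical helper: select exactly the top-level fields of T,
  // in the order given by T's encoder schema, then apply as[T].
  def selectAs[T](df: DataFrame)(implicit enc: Encoder[T]): Dataset[T] = {
    val columns = enc.schema.fields.map(field => col(field.name))
    df.select(columns: _*).as[T]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("strict-as").getOrCreate()
    import spark.implicits._

    // Extra column "age" and a column order that differs from Item.
    val df = Seq(("Alice", 1L, 30)).toDF("name", "id", "age")

    val ds = selectAs[Item](df)
    // The extra column is dropped and the order follows Item: id, name
    println(ds.columns.mkString(", "))

    spark.stop()
  }
}
```

If a field of {{T}} is missing from the input, the {{select}} fails during analysis, which gives the enforcing behaviour the description asks for.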
was:
The behaviour of {{as[T]}} is not intuitive when you read code like {{df.as[T].write.csv("data.csv")}}. The result depends on the actual schema of {{df}}, whereas the signature {{def as[T](): Dataset[T]}} suggests that the result should be agnostic to the schema of {{df}}. The expected behaviour is not provided elsewhere:
* Extra columns that are not part of the type {{T}} are not dropped.
* Order of columns is not aligned with schema of {{T}}.
* Columns are not cast to the types of {{T}}'s fields. They have to be cast explicitly.
A method that enforces the schema of {{T}} on a given Dataset would be very convenient: it lets you articulate and guarantee the above assumptions about your data with the native Spark Dataset API. Such a method plays a more explicit and enforcing role than {{as[T]}} with respect to columns, column order and column type.
Possible naming of a stricter version of {{as[T]}}:
* {{as[T](strict = true)}}
* {{toDS[T]}} (as in {{toDF}})
* {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
The naming {{toDS[T]}} is chosen here.
> Adds a stricter version of as[T]
> --------------------------------
>
> Key: SPARK-30319
> URL: https://issues.apache.org/jira/browse/SPARK-30319
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Enrico Minack
> Priority: Major