Posted to issues@spark.apache.org by "Simeon Simeonov (JIRA)" <ji...@apache.org> on 2018/07/22 17:03:00 UTC

[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

     [ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simeon Simeonov updated SPARK-16483:
------------------------------------
    Affects Version/s: 2.3.1
          Description: 
This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels.

DataFrame provides a full set of manipulation operations for top-level columns. They can be added, removed, modified and renamed. The same is not yet true of fields inside structs, even though, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs.
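For reference, all of these operations are one-liners on top-level columns today. A minimal sketch (the dataframe and column names here are purely illustrative) of what currently has no struct-field counterpart:
{code:java}
import org.apache.spark.sql.functions._

// Top-level columns can be added, modified, renamed and removed directly:
val result = df
  .withColumn("amount_usd", col("amount") * col("fx_rate")) // add a column
  .withColumn("city", upper(col("city")))                   // modify a column in place
  .withColumnRenamed("zip", "postal_code")                  // rename a column
  .drop("internal_id")                                      // remove a column
{code}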

Common use-cases include:
 - Remove and/or rename struct field(s) to adjust the schema
 - Fix a data quality issue with a struct field (update/rewrite)

Doing this with the existing API requires manually calling {{named_struct}} and listing every field, including the ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution.
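As a minimal sketch of the problem (assuming a hypothetical {{address}} struct with {{street}}, {{city}} and {{zip}} fields), rewriting just the {{city}} field forces us to re-list every other field:
{code:java}
import org.apache.spark.sql.functions._

// Fixing one nested field today: the whole struct must be rebuilt via named_struct,
// enumerating every field we do NOT want to change.
val fixed = df.withColumn("address",
  expr("named_struct('street', address.street, 'city', upper(address.city), 'zip', address.zip)"))

// If a field is later added to the address schema this expression silently drops it;
// if one is removed, the expression fails. It does not survive schema evolution.
{code}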

It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes (overloaded methods are not shown):
{code:java}
class Column(val expr: Expression) extends Logging {

  // ...

  // matches Dataset.schema semantics
  def schema: StructType

  // matches Dataset.select() semantics
  // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls
  def select(cols: Column*): Column

  // matches Dataset.withColumn() semantics of add or replace
  def withColumn(colName: String, col: Column): Column

  // matches Dataset.drop() semantics
  def drop(colName: String): Column

}

class Dataset[T] ... {

  // ...

  // Equivalent to sparkSession.createDataFrame(toDF.rdd, newSchema)
  def cast(newSchema: StructType): DataFrame

}
{code}
The benefit of the above API is that it unifies the manipulation of top-level and nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That is also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining).
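Purely as a hypothetical usage sketch (these {{Column}} methods and {{Dataset.cast}} do not exist today, and the schema and field names are invented), the proposal would reduce the {{named_struct}} example above to something like:
{code:java}
import org.apache.spark.sql.functions._

// Hypothetical: modify and drop nested fields without enumerating the untouched ones.
val cleaned = df.withColumn("address",
  col("address")
    .withColumn("city", upper(col("address.city"))) // add or replace a nested field
    .drop("zip"))                                    // remove a nested field

// Hypothetical Dataset.cast(newSchema), roughly what one writes today as:
val recast = spark.createDataFrame(df.toDF.rdd, newSchema)
{code}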

  was:
This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. 

DataFrame provides a full set of manipulation operations for top-level columns. They can be added, removed, modified and renamed. The same is not yet true of fields inside structs, even though, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs.

Common use-cases include:

- Remove and/or rename struct field(s) to adjust the schema
- Fix a data quality issue with a struct field (update/rewrite)

Doing this with the existing API requires manually calling {{named_struct}} and listing every field, including the ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution.

It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes, here is the skeleton implementation of an update() implicit that we've used to modify any existing field in a dataframe. (Note that it depends on various other utilities and implicits that are not included.) https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66


> Unifying struct fields and columns
> ----------------------------------
>
>                 Key: SPARK-16483
>                 URL: https://issues.apache.org/jira/browse/SPARK-16483
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level columns. They can be added, removed, modified and renamed. The same is not yet true of fields inside structs, even though, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs.
> Common use-cases include:
>  - Remove and/or rename struct field(s) to adjust the schema
>  - Fix a data quality issue with a struct field (update/rewrite)
> Doing this with the existing API requires manually calling {{named_struct}} and listing every field, including the ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution.
> It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes (overloaded methods are not shown):
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>   // matches Dataset.schema semantics
>   def schema: StructType
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
> class Dataset[T] ... {
>   // ...
>   // Equivalent to sparkSession.createDataFrame(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
> The benefit of the above API is that it unifies the manipulation of top-level and nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That is also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org