Posted to reviews@spark.apache.org by mahmoudmahdi24 <gi...@git.apache.org> on 2018/08/01 15:09:46 UTC

[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

GitHub user mahmoudmahdi24 opened a pull request:

    https://github.com/apache/spark/pull/21944

    [SPARK-24988][SQL]Add a castBySchema method which casts all the values of a DataFrame based on the DataTypes of a StructType

    ## What changes were proposed in this pull request?
    
    The main goal of this pull request is to extend the DataFrame API with a method that casts all the values of a DataFrame according to the DataTypes of a StructType.
    
    This feature can be useful when we have a large DataFrame and need to perform multiple casts. In that case, we don't have to cast each column independently; we simply pass a StructType with the desired types to castBySchema (in real-world use, this schema is generally provided by the client, as it was in my case).
    
    Here's an example of how to apply this method (say we have a DataFrame of strings and want to cast the values of the second column to Int):
    ```scala
    // imports needed for the example; toDF assumes a SparkSession's implicits are in scope
    import org.apache.spark.sql.types._
    import spark.implicits._

    // We start by creating the DataFrame
    val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")

    // we prepare the StructType of the casted DataFrame we want to obtain:
    val schema = StructType(Seq(
      StructField("name", StringType, true),
      StructField("id", IntegerType, true)))

    // and then, we simply use the method castBySchema:
    val castedDf = df.castBySchema(schema)
    ```
    
    
    ## How was this patch tested?
    
    I modified DataFrameSuite to test the newly added method (castBySchema).
    I first tested the method on a simple DataFrame with a simple schema, then on DataFrames with complex schemas (nested StructTypes, for example).
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mahmoudmahdi24/spark SPARK-24988

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21944
    
----
commit b48819e3894e4d2f246fc2dba7db73ad5714757d
Author: mahmoud_mahdi <ma...@...>
Date:   2018-08-01T14:00:22Z

    Add a castBySchema method which casts all the values of a DataFrame based on the DataTypes of a StructType

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207209633
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    I'm still not convinced we shouldn't add this, since it might be helpful for Spark users.
    I fully agree about keeping it out of the Dataset API, but could we add it to the functions object in the org.apache.spark.sql package, for example?
    Thanks 


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207164547
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    @maropu shall we add the @Experimental annotation in that case? What do you suggest to make this method less sensitive? Thanks


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207160479
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    -1 (I think it is a pretty sensitive issue to add a new API in `Dataset`...)


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207156078
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    Hello @HyukjinKwon, thanks for your feedback.
    Actually, some methods in the current API are one-liners, but they help users by showing what the API can do (printSchema, dtypes, and columns are perfect examples of that).
    Even as a one-liner, this method can be helpful because it tells Spark users that they can cast a DataFrame by passing a schema.
    This method can be very useful in the big-data world: we generally parse files whose values are all strings, and we then have to cast those values according to a schema. In my case, my client provides Avro schemas to define the column names and types.
    I searched a lot for a method that applies a schema to a DataFrame, but I couldn't find one (https://stackoverflow.com/questions/51561715/cast-values-of-a-spark-dataframe-using-a-defined-structtype/51562763#51562763).
    Even searching on GitHub, I only see examples where people cast each column separately.


---



[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by holdensmagicalunicorn <gi...@git.apache.org>.
Github user holdensmagicalunicorn commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    @mahmoudmahdi24, thanks! I am a bot who has found some folks who might be able to help with the review: @rxin, @gatorsmile and @yhuai


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 closed the pull request at:

    https://github.com/apache/spark/pull/21944


---



[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    Closed the PR. This might be a useful trick, but we want to avoid adding too many methods to the API. We'll reopen this if many users ask for it.


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207182155
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    Most one-liner APIs should have been considered more carefully before being added; this doesn't mean we are okay with one-liner APIs.
    
    You can just define it on the application side and use it. Pretty easy and simple, and you can use it to answer questions about casting. I don't think it's worth adding. There are already too many open APIs.
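    The application-side definition suggested here can be sketched as an implicit (extension) class. This is a hedged sketch, not the patch's actual code: Spark's types are replaced by minimal stand-ins (`SchemaField`, `FakeFrame`) so the pattern compiles on its own. With real Spark on the classpath, `FakeFrame` would be `Dataset[_]` and `selectExpr` the real method.
    
    ```scala
    // Minimal stand-ins for StructField and DataFrame, for illustration only.
    final case class SchemaField(name: String, sqlType: String)
    final case class FakeFrame(exprs: Seq[String]) {
      // Mimics Dataset.selectExpr: records the SQL expressions it was given.
      def selectExpr(es: String*): FakeFrame = FakeFrame(es)
    }

    object CastSyntax {
      // Enriches the frame type with castBySchema without touching its source.
      implicit class CastOps(df: FakeFrame) {
        def castBySchema(schema: Seq[SchemaField]): FakeFrame =
          df.selectExpr(schema.map(f => s"CAST(${f.name} AS ${f.sqlType}) AS ${f.name}"): _*)
      }
    }
    ```
    
    The same implicit class written over `Dataset[_]` gives user code `df.castBySchema(schema)` with no change to Spark itself.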


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207248446
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    There are many good one-liner tricks, and I would just leave those good tricks in the mailing list or somewhere similar. I wouldn't add an API only because it _might be_ helpful to some users.
    
    We shouldn't add an API only because it _might be_ useful. I would consider adding this if it were requested multiple times, were more than a one-line change, and had no easy workaround.
    
    Otherwise, every system will have an API to send an email.


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by mahmoudmahdi24 <gi...@git.apache.org>.
Github user mahmoudmahdi24 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207250060
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    OK, I understand. Thanks!


---



[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    Thanks, Mahmoud!



---



[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21944
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207083002
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset following the types of a specific StructType.
    +   * This method works also with nested structTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    This is basically one line of code. I don't think we need an API for a one-liner.
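    For readers of the archive, the "one line" in question is presumably a `selectExpr` over per-field CAST expressions. The sketch below builds those expression strings in plain Scala so the idea stands alone; `Field` is a hypothetical stand-in for Spark's `StructField`, and with a real DataFrame one would pass the result to `df.selectExpr(exprs: _*)`.
    
    ```scala
    // Hypothetical stand-in for Spark's StructField.
    case class Field(name: String, sqlType: String)

    // One "CAST(col AS TYPE) AS col" SQL expression string per schema field.
    def castExprs(schema: Seq[Field]): Seq[String] =
      schema.map(f => s"CAST(${f.name} AS ${f.sqlType}) AS ${f.name}")
    ```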


---
