Posted to reviews@spark.apache.org by goungoun <gi...@git.apache.org> on 2018/09/15 13:43:09 UTC

[GitHub] spark pull request #22428: [SPARK-25430][SQL] Add map parameter for withColu...

GitHub user goungoun opened a pull request:

    https://github.com/apache/spark/pull/22428

    [SPARK-25430][SQL] Add map parameter for withColumnRenamed

    ## What changes were proposed in this pull request?
    This PR adds a withColumnRenamed overload that accepts a map of old-to-new column names
    
    ## How was this patch tested?
    unit tests
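A minimal sketch of the rename semantics the overload implies, assuming a hypothetical `resolveRenames` helper (illustrative only, not this PR's actual code): names without an entry in the map are left untouched, which gives the no-op behavior for missing keys.

```scala
// Hypothetical helper (not the PR's code): map each current column name
// through the rename map, leaving names without an entry untouched.
def resolveRenames(columns: Seq[String], renames: Map[String, String]): Seq[String] =
  columns.map(c => renames.getOrElse(c, c))

val renamed = resolveRenames(
  Seq("c1", "c2", "c3"),
  Map("c1" -> "first_column", "c2" -> "second_column"))
println(renamed.mkString(", "))  // first_column, second_column, c3
```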

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/goungoun/spark SPARK-25430

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22428
    
----
commit eb0858989952454e2112bf6247d85d33438525e0
Author: Goun Na <go...@...>
Date:   2018-09-15T13:30:45Z

    [SPARK-18073][SQL] Add map parameter for withColumnRenamed

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22428: [SPARK-25430][SQL] Add map parameter for withColu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22428#discussion_r217937566
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2300,6 +2300,37 @@ class Dataset[T] private[sql](
         }
       }
     
    +  /**
    +   * Returns a new Dataset with columns renamed.
    +   * This is a no-op if schema doesn't contain existingNames in columnMap.
    +   * {{{
    +   *   df.withColumnRenamed(Map(
    +   *     "c1" -> "first_column",
    +   *     "c2" -> "second_column"
    +   *   ))
    +   * }}}
    +   *
    +   * @group untypedrel
    +   * @since 2.4.0
    --- End diff --
    
    branch-2.4 is cut out. We will probably target 3.0.0 if we happen to add new APIs.


---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    Can we simply call the API multiple times? I think we haven't usually added such aliases for an API unless there's a strong argument for it.
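For illustration, "calling the API multiple times" is the chained form below. The `Frame` class here is a minimal stand-in for a DataFrame's column-rename behavior, used only so the example runs without a Spark session; on a real Dataset the chained `withColumnRenamed` calls look identical.

```scala
// Minimal stand-in for a DataFrame's column-renaming behavior, so the
// chaining pattern can be shown without a running Spark session.
final case class Frame(columns: Seq[String]) {
  def withColumnRenamed(existing: String, newName: String): Frame =
    Frame(columns.map(c => if (c == existing) newName else c))
}

// The status quo: one call per renamed column.
val chained = Frame(Seq("c1", "c2", "c3"))
  .withColumnRenamed("c1", "first_column")
  .withColumnRenamed("c2", "second_column")
println(chained.columns.mkString(", "))  // first_column, second_column, c3
```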


---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by goungoun <gi...@git.apache.org>.
Github user goungoun commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    Awesome! @HyukjinKwon, @gatorsmile, thanks for the good information. Let me look into it further. By the way, I still hope this conversation stays open to users' voices, not limited to the developers' perspective. For someone like me who does data wrangling/engineering every day, it makes things easier.


---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by goungoun <gi...@git.apache.org>.
Github user goungoun commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    @HyukjinKwon, thanks for your review. Actually, that is the reason I opened this pull request. I think it is better to give users a reusable option than to have them repeat so much of the same code in their analyses. In a notebook environment, whenever visualization is required in the middle of an analysis, I had to convert column names rather than use them as-is, so that I could deliver the right message to report readers. During that process, I had to repeat withColumnRenamed too many times.
    
    So, I've researched how other users try to overcome this limitation. It seems that users tend to use foldLeft or a for loop with withColumnRenamed, which can cause a performance issue by creating too many DataFrames inside the Spark engine, often without their knowing it. The arguments can be found as follows.
    
    StackOverflow
    - https://stackoverflow.com/questions/38798567/pyspark-rename-more-than-one-column-using-withcolumnrenamed
    - https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala?noredirect=1&lq=1
    
    Spark Issues
    [SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
    
    [SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead
    If foldLeft is used, too many columns can cause a performance issue



---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    The performance issue was introduced by repeating query plan analysis, which is resolved in the current master if I am not mistaken - if you're in doubt, I would suggest doing a quick benchmark. I think this is something we should do with a one-liner helper in application-side code.
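The "one-liner helper in application-side code" is typically a foldLeft over the rename map. In the sketch below, `MiniDF` is a hypothetical stand-in for Dataset so the code runs without a Spark session; with a real DataFrame the body of `renameAll` is the same foldLeft over `withColumnRenamed`.

```scala
// `MiniDF` is a stand-in for Dataset so the sketch runs without Spark;
// with a real DataFrame the renameAll body is the same foldLeft.
final case class MiniDF(columns: Seq[String]) {
  def withColumnRenamed(existing: String, newName: String): MiniDF =
    MiniDF(columns.map(c => if (c == existing) newName else c))
}

// One-liner helper in application-side code: fold the rename map over
// successive withColumnRenamed calls.
def renameAll(df: MiniDF, renames: Map[String, String]): MiniDF =
  renames.foldLeft(df) { case (d, (oldName, newName)) =>
    d.withColumnRenamed(oldName, newName)
  }

val out = renameAll(MiniDF(Seq("c1", "c2", "c3")), Map("c1" -> "a", "c2" -> "b"))
println(out.columns.mkString(", "))  // a, b, c3
```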


---



[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22428
  
    Can one of the admins verify this patch?


---
