You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2015/12/09 07:38:29 UTC

[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/10218

    [SPARK-12227][SQL] Support drop multiple columns specified by Column class in DataFrame API

    https://issues.apache.org/jira/browse/SPARK-12227
    
    In this PR, I added the support to drop multiple columns by reference. This was only supported with the string names.
    
    As `varargs` does not recognise the type for multuple variables, I changed `drop(colNames: String*)` to `drop(colName: String, colNames: String*)` correspondingly to `orderBy(sortCol: String, sortCols: String*)` and `sort(sortCol: String, sortCols: String*)`.
    
    Accordingly, the call in `drop(colName: String)` is changed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-12227

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10218
    
----
commit cae52921a1491f2dee9d130aaeadc21b7e9cb168
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-12-09T06:25:20Z

    Support to drop multiple columns by reference

commit 22235269995f5c525da0d9aef948803883ca91ae
Author: hyukjinkwon <gu...@gmail.com>
Date:   2015-12-09T06:36:58Z

    Change comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163146790
  
    **[Test build #47412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47412/consoleFull)** for PR 10218 at commit [`2223526`](https://github.com/apache/spark/commit/22235269995f5c525da0d9aef948803883ca91ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163809737
  
    More functions for the sake of more functions does not make the API easier to use.  We should not add functions just to be consistent, we should only add them if they are going to be used.
    
    The complexity come from the fact that users who have a seq of string have to do something like this now:
    
    ```scala
    val toDrop = Seq("col1", "col2", "col3")
    df.drop(toDrop.head, toDrop.tail)
    ```
    
    Can you construct a case when you would have columns instead of strings that you wanted to drop?  In the test case is strictly more typing to use this API compared to the one that exists already.  This doesn't seem worth complicating the case above.
    ```scala
    val df = src.drop("a", "b")
    val df = src.drop(src("a"), src("b"))
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163146896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163146900
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47412/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by aray <gi...@git.apache.org>.
Github user aray commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10218#discussion_r47166749
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1271,10 +1271,11 @@ class DataFrame private[sql](
        * @since 1.6.0
        */
       @scala.annotation.varargs
    -  def drop(colNames: String*): DataFrame = {
    +  def drop(colName: String, colNames: String*): DataFrame = {
    --- End diff --
    
    Doesn't this break binary compatibility?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10218#discussion_r47197511
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1271,10 +1271,11 @@ class DataFrame private[sql](
        * @since 1.6.0
        */
       @scala.annotation.varargs
    -  def drop(colNames: String*): DataFrame = {
    +  def drop(colName: String, colNames: String*): DataFrame = {
    --- End diff --
    
    Should I simply add another `colName: String` if I got you correctly and the problem is due to the ambiguity?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/10218


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163819931
  
    yes please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163132221
  
    **[Test build #47412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47412/consoleFull)** for PR 10218 at commit [`2223526`](https://github.com/apache/spark/commit/22235269995f5c525da0d9aef948803883ca91ae).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163797608
  
    I see. I added this as `sort`, `cube`, `select` and etc do the same thing for supporting it by name and expression. Although I am not greatly insightful for this DataFrame APIs, I feel like some users would feel it is inconsistent if most APIs support this by name and expression but this does not.
    
    Also, I think this might not be so complicated. I just added a loop and made `drop(col: Column)` call this but I accept that this might look complicated (in terms of readability).
    
    So, it looks like consistency vs readability for me. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163816325
  
    Thank for the detailed explanation! Then does this mean closing this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10218#discussion_r47201429
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1271,10 +1271,11 @@ class DataFrame private[sql](
        * @since 1.6.0
        */
       @scala.annotation.varargs
    -  def drop(colNames: String*): DataFrame = {
    +  def drop(colName: String, colNames: String*): DataFrame = {
    --- End diff --
    
    Just FYI, I am not too sure about all the versions but at least this works fine at Java 7 and Scala 2.10.4.
    
    - Scala
    ```scala
    class TestCompatibility {
      def test(a: String): Unit = {
        test(a, Seq(): _ *)
      }
      @varargs
      def test(a: String, b: String*): Unit = {
        (a +: b).foreach(println)
      }
      def test(a: Int): Unit = {
        test(Seq(a) : _ *)
      }
      @varargs
      def test(a: Int*): Unit = {
        a.foreach(println)
      }
    }
    ```
    
    - Java
    ```java
    public class Test {
        public static void main(String[] args) {
            new TestCompatibility().test("a");
            new TestCompatibility().test("a", "b");
            new TestCompatibility().test("a", "b", "c");
            new TestCompatibility().test(1);
            new TestCompatibility().test(1, 2);
            new TestCompatibility().test(1, 2, 3);
        }
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10218#issuecomment-163718935
  
    I'm not sure this is worth the complexity.  I think most users will only ever drop by name (since dropping a complex expression doesn't really make sense), and in that case constructing a column is strictly more typing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

Posted by aray <gi...@git.apache.org>.
Github user aray commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10218#discussion_r47232368
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1271,10 +1271,11 @@ class DataFrame private[sql](
        * @since 1.6.0
        */
       @scala.annotation.varargs
    -  def drop(colNames: String*): DataFrame = {
    +  def drop(colName: String, colNames: String*): DataFrame = {
    --- End diff --
    
    Never mind, the annotation is wrong it was added in #9862 which did not get merged into 1.6 so no worry about changing the method signature.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org