You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/10/22 10:37:18 UTC

[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/19551

    [WIP][SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

    ## What changes were proposed in this pull request?
    
    This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.
    
    Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.
    
    ## How was this patch tested?
    
    Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17902

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19551.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19551
    
----
commit ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-10-22T10:31:20Z

    Revive stringsAsFactors option for collect() in SparkR

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19551


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82958/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82958/testReport)** for PR 19551 at commit [`ffe751b`](https://github.com/apache/spark/commit/ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146160421
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1191,6 +1191,9 @@ setMethod("collect",
                         vec <- do.call(c, col)
                         stopifnot(class(vec) != "list")
                         class(vec) <- PRIMITIVE_TYPES[[colType]]
    +                    if (stringsAsFactors && is.character(vec)) {
    --- End diff --
    
    For performance maybe it is better to reverse the order of checks: `is.character(vec) && stringsAsFactors`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82982/testReport)** for PR 19551 at commit [`f9e42d5`](https://github.com/apache/spark/commit/f9e42d546034743d4daa998fb64bb4e91d872a82).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146160321
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
       expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
     })
     
    +test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
    --- End diff --
    
    Would you please verify that factor orders are identical. I wonder if `expect_equal` really verifies order of values in a factor.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82982/testReport)** for PR 19551 at commit [`f9e42d5`](https://github.com/apache/spark/commit/f9e42d546034743d4daa998fb64bb4e91d872a82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...

Posted by HyukjinKwon <gi...@git.apache.org>.
GitHub user HyukjinKwon reopened a pull request:

    https://github.com/apache/spark/pull/19551

    [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

    ## What changes were proposed in this pull request?
    
    This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.
    
    Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.
    
    ## How was this patch tested?
    
    Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17902

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19551.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19551
    
----
commit 145b60f5ff49d65e118689933da77e7d8cc01865
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-10-22T10:31:20Z

    Revive stringsAsFactors option for collect() in SparkR

commit f9e42d546034743d4daa998fb64bb4e91d872a82
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-10-23T11:37:53Z

    Swap condition for better short-circuiting

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146345486
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
       expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
     })
     
    +test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
    --- End diff --
    
    BTW: I think iris data frame all Species values are clustered together. That is why the test is passing (the new factor order ends up being identical to the existing order).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82959/testReport)** for PR 19551 at commit [`145b60f`](https://github.com/apache/spark/commit/145b60f5ff49d65e118689933da77e7d8cc01865).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146236358
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
       expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
     })
     
    +test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
    --- End diff --
    
    
    ```r
    > # Ordered vs unordered
    > or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=TRUE)
    > or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=FALSE)
    > expect_equal(or, or1)
    error: `or` not equal to `or1`.
    Attributes: < Component “class”: Lengths (2, 1) differ (string compare on first 1) >
    Attributes: < Component “class”: 1 string mismatch >
    ```
    
    ```r
    > # level order mismatch
    > or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
    > or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"))
    > expect_equal(or, or1)
    error: `or` not equal to `or1`.
    Attributes: < Component “levels”: 3 string mismatches >
    ```
    
    ```r
    # Data order mismatch
    > or <- factor(c("Lo", "Hi", "Med", "Med", "Hi"), levels=c("Hi", "Lo", "Med"))
    > or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
    > expect_equal(or, or1)
    error: `or` not equal to `or1`.
    4 string mismatches
    ```
    
    ```r
    > or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
    > or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
    > expect_equal(or, or1)
    ```
    
    Would this test address your concern?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82959/testReport)** for PR 19551 at commit [`145b60f`](https://github.com/apache/spark/commit/145b60f5ff49d65e118689933da77e7d8cc01865).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146236525
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1191,6 +1191,9 @@ setMethod("collect",
                         vec <- do.call(c, col)
                         stopifnot(class(vec) != "list")
                         class(vec) <- PRIMITIVE_TYPES[[colType]]
    +                    if (stringsAsFactors && is.character(vec)) {
    --- End diff --
    
    Sure, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    cc @falaki and @shivaram, I think I am not sure if it is the best way, so, I marked this as `[WIP]`. But, it looks at least fixing the issue though. Would you guys mind taking a look when you have some time?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Thank you for review @falaki and @felixcheung.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82982/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/19551


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82959/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19551#discussion_r146345075
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
       expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
     })
     
    +test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
    --- End diff --
    
    thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    Merged to master, branch-2.2 and branch-2.1.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19551
  
    **[Test build #82958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82958/testReport)** for PR 19551 at commit [`ffe751b`](https://github.com/apache/spark/commit/ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org