You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/10/22 10:37:18 UTC
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/19551
[WIP][SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR
## What changes were proposed in this pull request?
This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.
Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.
## How was this patch tested?
Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-17902
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19551.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19551
----
commit ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c
Author: hyukjinkwon <gu...@gmail.com>
Date: 2017-10-22T10:31:20Z
Revive stringsAsFactors option for collect() in SparkR
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on the issue:
https://github.com/apache/spark/pull/19551
LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19551
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82958/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82958/testReport)** for PR 19551 at commit [`ffe751b`](https://github.com/apache/spark/commit/ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146160421
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1191,6 +1191,9 @@ setMethod("collect",
vec <- do.call(c, col)
stopifnot(class(vec) != "list")
class(vec) <- PRIMITIVE_TYPES[[colType]]
+ if (stringsAsFactors && is.character(vec)) {
--- End diff --
For performance maybe it is better to reverse the order of checks: `is.character(vec) && stringsAsFactors`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82982/testReport)** for PR 19551 at commit [`f9e42d5`](https://github.com/apache/spark/commit/f9e42d546034743d4daa998fb64bb4e91d872a82).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146160321
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
})
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
--- End diff --
Would you please verify that factor orders are identical. I wonder if `expect_equal` really verifies order of values in a factor.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82982/testReport)** for PR 19551 at commit [`f9e42d5`](https://github.com/apache/spark/commit/f9e42d546034743d4daa998fb64bb4e91d872a82).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...
Posted by HyukjinKwon <gi...@git.apache.org>.
GitHub user HyukjinKwon reopened a pull request:
https://github.com/apache/spark/pull/19551
[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR
## What changes were proposed in this pull request?
This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.
Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.
## How was this patch tested?
Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-17902
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19551.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19551
----
commit 145b60f5ff49d65e118689933da77e7d8cc01865
Author: hyukjinkwon <gu...@gmail.com>
Date: 2017-10-22T10:31:20Z
Revive stringsAsFactors option for collect() in SparkR
commit f9e42d546034743d4daa998fb64bb4e91d872a82
Author: hyukjinkwon <gu...@gmail.com>
Date: 2017-10-23T11:37:53Z
Swap condition for better short-circuiting
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...
Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146345486
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
})
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
--- End diff --
BTW: I think iris data frame all Species values are clustered together. That is why the test is passing (the new factor order ends up being identical to the existing order).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82959/testReport)** for PR 19551 at commit [`145b60f`](https://github.com/apache/spark/commit/145b60f5ff49d65e118689933da77e7d8cc01865).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146236358
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
})
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
--- End diff --
```r
> # Ordered vs unordered
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=TRUE)
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=FALSE)
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “class”: Lengths (2, 1) differ (string compare on first 1) >
Attributes: < Component “class”: 1 string mismatch >
```
```r
> # level order mismatch
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “levels”: 3 string mismatches >
```
```r
# Data order mismatch
> or <- factor(c("Lo", "Hi", "Med", "Med", "Hi"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
4 string mismatches
```
```r
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
```
Would this test address your concern?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82959/testReport)** for PR 19551 at commit [`145b60f`](https://github.com/apache/spark/commit/145b60f5ff49d65e118689933da77e7d8cc01865).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146236525
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1191,6 +1191,9 @@ setMethod("collect",
vec <- do.call(c, col)
stopifnot(class(vec) != "list")
class(vec) <- PRIMITIVE_TYPES[[colType]]
+ if (stringsAsFactors && is.character(vec)) {
--- End diff --
Sure, thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19551
cc @falaki and @shivaram, I think I am not sure if it is the best way, so, I marked this as `[WIP]`. But, it looks at least fixing the issue though. Would you guys mind taking a look when you have some time?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19551
Thank you for review @falaki and @felixcheung.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82982/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:
https://github.com/apache/spark/pull/19551
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19551
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82959/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19551: [SPARK-17902][R] Revive stringsAsFactors option f...
Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on a diff in the pull request:
https://github.com/apache/spark/pull/19551#discussion_r146345075
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
})
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
--- End diff --
thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [SPARK-17902][R] Revive stringsAsFactors option for coll...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19551
Merged to master, branch-2.2 and branch-2.1.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors option for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19551
**[Test build #82958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82958/testReport)** for PR 19551 at commit [`ffe751b`](https://github.com/apache/spark/commit/ffe751bfbdae6d2bd4c127794ff2f6554d64bf3c).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org