You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by gu...@apache.org on 2020/06/24 02:08:00 UTC

[spark] branch branch-3.0 updated: [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new f18c0f6  [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+
f18c0f6 is described below

commit f18c0f61f8cd3757fc28078a3d92ce69babf04b3
Author: HyukjinKwon <gu...@apache.org>
AuthorDate: Wed Jun 24 11:03:05 2020 +0900

    [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+.
    
    Currently, when you run the codes that runs R native codes, it fails as below with R 4.0.0:
    
    ```r
    df <- createDataFrame(lapply(seq(100), function (e) list(value=e)))
    count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df)))
    ```
    
    ```
    org.apache.spark.SparkException: R unexpectedly exited.
    R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue
    ```
    
    The root cause seems to be related to when an S4 generic method is manually included into the closure's environment via `SparkR:::cleanClosure`. For example, when an RRDD is created via `createDataFrame` with calling `lapply` to convert, `lapply` itself:
    
    https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/RDD.R#L484
    
    is added into the environment of the cleaned closure - because this is not an exposed namespace; however, this is broken in R 4.0.0+ for an unknown reason with an error message such as "attempt to bind a variable to R_UnboundValue".
    
    Actually, we don't need to add the `lapply` into the environment of the closure because it is not supposed to be called in worker side. In fact, there is no private generic methods supposed to be called in worker side in SparkR at all from my understanding.
    
    Therefore, this PR takes a simpler path to work around just by explicitly excluding the S4 generic methods under SparkR namespace to support R 4.0.0. in SparkR.
    
    ### Why are the changes needed?
    
    To support R 4.0.0+ with SparkR, and unblock the releases on CRAN. CRAN requires the tests pass with the latest R.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it will support R 4.0.0 to end-users.
    
    ### How was this patch tested?
    
    Manually tested. Both CRAN and tests with R 4.0.1:
    
    ```
    ══ testthat results  ═══════════════════════════════════════════════════════════
    [ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]
    ✔ |  OK F W S | Context
    ✔ |  11       | binary functions [2.5 s]
    ✔ |   4       | functions on binary files [2.1 s]
    ✔ |   2       | broadcast variables [0.5 s]
    ✔ |   5       | functions in client.R
    ✔ |  46       | test functions in sparkR.R [6.3 s]
    ✔ |   2       | include R packages [0.3 s]
    ✔ |   2       | JVM API [0.2 s]
    ✔ |  75       | MLlib classification algorithms, except for tree-based algorithms [86.3 s]
    ✔ |  70       | MLlib clustering algorithms [44.5 s]
    ✔ |   6       | MLlib frequent pattern mining [3.0 s]
    ✔ |   8       | MLlib recommendation algorithms [9.6 s]
    ✔ | 136       | MLlib regression algorithms, except for tree-based algorithms [76.0 s]
    ✔ |   8       | MLlib statistics algorithms [0.6 s]
    ✔ |  94       | MLlib tree-based algorithms [85.2 s]
    ✔ |  29       | parallelize() and collect() [0.5 s]
    ✔ | 428       | basic RDD functions [25.3 s]
    ✔ |  39       | SerDe functionality [2.2 s]
    ✔ |  20       | partitionBy, groupByKey, reduceByKey etc. [3.9 s]
    ✔ |   4       | functions in sparkR.R
    ✔ |  16       | SparkSQL Arrow optimization [19.2 s]
    ✔ |   6       | test show SparkDataFrame when eager execution is enabled. [1.1 s]
    ✔ | 1175       | SparkSQL functions [134.8 s]
    ✔ |  42       | Structured Streaming [478.2 s]
    ✔ |  16       | tests RDD function take() [1.1 s]
    ✔ |  14       | the textFile() function [2.9 s]
    ✔ |  46       | functions in utils.R [0.7 s]
    ✔ |   0     1 | Windows-specific tests
    ────────────────────────────────────────────────────────────────────────────────
    test_Windows.R:22: skip: sparkJars tag in SparkContext
    Reason: This test is only for Windows, skipped
    ────────────────────────────────────────────────────────────────────────────────
    
    ══ Results ═════════════════════════════════════════════════════════════════════
    Duration: 987.3 s
    
    OK:       2304
    Failed:   0
    Warnings: 0
    Skipped:  1
    ...
    Status: OK
    + popd
    Tests passed.
    ```
    
    Note that I tested to build SparkR in R 4.0.0, and run the tests with R 3.6.3. It all passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837).
    
    Closes #28907 from HyukjinKwon/SPARK-31918.
    
    Authored-by: HyukjinKwon <gu...@apache.org>
    Signed-off-by: HyukjinKwon <gu...@apache.org>
    (cherry picked from commit 11d2b07b74c73ce6d59ac4f7446f1eb8bc6bbb4b)
    Signed-off-by: HyukjinKwon <gu...@apache.org>
---
 R/pkg/R/utils.R                                   |  5 ++++-
 R/pkg/tests/fulltests/test_context.R              |  4 +++-
 R/pkg/tests/fulltests/test_mllib_classification.R | 18 +++++++++---------
 R/pkg/tests/fulltests/test_mllib_clustering.R     |  2 +-
 R/pkg/tests/fulltests/test_mllib_regression.R     |  2 +-
 5 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R
index 65db9c2..cef2fa9 100644
--- a/R/pkg/R/utils.R
+++ b/R/pkg/R/utils.R
@@ -529,7 +529,10 @@ processClosure <- function(node, oldEnv, defVars, checkedFuncs, newEnv) {
         # Namespaces other than "SparkR" will not be searched.
         if (!isNamespace(func.env) ||
             (getNamespaceName(func.env) == "SparkR" &&
-               !(nodeChar %in% getNamespaceExports("SparkR")))) {
+               !(nodeChar %in% getNamespaceExports("SparkR")) &&
+                  # Note that generic S4 methods should not be set to the environment of
+                  # cleaned closure. It does not work with R 4.0.0+. See also SPARK-31918.
+                  nodeChar != "" && !methods::isGeneric(nodeChar, func.env))) {
           # Only include SparkR internals.
 
           # Set parameter 'inherits' to FALSE since we do not need to search in
diff --git a/R/pkg/tests/fulltests/test_context.R b/R/pkg/tests/fulltests/test_context.R
index 6be04b3..f86872d 100644
--- a/R/pkg/tests/fulltests/test_context.R
+++ b/R/pkg/tests/fulltests/test_context.R
@@ -26,7 +26,9 @@ test_that("Check masked functions", {
                      "colnames", "colnames<-", "intersect", "rank", "rbind", "sample", "subset",
                      "summary", "transform", "drop", "window", "as.data.frame", "union", "not")
   version <- packageVersion("base")
-  if (as.numeric(version$major) >= 3 && as.numeric(version$minor) >= 3) {
+  is33Above <- as.numeric(version$major) >= 3 && as.numeric(version$minor) >= 3
+  is40Above <- as.numeric(version$major) >= 4
+  if (is33Above || is40Above) {
     namesOfMasked <- c("endsWith", "startsWith", namesOfMasked)
   }
   masked <- conflicts(detail = TRUE)$`package:SparkR`
diff --git a/R/pkg/tests/fulltests/test_mllib_classification.R b/R/pkg/tests/fulltests/test_mllib_classification.R
index 2da3a02..2a09c23 100644
--- a/R/pkg/tests/fulltests/test_mllib_classification.R
+++ b/R/pkg/tests/fulltests/test_mllib_classification.R
@@ -34,7 +34,7 @@ test_that("spark.svmLinear", {
   summary <- summary(model)
 
   # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))
   expect_true(class(summary$coefficients[, 1]) == "numeric")
 
   coefs <- summary$coefficients[, "Estimate"]
@@ -130,7 +130,7 @@ test_that("spark.logit", {
   summary <- summary(model)
 
   # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))
   expect_true(class(summary$coefficients[, 1]) == "numeric")
 
   versicolorCoefsR <- c(1.52, 0.03, -0.53, 0.04, 0.00)
@@ -242,8 +242,8 @@ test_that("spark.logit", {
   # Test binomial logistic regression against two classes with upperBoundsOnCoefficients
   # and upperBoundsOnIntercepts
   u <- matrix(c(1.0, 0.0, 1.0, 0.0), nrow = 1, ncol = 4)
-  model <- spark.logit(training, Species ~ ., upperBoundsOnCoefficients = u,
-                       upperBoundsOnIntercepts = 1.0)
+  model <- suppressWarnings(spark.logit(training, Species ~ ., upperBoundsOnCoefficients = u,
+                                        upperBoundsOnIntercepts = 1.0))
   summary <- summary(model)
   coefsR <- c(-11.13331, 1.00000, 0.00000, 1.00000, 0.00000)
   coefs <- summary$coefficients[, "Estimate"]
@@ -255,8 +255,8 @@ test_that("spark.logit", {
   # Test binomial logistic regression against two classes with lowerBoundsOnCoefficients
   # and lowerBoundsOnIntercepts
   l <- matrix(c(0.0, -1.0, 0.0, -1.0), nrow = 1, ncol = 4)
-  model <- spark.logit(training, Species ~ ., lowerBoundsOnCoefficients = l,
-                       lowerBoundsOnIntercepts = 0.0)
+  model <- suppressWarnings(spark.logit(training, Species ~ ., lowerBoundsOnCoefficients = l,
+                                        lowerBoundsOnIntercepts = 0.0))
   summary <- summary(model)
   coefsR <- c(0, 0, -1, 0, 1.902192)
   coefs <- summary$coefficients[, "Estimate"]
@@ -268,9 +268,9 @@ test_that("spark.logit", {
   # Test multinomial logistic regression with lowerBoundsOnCoefficients
   # and lowerBoundsOnIntercepts
   l <- matrix(c(0.0, -1.0, 0.0, -1.0, 0.0, -1.0, 0.0, -1.0), nrow = 2, ncol = 4)
-  model <- spark.logit(training, Species ~ ., family = "multinomial",
-                       lowerBoundsOnCoefficients = l,
-                       lowerBoundsOnIntercepts = as.array(c(0.0, 0.0)))
+  model <- suppressWarnings(spark.logit(training, Species ~ ., family = "multinomial",
+                                        lowerBoundsOnCoefficients = l,
+                                        lowerBoundsOnIntercepts = as.array(c(0.0, 0.0))))
   summary <- summary(model)
   versicolorCoefsR <- c(42.639465, 7.258104, 14.330814, 16.298243, 11.716429)
   virginicaCoefsR <- c(0.0002970796, 4.79274, 7.65047, 25.72793, 30.0021)
diff --git a/R/pkg/tests/fulltests/test_mllib_clustering.R b/R/pkg/tests/fulltests/test_mllib_clustering.R
index 028ad57..f180aee 100644
--- a/R/pkg/tests/fulltests/test_mllib_clustering.R
+++ b/R/pkg/tests/fulltests/test_mllib_clustering.R
@@ -171,7 +171,7 @@ test_that("spark.kmeans", {
   expect_equal(sort(collect(distinct(select(cluster, "prediction")))$prediction), c(0, 1))
 
   # test summary coefficients return matrix type
-  expect_true(class(summary.model$coefficients) == "matrix")
+  expect_true(any(class(summary.model$coefficients) == "matrix"))
   expect_true(class(summary.model$coefficients[1, ]) == "numeric")
 
   # Test model save/load
diff --git a/R/pkg/tests/fulltests/test_mllib_regression.R b/R/pkg/tests/fulltests/test_mllib_regression.R
index b40c4cb..929cd27 100644
--- a/R/pkg/tests/fulltests/test_mllib_regression.R
+++ b/R/pkg/tests/fulltests/test_mllib_regression.R
@@ -116,7 +116,7 @@ test_that("spark.glm summary", {
   rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = dataset))
 
   # test summary coefficients return matrix type
-  expect_true(class(stats$coefficients) == "matrix")
+  expect_true(any(class(stats$coefficients) == "matrix"))
   expect_true(class(stats$coefficients[, 1]) == "numeric")
 
   coefs <- stats$coefficients


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org