Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/11/06 09:45:04 UTC

[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22954

    [DO-NOT-MERGE][POC] Enables Arrow optimization from R DataFrame to Spark DataFrame

    ## What changes were proposed in this pull request?
    
    This PR is not meant for merging; it demonstrates the feasibility (reusing the PyArrow code path as much as possible) and the performance improvement of converting R data frames to Spark DataFrames. This can be tested as below:
    
    ```bash
    $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
    ```
    
    ```r
    collect(createDataFrame(mtcars))
    ```
    
    **Requirements:**
      - R 3.5.x 
      - Arrow R package 0.12+ (not yet released; the CRAN release is tracked in ARROW-3204)
      - withr package
    
    **TODOs:**
    - [ ] Performance measurement
    - [ ] TBD
    
    ## How was this patch tested?
    
    A small test was added.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark r-arrow-createdataframe

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22954
    
----
commit 90011a5ff48f2c5fa5fae0e2573fcdaa85d44976
Author: hyukjinkwon <gu...@...>
Date:   2018-11-06T02:38:37Z

    [POC] Enables Arrow optimization from R DataFrame to Spark DataFrame

----


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---




[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4790/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4926/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [WIP] Enables Arrow optimization from R DataFrame...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231992763
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    These workarounds will be removed when Arrow 0.12.0 is released. I did it to make the CRAN check pass.
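    
    Once that happens, the indirection should collapse into a plain namespace-qualified call; a minimal sketch, assuming the released package exports `record_batch`:
    
    ```r
    # Sketch of the post-0.12.0 shape (assumes arrow exports record_batch);
    # the get()/asNamespace() indirection above exists only to keep the
    # CRAN check quiet until the release.
    if (requireNamespace("arrow", quietly = TRUE)) {
      batch <- arrow::record_batch(mtcars)
    }
    ```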


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98696/
    Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98690/testReport)** for PR 22954 at commit [`6f28aa5`](https://github.com/apache/spark/commit/6f28aa5c79854cc3df176e0fa0fa6f3e7d21a98a).


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    @felixcheung! The performance improvement was **955%**! I described the benchmark I took in the PR description.
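    
    For context, a benchmark of this kind can be taken roughly as below; a minimal sketch assuming a local SparkR shell (the exact numbers and setup are in the PR description):
    
    ```r
    # Compare createDataFrame() elapsed time with the Arrow path off and on.
    # Assumes the conf is picked up when the session is (re)created; numbers
    # will vary with machine and data size.
    rdf <- do.call(rbind, replicate(5000, mtcars, simplify = FALSE))
    
    sparkR.session(sparkConfig =
      list("spark.sql.execution.arrow.enabled" = "false"))
    withoutArrow <- system.time(createDataFrame(rdf))["elapsed"]
    
    sparkR.session(sparkConfig =
      list("spark.sql.execution.arrow.enabled" = "true"))
    withArrow <- system.time(createDataFrame(rdf))["elapsed"]
    
    withoutArrow / withArrow  # speedup ratio
    ```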


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473697
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    This resembles PySpark side logic:
    
    https://github.com/apache/spark/blob/d367bdcf521f564d2d7066257200be26b27ea926/python/pyspark/sql/session.py#L554-L556
    
    Let me check the difference between them.
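    
    For reference, the slicing itself behaves like this standalone base R sketch:
    
    ```r
    # Split a data.frame into numPartitions roughly equal consecutive
    # slices, mirroring the logic in the diff above.
    rdf <- mtcars            # 32 rows
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    sapply(rdf_slices, nrow)  # 11 11 10
    ```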


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232172546
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    so is this an API missing in Arrow?


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98613/
    Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477171
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    ok


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98595/testReport)** for PR 22954 at commit [`8813192`](https://github.com/apache/spark/commit/881319298b844d934b1eb02994d7f1a85274efaa).


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98613/testReport)** for PR 22954 at commit [`2ddbd69`](https://github.com/apache/spark/commit/2ddbd694663b7ff0f5b3a8099ba87e4800397a57).


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473761
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    I suspect that it happens when a `numeric` (which is like `1.0`) is cast into a float type. I think it's related to the casting behaviour. Let me take a look and file a JIRA on the Arrow side, but if you don't mind, I will focus on matching exact type cases for now ...


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232895848
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    --- End diff --
    
    Yup, correct. Let me address other comments as well.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98508/testReport)** for PR 22954 at commit [`90011a5`](https://github.com/apache/spark/commit/90011a5ff48f2c5fa5fae0e2573fcdaa85d44976).


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98696/testReport)** for PR 22954 at commit [`b2e0fc2`](https://github.com/apache/spark/commit/b2e0fc2d9e5e18b334b0177157ebd282b654c0a0).


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232620582
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create an `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    +  }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `DataFrame`
    --- End diff --
    
    Is this going to read a file in Arrow stream format?


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473669
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    --- End diff --
    
    We should; however, it follows the original code path's behaviour. I kept it the same so that we can compare performance under the same conditions. If you don't mind, I will fix both in a separate PR.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98697/
    Test FAILed.


---



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231814143
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    Hmhmhmhm, for checking the Arrow version itself it will be, but I think it's fine. The R API for Arrow will likely start from 0.12.0, and we will support 0.12.0+ from now on.
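    
    A version guard could then be as small as this sketch (using base R's `utils::packageVersion`):
    
    ```r
    # Sketch: fail fast unless the installed arrow package is 0.12.0+.
    if (requireNamespace("arrow", quietly = TRUE) &&
        utils::packageVersion("arrow") < "0.12.0") {
      stop("Arrow optimization requires the 'arrow' package 0.12.0 or higher.")
    }
    ```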


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167634
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    require1 sounds a bit like a hack though...


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477257
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    LG, I tested a few cases too


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477131
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    hmm, I see your point... but there could be hundreds of these wrappers we'd need to add if we set this as a practice, I'm guessing.
    
    a few problems with these wrappers I see:
    1. they are extra work to add and maintain
    2. many are very simple, with not much value added
    3. many get abandoned over the years - they are not called and not removed
    
    but I kinda see your way; let's keep this one and review any new ones in the future.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98692/
    Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    > @BryanCutler BTW, do you know the rough expected timing for Arrow 0.12.0 release?
    
    I think we should be starting the release process soon, so maybe in a week or two.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r233292436
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    I think it's a bug because it always produces a corrupted value when I try to use a `numeric` with an explicit float type.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98510/testReport)** for PR 22954 at commit [`46eaeca`](https://github.com/apache/spark/commit/46eaeca5854281a5688badbe71981029096674a5).


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98595/testReport)** for PR 22954 at commit [`8813192`](https://github.com/apache/spark/commit/881319298b844d934b1eb02994d7f1a85274efaa).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4870/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232176721
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    +          })
    +          TRUE
    +        },
    +        error = function(e) {
    +          message(paste0("WARN: createDataFrame attempted Arrow optimization because ",
    --- End diff --
    
    ? https://stat.ethz.ch/R-manual/R-devel/library/base/html/warning.html
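    
    i.e. the fallback could go through R's warning mechanism; a sketch of the suggested shape:
    
    ```r
    # Raise a proper R warning (rather than message()) when falling back
    # from the Arrow path; conditionMessage(e) carries the reason.
    shouldUseArrow <- tryCatch({
      stop("Arrow optimization with R DataFrame does not support raw type yet.")
    },
    error = function(e) {
      warning(paste0("createDataFrame attempted Arrow optimization because ",
                     "'spark.sql.execution.arrow.enabled' is set to true; ",
                     "falling back to without Arrow optimization. Reason: ",
                     conditionMessage(e)))
      FALSE
    })
    ```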


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98699/
    Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232525184
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    --- End diff --
    
    Maybe we should hold it for now ... because I realised the R API for Arrow requires R 3.5.x, and Jenkins's is 3.1.x if I remember correctly. Ideally, we could probably do that via AppVeyor if everything goes fine.
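    
    If we do keep it, the R version could be gated in the test too; a sketch using testthat's `skip_if_not` (the assertion shown is only illustrative):
    
    ```r
    test_that("createDataFrame Arrow optimization", {
      skip_if_not_installed("arrow")
      skip_if_not_installed("withr")
      # Arrow's R API needs R 3.5+; skip on older interpreters like Jenkins'.
      skip_if_not(getRversion() >= "3.5.0", "R 3.5+ required for arrow")
      df <- createDataFrame(mtcars)
      expect_equal(count(df), nrow(mtcars))
    })
    ```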


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477325
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    maybe out of bit range? doubles carry only a 53-bit significand: https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html
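    
    a quick base R way to see the precision loss (a 32-bit float keeps only a 24-bit significand):
    
    ```r
    # Round-trip a double through 32-bit float storage; the value comes
    # back perturbed, which matches the "corrupt value" symptom.
    asFloat32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L)
    print(asFloat32(1.1), digits = 17)  # 1.1000000238418579
    ```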



---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4924/
    Test PASSed.


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98512/testReport)** for PR 22954 at commit [`614170e`](https://github.com/apache/spark/commit/614170e2187ebb904a767adb9289ac48300e533c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473705
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    Hm, possibly, yeah. Let me try it. That way, we could get rid of `require`.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    So far, the regression tests pass, and the newly added test for the R Arrow optimization is verified locally. Let me fix the CRAN test and some nits.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232173367
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    can you check - I think `is`/`is.x` doesn't do the right thing when
    
    you take head(df, 1) and one of the fields is `NA`
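    
    e.g. a quick base R check of the concern:
    
    ```r
    library(methods)  # for is()
    
    # A column that is entirely NA is logical, so a class check on
    # head(df, 1) cannot tell it was meant to be POSIXct.
    df <- data.frame(t = NA)
    is(head(df, 1)$t, "POSIXct")   # FALSE
    
    # An NA inside a POSIXct column keeps its class, so this case is fine.
    df2 <- data.frame(t = as.POSIXct(c(NA, "2018-11-06 09:45:04")))
    is(head(df2, 1)$t, "POSIXct")  # TRUE
    ```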


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98699/testReport)** for PR 22954 at commit [`c3f47ce`](https://github.com/apache/spark/commit/c3f47ce04c82f6fb6efe6d55baa19ddc8cf50dd3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98760/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98595/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232171176
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    how is this slices computed? is it similar to https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152
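    
    For reference, on a small made-up input the expression slices like this (a
    standalone sketch; `rdf` and `numPartitions` are dummy values):
    
    ```r
    rdf <- data.frame(x = 1:10)                               # hypothetical input
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))   # 4 rows per slice
    # rep() assigns a slice id to every row; [1:nrow(rdf)] trims the overshoot.
    ids <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]
    ids                                                       # 1 1 1 1 2 2 2 2 3 3
    rdf_slices <- split(rdf, ids)                             # 3 slices: 4, 4 and 2 rows
    ```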


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477271
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currenty Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    yes, just more consistent. I also don't know for sure why all other instances are calling unlink


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232170936
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    `1 : ceiling`? `1 : nrow`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473723
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    --- End diff --
    
    Yup, let me try.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477365
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    that will be good. Circumventing the CRAN check for a method name is... problematic..
    (there are other, more hacky ways too, but ..)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4789/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167926
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    what's the advantage of adding this wrapper here - I've been thinking of eliminating most of these if possible - and just using callJMethod on `ArrowConverters`, say?
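    
    For illustration, the direct call would look roughly like this (a sketch; it
    assumes the backend can resolve the `ArrowConverters` object via reflection and
    that a `sqlContext` handle is available on the R side, which is part of what
    the wrapper derives from the session):
    
    ```r
    # Hypothetical direct call, bypassing the SQLUtils wrapper.
    jrdd <- callJStatic("org.apache.spark.sql.execution.arrow.ArrowConverters",
                        "readArrowStreamFromFile",
                        sqlContext,
                        fileName)
    ```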


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4860/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231815154
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    hmmm .. yea, I will add the version check later when the R API of Arrow is officially released to CRAN.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98614/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98603/testReport)** for PR 22954 at commit [`7be15d3`](https://github.com/apache/spark/commit/7be15d36989ba1a00a86bb5eeb0b3d6fb7da3c4c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232202773
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    +          })
    +          TRUE
    +        },
    +        error = function(e) {
    +          message(paste0("WARN: createDataFrame attempted Arrow optimization because ",
    --- End diff --
    
    ooops


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98614 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98614/testReport)** for PR 22954 at commit [`0903736`](https://github.com/apache/spark/commit/0903736b46e1ff737577f7db3f29db67795701ee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98508/testReport)** for PR 22954 at commit [`90011a5`](https://github.com/apache/spark/commit/90011a5ff48f2c5fa5fae0e2573fcdaa85d44976).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98713/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4858/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4793/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4925/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98514/testReport)** for PR 22954 at commit [`b15d79c`](https://github.com/apache/spark/commit/b15d79c0d734d04a783470d793cf52e5c6c99b9d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473611
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    Yup .. it is .. from a cursory look at the R API of Arrow, it looks like we shouldn't have this error. Maybe it will be gone when I use the official R Arrow release version. Let me check it later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98692/testReport)** for PR 22954 at commit [`a81134e`](https://github.com/apache/spark/commit/a81134efe787bbd3b7a5b92d03b91e5f4e2f3cf6).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232618853
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    +        stopifnot(length(data) > 0)
    +        dataHead <- head(data, 1)
    +        checkTypeRequirementForArrow(data, schema)
    +        fileName <- writeToTempFileInArrow(data, numPartitions)
    --- End diff --
    
    Should we move `writeToTempFileInArrow` call into next `tryCatch`?
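    
    For illustration, one way to restructure it (a sketch, not the final patch):
    
    ```r
    fileName <- NULL
    tryCatch({
      # Creating the temp file inside the same tryCatch means a failure in
      # writeToTempFileInArrow is caught by the same fallback path, and the
      # file is removed even when the JVM call below throws.
      fileName <- writeToTempFileInArrow(data, numPartitions)
      jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                                 "readArrowStreamFromFile",
                                 sparkSession,
                                 fileName)
    }, finally = {
      if (!is.null(fileName)) unlink(fileName)
    })
    ```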


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475881
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    Looks okay in this case specifically:
    
    ```r
    > any(sapply(head(data.frame(list(list(a=NA))), 1), is.raw))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=NA))), 1), function(x) is(x, "POSIXct")))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=1))), 1), is.raw))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a="a"))), 1), function(x) is(x, "POSIXct")))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=raw(1)))), 1), is.raw))
    [1] TRUE
    > any(sapply(head(data.frame(list(list(a=as.POSIXct("2000-01-01")))), 1), function(x) is(x, "POSIXct")))
    [1] TRUE
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499902
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,91 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  requireNamespace1 <- requireNamespace
    +
    +  # For some reason, the Arrow R API requires 'defer_parent' from the 'withr' package to be
    +  # loadable. This is a workaround to avoid that error; otherwise we would have to load the
    +  # 'withr' package directly, which CRAN complains about.
    +  defer_parent <- function(x, ...) {
    +    if (requireNamespace1("withr", quietly = TRUE)) {
    +      defer_parent <- get("defer_parent", envir = asNamespace("withr"), inherits = FALSE)
    +      defer_parent(x, ...)
    +    } else {
    +      stop("'withr' package should be installed.")
    +    }
    +  }
    +
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    numPartitions <- if (!is.null(numPartitions)) {
    +      numToInt(numPartitions)
    +    } else {
    +      1
    --- End diff --
    
    future: consolidate the default here and inside makeSplits


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98755/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4859/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Yea .. I will do the follow-up work right away after this one gets merged. Thanks @felixcheung. Let me address the rest of the comments, and wait for the Arrow release.
    
    @BryanCutler BTW, do you know the rough expected timing for Arrow 0.12.0 release?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Let me hide some comments that are addressed (it looks messy). Please unhide them if I mistakenly hid comments that are not addressed yet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4967/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98514/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231847272
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }
     
       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --
    
    Note that this PR optimizes the original code path as well - when the input is a local R DataFrame, we avoid the `firstRDD` operation.
    
    In the master branch, the benchmark shows:
    
    ```
    Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space
    	at java.util.Arrays.copyOf(Arrays.java:3236)
    	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    ```
    
    So, technically this PR improves robustness here as well.
    
    If I try this with a `100000.csv` (79MB) file instead, to cut it short:
    
    **Current master**:
    
    ```
    Time difference of 8.502607 secs
    ```
    
    **With this PR, but without Arrow**
    
    ```
    Time difference of 5.143395 secs
    ```
    
    **With this PR, but with Arrow**
    
    
    ```
    Time difference of 0.6981369 secs
    ```
    
    So, technically this PR improves performance by more than **1200%**.
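    
    The timings above can be reproduced along these lines (a sketch of the
    measurement, not the exact script; `100000.csv` is a local file and not
    included here):
    
    ```r
    rdf <- read.csv("100000.csv")   # ~79MB local R data.frame
    startTime <- Sys.time()
    df <- createDataFrame(rdf)
    print(Sys.time() - startTime)   # prints the "Time difference of ..." lines above
    ```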


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402297
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,15 +196,17 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- tolower(callJMethod(conf, "get", "spark.sql.execution.arrow.enabled")) == "true"
    --- End diff --
    
    I think you can use sparkR.conf
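    
    e.g. (this matches what a later revision of the diff does):
    
    ```r
    # sparkR.conf(key) returns a named list; take the first element as the value.
    arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    ```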


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232476906
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    Let me try to reuse the R side slicing logic.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4928/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232453690
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    --- End diff --
    
    nit: `arrow` -> `Arrow`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4922/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499848
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    --- End diff --
    
    are we going to ask shane to install arrow/withr on the Jenkins machines?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98510/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402726
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    +  numPartitions <- if (!is.null(numPartitions)) {
    +    numToInt(numPartitions)
    +  } else {
    +    1
    +  }
    +  fileName <- tempfile()
    --- End diff --
    
    might need to give it a dir prefix to use - the tempfile default is not CRAN compliant, and there could be ACL issues
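    
    e.g. something like (a sketch; `sparkRTempDir` is a hypothetical SparkR-managed
    directory, not an existing variable):
    
    ```r
    fileName <- tempfile(pattern = "sparkr-arrow", tmpdir = sparkRTempDir, fileext = ".tmp")
    ```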


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232176774
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    unlink?
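    
    For comparison (both remove the file; the difference is mostly in failure
    behavior):
    
    ```r
    file.remove(fileName)   # returns FALSE with a warning if removal fails
    unlink(fileName)        # returns 0/1 and stays silent on a missing file
    ```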


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232500065
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    +        stopifnot(length(data) > 0)
    +        dataHead <- head(data, 1)
    +        checkTypeRequirementForArrow(data, schema)
    +        fileName <- writeToTempFileInArrow(data, numPartitions)
    +        tryCatch(
    +          jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                     "readArrowStreamFromFile",
    +                                     sparkSession,
    +                                     fileName),
    +        finally = {
    +          file.remove(fileName)
    +        })
    +
    +        firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +        TRUE
    +      },
    +      error = function(e) {
    +        warning(paste0("createDataFrame attempted Arrow optimization because ",
    +                       "'spark.sql.execution.arrow.enabled' is set to true; however, ",
    +                       "failed, attempting non-optimization. Reason: ",
    +                       e))
    +        return(FALSE)
    --- End diff --
    
    nit: just `FALSE` is good
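
    (In R the last evaluated expression of a function is its value, so the explicit `return()` is redundant; a quick illustration:)
    
    ```r
    handler <- function(e) {
      warning(paste0("falling back: ", e))
      FALSE  # equivalent to return(FALSE): the last expression is the value
    }
    ```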


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98614/testReport)** for PR 22954 at commit [`0903736`](https://github.com/apache/spark/commit/0903736b46e1ff737577f7db3f29db67795701ee).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98512/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98697/testReport)** for PR 22954 at commit [`92419d0`](https://github.com/apache/spark/commit/92419d055c9f9eb4cbe1172158863866b4401b4b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4846/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98696/testReport)** for PR 22954 at commit [`b2e0fc2`](https://github.com/apache/spark/commit/b2e0fc2d9e5e18b334b0177157ebd282b654c0a0).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    For encryption stuff, I will try to handle that as well (maybe as a follow-up?) so that we support it even when that's enabled.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98508/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232425279
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    Any idea what's going on with the `FloatType`? Is it a problem on the arrow side?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Adding @yanghaogn as well fyi


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98603/testReport)** for PR 22954 at commit [`7be15d3`](https://github.com/apache/spark/commit/7be15d36989ba1a00a86bb5eeb0b3d6fb7da3c4c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232525068
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    +
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]]
    +
    +  callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false")
    +  tryCatch({
    --- End diff --
    
    Just to inject the finally .. :-) ..
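
    For readers skimming the archive, a minimal sketch of the pattern (assuming the same `conf` and `arrowEnabled` variables as in the quoted test):
    
    ```r
    tryCatch({
      # test body that may fail
      expect_true(is(createDataFrame(mtcars), "SparkDataFrame"))
    }, finally = {
      # always restore the original conf value, whether the body passed or failed
      callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", arrowEnabled)
    })
    ```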


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98615/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Thanks, @felixcheung. I will address those comments during cleaning up.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98512/testReport)** for PR 22954 at commit [`614170e`](https://github.com/apache/spark/commit/614170e2187ebb904a767adb9289ac48300e533c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Let me leave a cc @felixcheung, @BryanCutler FYI.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98755/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98514/testReport)** for PR 22954 at commit [`b15d79c`](https://github.com/apache/spark/commit/b15d79c0d734d04a783470d793cf52e5c6c99b9d).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4850/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475752
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    Yea, I at least managed to get rid of this hack itself. Will push soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98628/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98760/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98615/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    I have finished most of the TODOs, except for waiting on the R API of Arrow 0.12.0 and making some changes accordingly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232173043
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    is it always in the conf - I think you can also pass in a default value to sparkR.conf
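
    A one-line sketch of the suggestion, using `sparkR.conf`'s optional default-value argument:
    
    ```r
    # read the conf with an explicit default so the lookup cannot fail when unset
    arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled", "false")[[1]] == "true"
    ```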


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98628/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r233209385
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    Oh if it's only when casting to a float, then maybe not that big of an issue. I just wanted to make sure a bug was filed for Arrow if the problem is there.
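
    One way to see why R is awkward here: R has no single-precision vector type, so every numeric is a 64-bit double, and a declared `FloatType` cannot round-trip exactly. A quick illustration in plain R:
    
    ```r
    typeof(1.5)            # "double" - R numerics are always double precision
    is.double(c(1, 2) / 3) # TRUE; there is no native float32 vector type in R
    ```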


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232177132
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    --- End diff --
    
    refactor this into a method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98692/testReport)** for PR 22954 at commit [`a81134e`](https://github.com/apache/spark/commit/a81134efe787bbd3b7a5b92d03b91e5f4e2f3cf6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473716
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    Yea, I checked that it always has the default value. I initially left the default value in but took it out after double-checking.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    adding @falaki and @mengxr as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232172687
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    actually, I think if you use withr here it will call close_on_exit for you? (but when?)
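
    To the "(but when?)" question, a hedged sketch of withr's timing semantics, using the public `withr::defer()` rather than arrow's private helper:
    
    ```r
    write_slice <- function(path, bytes) {
      con <- file(path, "wb")
      # defer() runs its expression when this function's frame exits,
      # not at garbage-collection time
      withr::defer(close(con))
      writeBin(bytes, con)
    }
    ```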


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98510/testReport)** for PR 22954 at commit [`46eaeca`](https://github.com/apache/spark/commit/46eaeca5854281a5688badbe71981029096674a5).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98697/testReport)** for PR 22954 at commit [`92419d0`](https://github.com/apache/spark/commit/92419d055c9f9eb4cbe1172158863866b4401b4b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4972/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98690 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98690/testReport)** for PR 22954 at commit [`6f28aa5`](https://github.com/apache/spark/commit/6f28aa5c79854cc3df176e0fa0fa6f3e7d21a98a).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167480
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    ok


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232425031
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    --- End diff --
    
    nit: a `RDD` -> an `RDD`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499794
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    +
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]]
    +
    +  callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false")
    +  tryCatch({
    --- End diff --
    
    does this fail? or just a way to inject a finally?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402235
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    btw, is it worthwhile to check the arrow package version?
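
    A small sketch of such a guard (0.12.0 matches the PR description's requirement; the exact floor is an assumption here):
    
    ```r
    if (utils::packageVersion("arrow") < "0.12.0") {
      stop("Arrow optimization requires the 'arrow' R package, version 0.12.0 or later.")
    }
    ```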


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231401994
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    perhaps best to add a clearer error message?
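
    For instance, something along these lines (wording illustrative, not from the PR):
    
    ```r
    if (!requireNamespace("arrow", quietly = TRUE)) {
      stop("The 'arrow' package must be installed to use Arrow optimization ",
           "in createDataFrame() with an R data.frame.")
    }
    ```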


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98628/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Hm .. the CRAN check passed locally for me. Let me work around it for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231414561
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    --- End diff --
    
    It's actually dependent on Arrow's API calls .. Those APIs were only added a few weeks ago. I will clean this up when the R API of Arrow is close to its release (at 0.12.0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167110
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }
     
       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --
    
    I <3 4 digits!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98603/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473643
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    Hmhmhm .. yea. What I was trying to do is to add the SQL-related code that R calls into the JVM here when it is not an official API, so that changing the internal APIs within Scala does not cause R test failures. I was trying to do the same thing on the PySpark side.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    I don't know R well enough to review that code, but the results look awesome! Nice work @HyukjinKwon!!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98690/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4792/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98760/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402063
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    --- End diff --
    
    is it possible to not depend on this withr? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477155
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    ah, any idea why that was done that way in python? this seems to be different from sc.parallelize, which is what https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152 does
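
    As a side note for readers, the chunking arithmetic in the quoted hunk works out like this (worked example: 5 rows into 2 partitions):
    
    ```r
    rdf <- data.frame(x = 1:5)
    numPartitions <- 2
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))                 # 3
    groups <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]  # 1 1 1 2 2
    rdf_slices <- split(rdf, groups)  # rows 1-3 in slice 1, rows 4-5 in slice 2
    ```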


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475777
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    I believe either way is fine.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Hey guys thanks for reviewing! Will address them soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98615/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98699/testReport)** for PR 22954 at commit [`c3f47ce`](https://github.com/apache/spark/commit/c3f47ce04c82f6fb6efe6d55baa19ddc8cf50dd3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232169938
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    --- End diff --
    
    should this default to the context's default parallelism?
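    
    For illustration, a sketch of what that fallback might look like, assuming the `getSparkContext`/`callJMethod`/`numToInt` helpers used elsewhere in SparkR (treat the exact call as an assumption):
    
    ```r
    # Sketch only: fall back to the context's default parallelism instead of 1.
    numPartitions <- if (!is.null(numPartitions)) {
      numToInt(numPartitions)
    } else {
      sc <- getSparkContext()
      # defaultParallelism() on the JVM-side context (assumed reachable here)
      numToInt(callJMethod(sc, "defaultParallelism"))
    }
    ```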


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232115966
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    It's actually a bit odd that I need to require this package manually; otherwise it complains, for instance, here: https://github.com/apache/arrow/blob/d3ec69069649013229366ebe01e22f389597dc19/r/R/on_exit.R#L20


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4939/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232478997
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    Not sure. I think the intention is the same either way. Let me stick to the R idiom for now.
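    
    As a quick sanity check that the slice sizes come out the same as PySpark's ceil-division step (e.g. 10 rows over 3 partitions):
    
    ```r
    n <- 10; numPartitions <- 3
    chunk <- as.integer(ceiling(n / numPartitions))   # 4, same as -(-n // slices) in Python
    ids <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
    table(ids)                                        # 4 4 2
    ```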


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232619364
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    --- End diff --
    
    When `shouldUseArrow` is true, I think it means we have already finished using Arrow? Maybe just `useArrow`?
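    
    For context, the flag names the result of a tryCatch that attempts the whole Arrow path and falls back on any error — roughly like this simplified sketch (not the PR's exact code):
    
    ```r
    useArrow <- tryCatch({
      # ... attempt the Arrow conversion end-to-end ...
      TRUE                              # reached only if every step succeeded
    }, error = function(e) {
      warning("Arrow optimization failed; falling back: ", conditionMessage(e))
      FALSE
    })
    if (!useArrow) {
      # ... original non-Arrow createDataFrame path ...
    }
    ```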


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98755/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org