Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/11/06 09:45:04 UTC

[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22954

    [DO-NOT-MERGE][POC] Enables Arrow optimization from R DataFrame to Spark DataFrame

    ## What changes were proposed in this pull request?
    
    This PR is not meant for merging; it demonstrates the feasibility (reusing the PyArrow code path as much as possible) and the performance improvement of converting R data frames to Spark DataFrames. This can be tested as below:
    
    ```bash
    $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
    ```
    
    ```r
    collect(createDataFrame(mtcars))
    ```
    
    **Requirements:**
      - R 3.5.x 
      - Arrow R package 0.12+ (not yet released; the CRAN release is tracked in ARROW-3204)
      - withr package
    
    **TODOs:**
    - [ ] Performance measurement
    - [ ] TBD
    
    ## How was this patch tested?
    
    A small test was added.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark r-arrow-createdataframe

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22954
    
----
commit 90011a5ff48f2c5fa5fae0e2573fcdaa85d44976
Author: hyukjinkwon <gu...@...>
Date:   2018-11-06T02:38:37Z

    [POC] Enables Arrow optimization from R DataFrame to Spark DataFrame

----


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---




[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4790/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4926/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [WIP] Enables Arrow optimization from R DataFrame...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231992763
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    These workarounds will be removed when Arrow 0.12.0 is released. I did it to make the CRAN check pass.
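    
    Once that happens, the indirection should collapse into a plain namespace-qualified call; a minimal sketch, assuming the released package exports `record_batch`:
    
    ```r
    # Sketch of the post-0.12.0 shape (assumes arrow exports record_batch);
    # the get()/asNamespace() indirection above exists only to keep the
    # CRAN check quiet until the release.
    if (requireNamespace("arrow", quietly = TRUE)) {
      batch <- arrow::record_batch(mtcars)
    }
    ```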


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98696/
    Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98690/testReport)** for PR 22954 at commit [`6f28aa5`](https://github.com/apache/spark/commit/6f28aa5c79854cc3df176e0fa0fa6f3e7d21a98a).


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    @felixcheung! The performance improvement was **955%**! I described the benchmark I took in the PR description.
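    
    For context, a benchmark of this kind can be taken roughly as below; a minimal sketch assuming a local SparkR shell (the exact numbers and setup are in the PR description):
    
    ```r
    # Compare createDataFrame() elapsed time with the Arrow path off and on.
    # Assumes the conf is picked up when the session is (re)created; numbers
    # will vary with machine and data size.
    rdf <- do.call(rbind, replicate(5000, mtcars, simplify = FALSE))
    
    sparkR.session(sparkConfig =
      list("spark.sql.execution.arrow.enabled" = "false"))
    withoutArrow <- system.time(createDataFrame(rdf))["elapsed"]
    
    sparkR.session(sparkConfig =
      list("spark.sql.execution.arrow.enabled" = "true"))
    withArrow <- system.time(createDataFrame(rdf))["elapsed"]
    
    withoutArrow / withArrow  # speedup ratio
    ```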


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473697
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    This resembles PySpark side logic:
    
    https://github.com/apache/spark/blob/d367bdcf521f564d2d7066257200be26b27ea926/python/pyspark/sql/session.py#L554-L556
    
    Let me check the difference between them.
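    
    For reference, the slicing itself behaves like this standalone base R sketch:
    
    ```r
    # Split a data.frame into numPartitions roughly equal consecutive
    # slices, mirroring the logic in the diff above.
    rdf <- mtcars            # 32 rows
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    sapply(rdf_slices, nrow)  # 11 11 10
    ```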


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232172546
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    so is this an API missing in Arrow?


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98613/
    Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477171
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    ok


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98595/testReport)** for PR 22954 at commit [`8813192`](https://github.com/apache/spark/commit/881319298b844d934b1eb02994d7f1a85274efaa).


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98613/testReport)** for PR 22954 at commit [`2ddbd69`](https://github.com/apache/spark/commit/2ddbd694663b7ff0f5b3a8099ba87e4800397a57).


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473761
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    I suspect that it happens when a `numeric` (which is like `1.0`) is cast into a float type. I think it's related to the casting behaviour. Let me take a look and file a JIRA on the Arrow side, but if you don't mind, I will focus on matching exact type cases for now ...


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232895848
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    --- End diff --
    
    Yup, correct. Let me address other comments as well.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98508/testReport)** for PR 22954 at commit [`90011a5`](https://github.com/apache/spark/commit/90011a5ff48f2c5fa5fae0e2573fcdaa85d44976).


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98696/testReport)** for PR 22954 at commit [`b2e0fc2`](https://github.com/apache/spark/commit/b2e0fc2d9e5e18b334b0177157ebd282b654c0a0).


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232620582
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create an `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    +  }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `DataFrame`
    --- End diff --
    
    Is this going to read a file in Arrow stream format?


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473669
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    --- End diff --
    
    We should; however, it follows the original code path's behaviour. I kept it the same so that we can compare performance under the same conditions. If you don't mind, I will fix both in a separate PR.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98697/
    Test FAILed.


---



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231814143
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    Hmhmhmhm, for checking the Arrow version itself it will be, but I think it's fine. The R API for Arrow will likely start from 0.12.0, and we will support 0.12.0+ from now on.
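    
    A version guard could then be as small as this sketch (using base R's `utils::packageVersion`):
    
    ```r
    # Sketch: fail fast unless the installed arrow package is 0.12.0+.
    if (requireNamespace("arrow", quietly = TRUE) &&
        utils::packageVersion("arrow") < "0.12.0") {
      stop("Arrow optimization requires the 'arrow' package 0.12.0 or higher.")
    }
    ```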


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167634
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    require1 sounds a bit like a hack though...


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477257
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    LG, I tested a few cases too


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477131
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    hmm, I see your point... but there could be hundreds of these wrappers we'd need to add if we set this as a practice, I'm guessing.
    
    a few problems with these wrappers I see:
    1. they are extra work to add and maintain
    2. many are very simple, with not much value added
    3. many get abandoned over the years - they are not called and not removed
    
    but I kinda see your way; let's keep this one and review any new ones in the future.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98692/
    Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    > @BryanCutler BTW, do you know the rough expected timing for Arrow 0.12.0 release?
    
    I think we should be starting the release process soon, so maybe in a week or two.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r233292436
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    I think it's a bug because it always produces a corrupted value when I try to use a `numeric` with an explicit float type.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98510/testReport)** for PR 22954 at commit [`46eaeca`](https://github.com/apache/spark/commit/46eaeca5854281a5688badbe71981029096674a5).


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98595/testReport)** for PR 22954 at commit [`8813192`](https://github.com/apache/spark/commit/881319298b844d934b1eb02994d7f1a85274efaa).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4870/
    Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232176721
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    +          })
    +          TRUE
    +        },
    +        error = function(e) {
    +          message(paste0("WARN: createDataFrame attempted Arrow optimization because ",
    --- End diff --
    
    ? https://stat.ethz.ch/R-manual/R-devel/library/base/html/warning.html
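    
    i.e. the fallback could go through R's warning mechanism; a sketch of the suggested shape:
    
    ```r
    # Raise a proper R warning (rather than message()) when falling back
    # from the Arrow path; conditionMessage(e) carries the reason.
    shouldUseArrow <- tryCatch({
      stop("Arrow optimization with R DataFrame does not support raw type yet.")
    },
    error = function(e) {
      warning(paste0("createDataFrame attempted Arrow optimization because ",
                     "'spark.sql.execution.arrow.enabled' is set to true; ",
                     "falling back to without Arrow optimization. Reason: ",
                     conditionMessage(e)))
      FALSE
    })
    ```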


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98699/
    Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232525184
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    --- End diff --
    
    Maybe we should hold it for now ... because I realised the R API for Arrow requires R 3.5.x, and Jenkins's is 3.1.x if I remember correctly. Ideally, we could probably do that via AppVeyor if everything goes fine.
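    
    If we do keep it, the R version could be gated in the test too; a sketch using testthat's `skip_if_not` (the assertion shown is only illustrative):
    
    ```r
    test_that("createDataFrame Arrow optimization", {
      skip_if_not_installed("arrow")
      skip_if_not_installed("withr")
      # Arrow's R API needs R 3.5+; skip on older interpreters like Jenkins'.
      skip_if_not(getRversion() >= "3.5.0", "R 3.5+ required for arrow")
      df <- createDataFrame(mtcars)
      expect_equal(count(df), nrow(mtcars))
    })
    ```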


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477325
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    maybe out of bit range? doubles carry only a 53-bit significand: https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html
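    
    a quick base R way to see the precision loss (a 32-bit float keeps only a 24-bit significand):
    
    ```r
    # Round-trip a double through 32-bit float storage; the value comes
    # back perturbed, which matches the "corrupt value" symptom.
    asFloat32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L)
    print(asFloat32(1.1), digits = 17)  # 1.1000000238418579
    ```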



---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4924/
    Test PASSed.


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98512/testReport)** for PR 22954 at commit [`614170e`](https://github.com/apache/spark/commit/614170e2187ebb904a767adb9289ac48300e533c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473705
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    Hm, possibly, yeah. Let me try it. That way, we could get rid of `require`.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    So far, the regression tests pass, and the newly added test for the R Arrow optimization is verified locally. Let me fix the CRAN test and some nits.


---



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232173367
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    can you check - I think `is`/`is.x` doesn't do the right thing when
    
    you take head(df, 1) and one of the fields is `NA`
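    
    e.g. a quick base R check of the concern:
    
    ```r
    library(methods)  # for is()
    
    # A column that is entirely NA is logical, so a class check on
    # head(df, 1) cannot tell it was meant to be POSIXct.
    df <- data.frame(t = NA)
    is(head(df, 1)$t, "POSIXct")   # FALSE
    
    # An NA inside a POSIXct column keeps its class, so this case is fine.
    df2 <- data.frame(t = as.POSIXct(c(NA, "2018-11-06 09:45:04")))
    is(head(df2, 1)$t, "POSIXct")  # TRUE
    ```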


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98699/testReport)** for PR 22954 at commit [`c3f47ce`](https://github.com/apache/spark/commit/c3f47ce04c82f6fb6efe6d55baa19ddc8cf50dd3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98760/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98595/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232171176
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    how is this slices computed? is it similar to https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152
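    
    For reference, on a small made-up input the expression slices like this (a
    standalone sketch; `rdf` and `numPartitions` are dummy values):
    
    ```r
    rdf <- data.frame(x = 1:10)                               # hypothetical input
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))   # 4 rows per slice
    # rep() assigns a slice id to every row; [1:nrow(rdf)] trims the overshoot.
    ids <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]
    ids                                                       # 1 1 1 1 2 2 2 2 3 3
    rdf_slices <- split(rdf, ids)                             # 3 slices: 4, 4 and 2 rows
    ```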


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477271
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currenty Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    yes, just more consistent. I also don't know for sure why all other instances are calling unlink


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232170936
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    `1 : ceiling`? `1 : nrow`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473723
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    --- End diff --
    
    Yup, let me try.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477365
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    that will be good. Circumventing the CRAN check for a method name is... problematic..
    (there are other, more hacky ways too, but ..)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4789/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167926
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    what's the advantage of adding this wrapper here - I've been thinking of eliminating most of these if possible - and just using callJMethod on `ArrowConverters`, say?
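    
    For illustration, the direct call would look roughly like this (a sketch; it
    assumes the backend can resolve the `ArrowConverters` object via reflection and
    that a `sqlContext` handle is available on the R side, which is part of what
    the wrapper derives from the session):
    
    ```r
    # Hypothetical direct call, bypassing the SQLUtils wrapper.
    jrdd <- callJStatic("org.apache.spark.sql.execution.arrow.ArrowConverters",
                        "readArrowStreamFromFile",
                        sqlContext,
                        fileName)
    ```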


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4860/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231815154
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    hmmm .. yea, I will add the version check later when the R API of Arrow is officially released to CRAN.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98614/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98603/testReport)** for PR 22954 at commit [`7be15d3`](https://github.com/apache/spark/commit/7be15d36989ba1a00a86bb5eeb0b3d6fb7da3c4c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232202773
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    +          })
    +          TRUE
    +        },
    +        error = function(e) {
    +          message(paste0("WARN: createDataFrame attempted Arrow optimization because ",
    --- End diff --
    
    ooops


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98614 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98614/testReport)** for PR 22954 at commit [`0903736`](https://github.com/apache/spark/commit/0903736b46e1ff737577f7db3f29db67795701ee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98508/testReport)** for PR 22954 at commit [`90011a5`](https://github.com/apache/spark/commit/90011a5ff48f2c5fa5fae0e2573fcdaa85d44976).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98713/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4858/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4793/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4925/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98514/testReport)** for PR 22954 at commit [`b15d79c`](https://github.com/apache/spark/commit/b15d79c0d734d04a783470d793cf52e5c6c99b9d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473611
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    Yup .. it is .. from a cursory look at the R API of Arrow, it looks like we shouldn't have this error. Maybe it will be gone when I use the official R Arrow release version. Let me check it later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98692/testReport)** for PR 22954 at commit [`a81134e`](https://github.com/apache/spark/commit/a81134efe787bbd3b7a5b92d03b91e5f4e2f3cf6).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232618853
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    +        stopifnot(length(data) > 0)
    +        dataHead <- head(data, 1)
    +        checkTypeRequirementForArrow(data, schema)
    +        fileName <- writeToTempFileInArrow(data, numPartitions)
    --- End diff --
    
    Should we move `writeToTempFileInArrow` call into next `tryCatch`?
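    
    For illustration, one way to restructure it (a sketch, not the final patch):
    
    ```r
    fileName <- NULL
    tryCatch({
      # Creating the temp file inside the same tryCatch means a failure in
      # writeToTempFileInArrow is caught by the same fallback path, and the
      # file is removed even when the JVM call below throws.
      fileName <- writeToTempFileInArrow(data, numPartitions)
      jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                                 "readArrowStreamFromFile",
                                 sparkSession,
                                 fileName)
    }, finally = {
      if (!is.null(fileName)) unlink(fileName)
    })
    ```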


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475881
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    --- End diff --
    
    Looks okay in this case specifically:
    
    ```r
    > any(sapply(head(data.frame(list(list(a=NA))), 1), is.raw))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=NA))), 1), function(x) is(x, "POSIXct")))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=1))), 1), is.raw))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a="a"))), 1), function(x) is(x, "POSIXct")))
    [1] FALSE
    > any(sapply(head(data.frame(list(list(a=raw(1)))), 1), is.raw))
    [1] TRUE
    > any(sapply(head(data.frame(list(list(a=as.POSIXct("2000-01-01")))), 1), function(x) is(x, "POSIXct")))
    [1] TRUE
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499902
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,91 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  requireNamespace1 <- requireNamespace
    +
    +  # For some reason, the Arrow R API requires 'defer_parent' from the 'withr' package to be
    +  # loadable. This is a workaround to avoid that error; otherwise we would have to load the
    +  # 'withr' package directly, which CRAN complains about.
    +  defer_parent <- function(x, ...) {
    +    if (requireNamespace1("withr", quietly = TRUE)) {
    +      defer_parent <- get("defer_parent", envir = asNamespace("withr"), inherits = FALSE)
    +      defer_parent(x, ...)
    +    } else {
    +      stop("'withr' package should be installed.")
    +    }
    +  }
    +
    +  # R API in Arrow is not yet released. CRAN requires to add the package in requireNamespace
    +  # at DESCRIPTION. Later, CRAN checks if the package is available or not. Therefore, it works
    +  # around by avoiding direct requireNamespace.
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    numPartitions <- if (!is.null(numPartitions)) {
    +      numToInt(numPartitions)
    +    } else {
    +      1
    --- End diff --
    
    future: consolidate the default here and inside makeSplits


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98755/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4859/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Yea .. I will do the follow-up work right away after this one gets merged. Thanks @felixcheung. Let me address the rest of the comments, and wait for the Arrow release.
    
    @BryanCutler BTW, do you know the rough expected timing for Arrow 0.12.0 release?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Let me hide some comments that are addressed (it looks messy). Please unhide them if I mistakenly hid comments that are not addressed yet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4967/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98514/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231847272
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }
     
       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --
    
    Note that this PR optimizes the original code path as well - when the input is a local R DataFrame, we avoid the `firstRDD` operation.
    
    In the master branch, the benchmark shows:
    
    ```
    Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space
    	at java.util.Arrays.copyOf(Arrays.java:3236)
    	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    ```
    
    So, technically this PR improves robustness here as well.
    
    If I try this with a `100000.csv` (79MB) file instead, to cut it short:
    
    **Current master**:
    
    ```
    Time difference of 8.502607 secs
    ```
    
    **With this PR, but without Arrow**
    
    ```
    Time difference of 5.143395 secs
    ```
    
    **With this PR, but with Arrow**
    
    
    ```
    Time difference of 0.6981369 secs
    ```
    
    So, technically this PR improves performance by more than **1200%**.
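    
    The timings above can be reproduced along these lines (a sketch of the
    measurement, not the exact script; `100000.csv` is a local file and not
    included here):
    
    ```r
    rdf <- read.csv("100000.csv")   # ~79MB local R data.frame
    startTime <- Sys.time()
    df <- createDataFrame(rdf)
    print(Sys.time() - startTime)   # prints the "Time difference of ..." lines above
    ```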


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402297
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,15 +196,17 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- tolower(callJMethod(conf, "get", "spark.sql.execution.arrow.enabled")) == "true"
    --- End diff --
    
    I think you can use sparkR.conf
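    
    e.g. (this matches what a later revision of the diff does):
    
    ```r
    # sparkR.conf(key) returns a named list; take the first element as the value.
    arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    ```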


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232476906
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    Let me try to reuse the R side slicing logic.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4928/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232453690
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API of Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be listed in DESCRIPTION, and it later checks whether the package is available.
    +  # Therefore, we work around this by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    --- End diff --
    
    nit: `arrow` -> `Arrow`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4922/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499848
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    --- End diff --
    
    are we going to ask shane to install arrow/withr on the Jenkins machines?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98510/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402726
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    +  numPartitions <- if (!is.null(numPartitions)) {
    +    numToInt(numPartitions)
    +  } else {
    +    1
    +  }
    +  fileName <- tempfile()
    --- End diff --
    
    might need to give it a dir prefix to use - the tempfile default is not CRAN compliant, and there could be ACL issues
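    
    e.g. something like (a sketch; `sparkRTempDir` is a hypothetical SparkR-managed
    directory, not an existing variable):
    
    ```r
    fileName <- tempfile(pattern = "sparkr-arrow", tmpdir = sparkRTempDir, fileext = ".tmp")
    ```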


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232176774
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw types.
    +          # Also, it does not support an explicit float type set by users, as it leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    unlink?
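    
    For comparison (both remove the file; the difference is mostly in failure
    behavior):
    
    ```r
    file.remove(fileName)   # returns FALSE with a warning if removal fails
    unlink(fileName)        # returns 0/1 and stays silent on a missing file
    ```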


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232500065
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    +        stopifnot(length(data) > 0)
    +        dataHead <- head(data, 1)
    +        checkTypeRequirementForArrow(data, schema)
    +        fileName <- writeToTempFileInArrow(data, numPartitions)
    +        tryCatch(
    +          jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                     "readArrowStreamFromFile",
    +                                     sparkSession,
    +                                     fileName),
    +        finally = {
    +          file.remove(fileName)
    +        })
    +
    +        firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +        TRUE
    +      },
    +      error = function(e) {
    +        warning(paste0("createDataFrame attempted Arrow optimization because ",
    +                       "'spark.sql.execution.arrow.enabled' is set to true; however, ",
    +                       "failed, attempting non-optimization. Reason: ",
    +                       e))
    +        return(FALSE)
    --- End diff --
    
    nit: just `FALSE` is good
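
    (In R the last evaluated expression of a function is its value, so the explicit `return()` is redundant; a quick illustration:)
    
    ```r
    handler <- function(e) {
      warning(paste0("falling back: ", e))
      FALSE  # equivalent to return(FALSE): the last expression is the value
    }
    ```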


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98614/testReport)** for PR 22954 at commit [`0903736`](https://github.com/apache/spark/commit/0903736b46e1ff737577f7db3f29db67795701ee).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98512/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98697/testReport)** for PR 22954 at commit [`92419d0`](https://github.com/apache/spark/commit/92419d055c9f9eb4cbe1172158863866b4401b4b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4846/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98696/testReport)** for PR 22954 at commit [`b2e0fc2`](https://github.com/apache/spark/commit/b2e0fc2d9e5e18b334b0177157ebd282b654c0a0).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    For encryption stuff, I will try to handle that as well (maybe as a follow-up?) so that we support it even when that's enabled.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98508/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232425279
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    Any idea what's going on with the `FloatType`? Is it a problem on the arrow side?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Adding @yanghaogn as well fyi


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98603/testReport)** for PR 22954 at commit [`7be15d3`](https://github.com/apache/spark/commit/7be15d36989ba1a00a86bb5eeb0b3d6fb7da3c4c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232525068
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    +
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]]
    +
    +  callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false")
    +  tryCatch({
    --- End diff --
    
    Just to inject the finally .. :-) ..
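
    For readers skimming the archive, a minimal sketch of the pattern (assuming the same `conf` and `arrowEnabled` variables as in the quoted test):
    
    ```r
    tryCatch({
      # test body that may fail
      expect_true(is(createDataFrame(mtcars), "SparkDataFrame"))
    }, finally = {
      # always restore the original conf value, whether the body passed or failed
      callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", arrowEnabled)
    })
    ```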


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98615/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Thanks, @felixcheung. I will address those comments during cleaning up.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98512/testReport)** for PR 22954 at commit [`614170e`](https://github.com/apache/spark/commit/614170e2187ebb904a767adb9289ac48300e533c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Let me leave a cc @felixcheung, @BryanCutler FYI.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98755/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98514/testReport)** for PR 22954 at commit [`b15d79c`](https://github.com/apache/spark/commit/b15d79c0d734d04a783470d793cf52e5c6c99b9d).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4850/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475752
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    Yea, I at least managed to get rid of this hack itself. Will push soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98628/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98760/testReport)** for PR 22954 at commit [`954bc0e`](https://github.com/apache/spark/commit/954bc0eec206902cb8176338e1f72886f5b3c626).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98615/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    I have finished most of the TODOs, except for waiting on the R API of Arrow 0.12.0 and making some changes accordingly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232173043
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    is it always in the conf - I think you can also pass in a default value to sparkR.conf
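
    A one-line sketch of the suggestion, using `sparkR.conf`'s optional default-value argument:
    
    ```r
    # read the conf with an explicit default so the lookup cannot fail when unset
    arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled", "false")[[1]] == "true"
    ```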


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98628/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r233209385
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    --- End diff --
    
    Oh if it's only when casting to a float, then maybe not that big of an issue. I just wanted to make sure a bug was filed for Arrow if the problem is there.
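
    One way to see why R is awkward here: R has no single-precision vector type, so every numeric is a 64-bit double, and a declared `FloatType` cannot round-trip exactly. A quick illustration in plain R:
    
    ```r
    typeof(1.5)            # "double" - R numerics are always double precision
    is.double(c(1, 2) / 3) # TRUE; there is no native float32 vector type in R
    ```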


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232177132
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    --- End diff --
    
    refactor this into a method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98692/testReport)** for PR 22954 at commit [`a81134e`](https://github.com/apache/spark/commit/a81134efe787bbd3b7a5b92d03b91e5f4e2f3cf6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473716
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    --- End diff --
    
    Yea, I checked that it always has the default value. I initially left the default value in but took it out after double-checking.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    adding @falaki and @mengxr as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232172687
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    +      stream_writer <- NULL
    +      for (rdf_slice in rdf_slices) {
    +        batch <- record_batch(rdf_slice)
    +        if (is.null(stream_writer)) {
    +          # We should avoid private calls like 'close_on_exit' (CRAN disallows) but looks
    +          # there's no exposed API for it. Here's a workaround but ideally this should
    +          # be removed.
    +          close_on_exit <- get("close_on_exit", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    actually, I think if you use withr here it will call close_on_exit for you? (but when?)
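
    To the "(but when?)" question, a hedged sketch of withr's timing semantics, using the public `withr::defer()` rather than arrow's private helper:
    
    ```r
    write_slice <- function(path, bytes) {
      con <- file(path, "wb")
      # defer() runs its expression when this function's frame exits,
      # not at garbage-collection time
      withr::defer(close(con))
      writeBin(bytes, con)
    }
    ```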


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98510/testReport)** for PR 22954 at commit [`46eaeca`](https://github.com/apache/spark/commit/46eaeca5854281a5688badbe71981029096674a5).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98697/testReport)** for PR 22954 at commit [`92419d0`](https://github.com/apache/spark/commit/92419d055c9f9eb4cbe1172158863866b4401b4b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4972/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98690 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98690/testReport)** for PR 22954 at commit [`6f28aa5`](https://github.com/apache/spark/commit/6f28aa5c79854cc3df176e0fa0fa6f3e7d21a98a).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167480
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    --- End diff --
    
    ok


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232425031
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    --- End diff --
    
    nit: a `RDD` -> an `RDD`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232499794
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
       unsetHiveContext()
     })
     
    +test_that("createDataFrame Arrow optimization", {
    +  skip_if_not_installed("arrow")
    +  skip_if_not_installed("withr")
    +
    +  conf <- callJMethod(sparkSession, "conf")
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]]
    +
    +  callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false")
    +  tryCatch({
    --- End diff --
    
    does this fail? or just a way to inject a finally?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402235
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    btw, is it worthwhile to check the arrow package version?
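
    A small sketch of such a guard (0.12.0 matches the PR description's requirement; the exact floor is an assumption here):
    
    ```r
    if (utils::packageVersion("arrow") < "0.12.0") {
      stop("Arrow optimization requires the 'arrow' R package, version 0.12.0 or later.")
    }
    ```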


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231401994
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    --- End diff --
    
    perhaps best to add a clearer error message?
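
    For instance, something along these lines (wording illustrative, not from the PR):
    
    ```r
    if (!requireNamespace("arrow", quietly = TRUE)) {
      stop("The 'arrow' package must be installed to use Arrow optimization ",
           "in createDataFrame() with an R data.frame.")
    }
    ```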


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98628/testReport)** for PR 22954 at commit [`2ba6add`](https://github.com/apache/spark/commit/2ba6addbcd52940ef989880bff69fe126a4dd2e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Hm .. the CRAN check passed locally for me. Let me work around it for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231414561
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    --- End diff --
    
    It's actually dependent on Arrow's API calls .. Those APIs were only added a few weeks ago. I will clean this up when the R API of Arrow is close to its release (at 0.12.0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232167110
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }
     
       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --
    
    I <3 4 digits!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98603/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232473643
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
         }
         sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
       }
    +
    +  /**
    +   * R callable function to read a file in Arrow stream format and create a `RDD`
    +   * using each serialized ArrowRecordBatch as a partition.
    +   */
    +  def readArrowStreamFromFile(
    +      sparkSession: SparkSession,
    +      filename: String): JavaRDD[Array[Byte]] = {
    +    ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, filename)
    --- End diff --
    
    Hmhmhm .. yea. What I was trying to do is to add the SQL-related code that R calls into the JVM here when it is not an official API, so that changing the internal APIs within Scala does not cause R test failures. I was trying to do the same thing on the PySpark side.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    I don't know R well enough to review that code, but the results look awesome! Nice work @HyukjinKwon!!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98690/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4792/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98760/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231402063
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  stopifnot(require("arrow", quietly = TRUE))
    +  stopifnot(require("withr", quietly = TRUE))
    --- End diff --
    
    is it possible to not depend on this withr? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232477155
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API in Arrow is not yet released. CRAN requires packages used via requireNamespace
    +  # to be declared in DESCRIPTION, and then checks that they are available. This works
    +  # around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    ah, any idea why that was done that way in python? this seems to be different from sc.parallelize, which is what https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152 does
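
    As a side note for readers, the chunking arithmetic in the quoted hunk works out like this (worked example: 5 rows into 2 partitions):
    
    ```r
    rdf <- data.frame(x = 1:5)
    numPartitions <- 2
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))                 # 3
    groups <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]  # 1 1 1 2 2
    rdf_slices <- split(rdf, groups)  # rows 1-3 in slice 1, rows 4-5 in slice 2
    ```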


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232475777
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
               x
             }
           }
    +      data[] <- lapply(data, cleanCols)
     
    -      # drop factors and wrap lists
    -      data <- setNames(lapply(data, cleanCols), NULL)
    +      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +      if (arrowEnabled) {
    +        shouldUseArrow <- tryCatch({
    +          stopifnot(length(data) > 0)
    +          dataHead <- head(data, 1)
    +          # Currently Arrow optimization does not support POSIXct and raw for now.
    +          # Also, it does not support explicit float type set by users. It leads to
    +          # incorrect conversion. We will fall back to the path without Arrow optimization.
    +          if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
    +            stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
    +          }
    +          if (any(sapply(dataHead, is.raw))) {
    +            stop("Arrow optimization with R DataFrame does not support raw type yet.")
    +          }
    +          if (inherits(schema, "structType")) {
    +            if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
    +              stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
    +            }
    +          }
    +          firstRow <- do.call(mapply, append(args, dataHead))[[1]]
    +          fileName <- writeToTempFileInArrow(data, numPartitions)
    +          tryCatch(
    +            jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
    +                                       "readArrowStreamFromFile",
    +                                       sparkSession,
    +                                       fileName),
    +          finally = {
    +            file.remove(fileName)
    --- End diff --
    
    I believe either way is fine.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Hey guys thanks for reviewing! Will address them soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98615/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    **[Test build #98699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98699/testReport)** for PR 22954 at commit [`c3f47ce`](https://github.com/apache/spark/commit/c3f47ce04c82f6fb6efe6d55baa19ddc8cf50dd3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232169938
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    --- End diff --
    
    should this default to the context's default parallelism?
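    
    For illustration, a sketch of what that fallback might look like, assuming the `getSparkContext`/`callJMethod`/`numToInt` helpers used elsewhere in SparkR (treat the exact call as an assumption):
    
    ```r
    # Sketch only: fall back to the context's default parallelism instead of 1.
    numPartitions <- if (!is.null(numPartitions)) {
      numToInt(numPartitions)
    } else {
      sc <- getSparkContext()
      # defaultParallelism() on the JVM-side context (assumed reachable here)
      numToInt(callJMethod(sc, "defaultParallelism"))
    }
    ```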


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232115966
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    --- End diff --
    
    It's actually a bit odd that I need to require this package manually; otherwise it complains, for instance, here: https://github.com/apache/arrow/blob/d3ec69069649013229366ebe01e22f389597dc19/r/R/on_exit.R#L20


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4939/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232478997
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
       l[["spark.sql.sources.default"]]
     }
     
    +writeToTempFileInArrow <- function(rdf, numPartitions) {
    +  # The R API for Arrow is not released yet. CRAN requires packages referenced via
    +  # requireNamespace to be declared in DESCRIPTION, and then checks that they are
    +  # available. We work around that check by avoiding a direct requireNamespace call.
    +  requireNamespace1 <- requireNamespace
    +  if (requireNamespace1("arrow", quietly = TRUE)) {
    +    record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +    record_batch_stream_writer <- get(
    +      "record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)
    +    file_output_stream <- get(
    +      "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
    +    write_record_batch <- get(
    +      "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
    +
    +    # Currently arrow requires withr; otherwise, write APIs don't work.
    +    # Direct 'require' is not recommended by CRAN. Here's a workaround.
    +    require1 <- require
    +    if (require1("withr", quietly = TRUE)) {
    +      numPartitions <- if (!is.null(numPartitions)) {
    +        numToInt(numPartitions)
    +      } else {
    +        1
    +      }
    +      fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
    +      chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
    +      rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)])
    --- End diff --
    
    Not sure. I think the intention is the same either way. Let me stick to the R idiom for now.
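    
    As a quick sanity check that the slice sizes come out the same as PySpark's ceil-division step (e.g. 10 rows over 3 partitions):
    
    ```r
    n <- 10; numPartitions <- 3
    chunk <- as.integer(ceiling(n / numPartitions))   # 4, same as -(-n // slices) in Python
    ids <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
    table(ids)                                        # 4 4 2
    ```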


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r232619364
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
     createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
                                 numPartitions = NULL) {
       sparkSession <- getSparkSession()
    -
    +  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
    +  shouldUseArrow <- FALSE
    +  firstRow <- NULL
       if (is.data.frame(data)) {
    -      # Convert data into a list of rows. Each row is a list.
    -
    -      # get the names of columns, they will be put into RDD
    -      if (is.null(schema)) {
    -        schema <- names(data)
    -      }
    +    # get the names of columns, they will be put into RDD
    +    if (is.null(schema)) {
    +      schema <- names(data)
    +    }
     
    -      # get rid of factor type
    -      cleanCols <- function(x) {
    -        if (is.factor(x)) {
    -          as.character(x)
    -        } else {
    -          x
    -        }
    +    # get rid of factor type
    +    cleanCols <- function(x) {
    +      if (is.factor(x)) {
    +        as.character(x)
    +      } else {
    +        x
           }
    +    }
    +    data[] <- lapply(data, cleanCols)
    +
    +    args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
    +    if (arrowEnabled) {
    +      shouldUseArrow <- tryCatch({
    --- End diff --
    
    When `shouldUseArrow` is true, I think it means we have already finished using Arrow? Maybe just `useArrow`?
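    
    For context, the flag names the result of a tryCatch that attempts the whole Arrow path and falls back on any error — roughly like this simplified sketch (not the PR's exact code):
    
    ```r
    useArrow <- tryCatch({
      # ... attempt the Arrow conversion end-to-end ...
      TRUE                              # reached only if every step succeeded
    }, error = function(e) {
      warning("Arrow optimization failed; falling back: ", conditionMessage(e))
      FALSE
    })
    if (!useArrow) {
      # ... original non-Arrow createDataFrame path ...
    }
    ```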


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98755/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22954: [WIP] Enables Arrow optimization from R DataFrame to Spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22954
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org