Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/12 04:07:06 UTC

[GitHub] HyukjinKwon opened a new pull request #23760: [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame

URL: https://github.com/apache/spark/pull/23760
 
 
   ## What changes were proposed in this pull request?
   
   This PR targets to support Arrow optimization for conversion from Spark DataFrame to R DataFrame.
    As on the PySpark side, it falls back to the non-optimized code path when Arrow optimization cannot be used.
   
   This can be tested as below:
   
   ```bash
   $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
   ```
   
   ```r
   collect(createDataFrame(mtcars))
   ```
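    The optimization can also be inspected and toggled from within a running SparkR session; a minimal sketch, assuming a session has already been started with `sparkR.session()` (both `sparkR.conf` and `sql` are existing SparkR API functions):
    
    ```r
    # Read the current value of the flag.
    sparkR.conf("spark.sql.execution.arrow.enabled")
    
    # Enable the optimization for this session via a SQL SET command.
    sql("SET spark.sql.execution.arrow.enabled=true")
    
    collect(createDataFrame(mtcars))
    ```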
   
   ### Requirements
     - R 3.5.x 
     - Arrow package 0.12+
       ```bash
       Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
       ```
   
    **Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
    **Note:** currently, the Arrow R package does not appear to support Windows. Please take a look at ARROW-3204.
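    A quick way to confirm the requirements above from an R prompt (both functions are standard base/utils R):
    
    ```r
    # Check the running R version; 3.5.x is required.
    getRversion()
    
    # Check the installed Arrow R package version; 0.12 or newer is required.
    packageVersion("arrow")
    ```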
   
   
   ### Benchmarks
   
    **Shell**
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
   ```
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
   ```
   
   **R code**
   
   ```r
   df <- cache(createDataFrame(read.csv("500000.csv")))
   count(df)
   
   test <- function() {
      options(digits.secs = 6) # show fractional seconds to 6 digits
     start.time <- Sys.time()
     collect(df)
     end.time <- Sys.time()
     time.taken <- end.time - start.time
     print(time.taken)
   }
   
   test()
   ```
   
   **Data (350 MB):**
   
   ```r
   object.size(read.csv("500000.csv"))
   350379504 bytes
   ```
   
   "500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/
   
    **Results**
    
    Without Arrow optimization (`spark.sql.execution.arrow.enabled=false`):
    
    ```
    Time difference of 221.32014 secs
    ```
    
    With Arrow optimization (`spark.sql.execution.arrow.enabled=true`):
    
    ```
    Time difference of 8.579493 secs
    ```
   
    The performance improvement was around **2579%** (roughly a 25.8x speedup).
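    The quoted percentage is just the ratio of the two timings above:
    
    ```r
    no_arrow   <- 221.32014  # seconds, spark.sql.execution.arrow.enabled=false
    with_arrow <- 8.579493   # seconds, spark.sql.execution.arrow.enabled=true
    no_arrow / with_arrow * 100  # roughly 2580, i.e. about a 25.8x speedup
    ```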
   
   ### Limitations:
   
    - For now, Arrow optimization with R does not support cases where the data is `raw` or where the user explicitly specifies a float type in the schema. These produce corrupt values, so in such cases we fall back to the non-optimized code path.
   
    - Due to ARROW-4512, batches cannot be sent and received one at a time; all batches have to be sent at once in the Arrow stream format. This needs improvement later.
   
   ## How was this patch tested?
   
    Existing tests related to Arrow optimization cover this change. I also fixed a test title.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org