Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/14 09:44:00 UTC

[GitHub] HyukjinKwon opened a new pull request #23787: [SPARK-26830][SQL][R] Vectorized R dapply() implementation

URL: https://github.com/apache/spark/pull/23787
 
 
   ## What changes were proposed in this pull request?
   
    This PR adds a vectorized `dapply()` implementation in SparkR, backed by Arrow optimization.
   
   This can be tested as below:
   
   ```bash
   $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
   ```
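    Alternatively, if the session is started from an existing R process, the same flag can be passed through `sparkR.session()` (a minimal sketch; the local master is an assumption for testing):
    
    ```r
    library(SparkR)
    # Enable Arrow optimization for this session (local master assumed).
    sparkR.session(master = "local[*]",
                   sparkConfig = list(spark.sql.execution.arrow.enabled = "true"))
    ```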
   
   ```r
   df <- createDataFrame(mtcars)
   collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
   ```
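    For reference, the collected result should be a plain R `data.frame` with a single `gear` column holding the `mtcars` gear values incremented by one, roughly as below (row order may vary across partitions):
    
    ```
      gear
    1    5
    2    5
    3    5
    4    4
    ...
    ```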
   
   ### Requirements
     - R 3.5.x 
     - Arrow package 0.12+
       ```bash
       Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
       ```
   
    **Note:** currently, the Arrow R package is not available on CRAN. Please take a look at ARROW-3204.
    **Note:** currently, the Arrow R package does not appear to support Windows. Please take a look at ARROW-3204.
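    
    A quick way to verify the installation from R, using only base utilities:
    
    ```r
    # Check that the Arrow R package is installed and new enough.
    if (requireNamespace("arrow", quietly = TRUE)) {
      print(packageVersion("arrow"))  # expect 0.12.0 or later
    } else {
      stop("The 'arrow' package is not installed; see the Rscript command above.")
    }
    ```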
   
   
   ### Benchmarks
   
    **Shell**
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
   ```
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
   ```
   
   **R code**
   
   ```r
   rdf <- read.csv("500000.csv")
    rdf <- rdf[, c("First.Name", "Month.of.Joining")]  # Keep only the key and value columns needed for the computation.
   df <- cache(createDataFrame(rdf))
   count(df)
   
   test <- function() {
     options(digits.secs = 6) # milliseconds
     start.time <- Sys.time()
     count(dapply(df,
                  function(rdf) {
                    rdf$Month_of_Joining <- rdf$Month_of_Joining + 1
                    rdf
                  },
                  structType("First_Name string, Month_of_Joining double")))
     end.time <- Sys.time()
     time.taken <- end.time - start.time
     print(time.taken)
   }
   
   test()
   ```
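    Equivalently, base R's `system.time()` captures elapsed wall-clock time without the manual `Sys.time()` bookkeeping (a sketch of the same benchmark):
    
    ```r
    # "elapsed" is wall-clock seconds for the whole dapply + count.
    timing <- system.time(
      count(dapply(df,
                   function(rdf) {
                     rdf$Month_of_Joining <- rdf$Month_of_Joining + 1
                     rdf
                   },
                   structType("First_Name string, Month_of_Joining double")))
    )
    print(timing[["elapsed"]])
    ```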
   
   **Data (350 MB):**
   
   ```r
   object.size(read.csv("500000.csv"))
   350379504 bytes
   ```
   
   "500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/
   
    **Results**
    
    With Arrow optimization disabled:
    
    ```
    Time difference of 92.78868 secs
    ```
    
    With Arrow optimization enabled:
    
    ```
    Time difference of 1.997686 secs
    ```
   
    The performance improvement was around **4545%** (a roughly 46x speedup).
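    
    The figure follows directly from the two timings above:
    
    ```r
    92.78868 / 1.997686                     # ~46.4x faster
    (92.78868 - 1.997686) / 1.997686 * 100  # ~4545% improvement
    ```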
   
   
   ### Limitations
   
    - For now, Arrow optimization with R does not support `raw` data or schemas where the user explicitly specifies a float type; both cases produce corrupt values (see the sketch after this list).
   
    - Due to ARROW-4512, batches cannot be sent and received one by one; all batches must be sent at once in the Arrow stream format. This needs improvement later.
   
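    To make the first limitation concrete, a hypothetical illustration (the `gear` column from `mtcars` is just a placeholder):
    
    ```r
    df <- createDataFrame(mtcars)
    # Explicitly declaring "float" in the schema is not supported under Arrow
    # optimization and can produce corrupt values:
    collect(dapply(df, function(rdf) data.frame(rdf$gear), structType("gear float")))
    # Declaring the column as "double" works as expected:
    collect(dapply(df, function(rdf) data.frame(rdf$gear), structType("gear double")))
    ```
    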
   ## How was this patch tested?
   
    Unit tests were added, and the change was tested manually.
   
