Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/06 16:18:18 UTC

[GitHub] [arrow] nealrichardson edited a comment on pull request #8650: ARROW-10530: [R] Use Converter API to convert SEXP to Array/ChunkedArray

nealrichardson edited a comment on pull request #8650:
URL: https://github.com/apache/arrow/pull/8650#issuecomment-754955102


   Since the latest commits aren't compiling, I did some benchmarking on https://github.com/apache/arrow/pull/8650/commits/bcb1be733697b0e7ca86534a6700b5816e0dad46. Summary of findings:
   
   * Character to string conversion is usually (but not always) faster with the new code, around 20-30% better. Because string conversion is generally slower than other types, a small percentage improvement can be significant.
   * Integer and integer64 conversion is slower by an order of magnitude or more in the new code.
   * `bench::mark` didn't report results for numeric vectors because the two expressions returned unequal results, which fails its default equality check (`check = TRUE`).
   
   Not sure where things will stand with the latest changes, but I think this suggests that the (numpy-like) special handling for vector types that can simply be copied or moved to Arrow is important where appropriate. Beyond that, the string results suggest that there is some performance gain to be had with this work, and if the new approach handles chunking and parallelization, we can do even better.
   
   Code:
   
   ```r
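   library(arrow)
   # Download the Fannie Mae loan performance sample and read it with Arrow
   download.file("https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", "fanniemae.csv.gz")
   df <- read_delim_arrow("fanniemae.csv.gz", delim = "|", col_names = FALSE)
   dim(df)
   ## [1] 22180168       31
   # Benchmark the two conversion functions column by column;
   # try() keeps the loop going when bench::mark() errors on unequal results
   for (n in names(df)) {
     print(n)
     print(class(df[[n]]))
     try(print(bench::mark(
       arrow:::Array__from_vector(df[[n]], NULL),
       arrow:::vec_to_arrow(df[[n]], NULL)
     )))
   }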
   ```
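   
   For columns where `bench::mark` refuses to report because the results differ, one option is to inspect the mismatch first and then rerun with the equality check turned off. The following is only a sketch under assumptions: `old_path` and `new_path` are hypothetical stand-ins (not the arrow internals above), and the scaling factor is a contrived difference, since the actual discrepancy in the numeric columns is not described here.
   
   ```r
   # Hypothetical stand-ins for two conversion paths (not the arrow functions above)
   old_path <- function(x) as.double(x)
   new_path <- function(x) as.double(x) * (1 + 1e-6)  # contrived mismatch

   x <- c(1.5, 2.25, NA, 4)

   # all.equal() returns TRUE when the values match within tolerance,
   # otherwise a character description of the difference
   diff <- all.equal(old_path(x), new_path(x))
   print(diff)

   # check = FALSE skips bench::mark()'s equality check, so timings are
   # reported even though the two results differ
   if (requireNamespace("bench", quietly = TRUE)) {
     print(bench::mark(old_path(x), new_path(x), check = FALSE))
   }
   ```
   
   Timings gathered this way should be read with care, because the two expressions are no longer confirmed to be doing equivalent work.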
   
   There's also a NYC taxi CSV at https://ursa-qa.s3.amazonaws.com/nyctaxi/yellow_tripdata_2010-01.csv.gz that you can test with (just use `read_csv_arrow()`; the file has column names).

