You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by pa...@apache.org on 2022/07/22 17:04:50 UTC

[arrow] branch master updated: ARROW-16578: [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file (#13415)

This is an automated email from the ASF dual-hosted git repository.

paleolimbot pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 9ad22551f7 ARROW-16578: [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file (#13415)
9ad22551f7 is described below

commit 9ad22551f73a93e5ef23b6ebdb6473a81234d38e
Author: Hideaki Hayashi <hi...@gmail.com>
AuthorDate: Fri Jul 22 10:04:45 2022 -0700

    ARROW-16578: [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file (#13415)
    
    Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file".
    
    Here I'm materializing the AltrepVectorString at the first call to Elt.
    My thought is that it would make sense since it is likely that there will be another call from R if there is one call (e.g. unique()), and also because getting a string from Array seems to be much more costly than from data2.
    Something like 3-strike rule may make sense too, but here in this PR, I'm taking this simple approach.
    
    ARROW-16578 reprex with the fix:
    ```
    > df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
    > write_parquet(df1,"/tmp/test.parquet")
    > df2 <- read_parquet("/tmp/test.parquet")
    > system.time(unique(df2$x))
       user  system elapsed
      0.074   0.002   0.082
    > system.time(unique(df1$x))
       user  system elapsed
      0.022   0.001   0.025
    > system.time(is.na(df2$x))
       user  system elapsed
      0.006   0.001   0.006
    > system.time(is.na(df1$x))
       user  system elapsed
      0.003   0.000   0.004
    ```
    
    devtools::test() result:
    ```
    [ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
    ```
    
    
    Authored-by: Hideaki Hayashi <hi...@gmail.com>
    Signed-off-by: Dewey Dunnington <de...@fishandwhistle.net>
---
 r/src/altrep.cpp | 33 +++++++--------------------------
 1 file changed, 7 insertions(+), 26 deletions(-)

diff --git a/r/src/altrep.cpp b/r/src/altrep.cpp
index f0cd33fa95..97bb72b3df 100644
--- a/r/src/altrep.cpp
+++ b/r/src/altrep.cpp
@@ -782,34 +782,15 @@ struct AltrepVectorString : public AltrepVectorBase<AltrepVectorString<Type>> {
     util::string_view view_;
   };
 
-  // Get a single string, as a CHARSXP SEXP,
-  // either from data2 or directly from the Array
+  // Get a single string, as a CHARSXP SEXP from data2.
+  // Materialize if not done so yet, given that it is
+  // likely that there will be another call from R if there is a call (e.g. unique()),
+  // and getting a string from Array is much more costly than from data2.
   static SEXP Elt(SEXP alt, R_xlen_t i) {
-    if (Base::IsMaterialized(alt)) {
-      return STRING_ELT(Representation(alt), i);
+    if (!Base::IsMaterialized(alt)) {
+      Materialize(alt);
     }
-    BEGIN_CPP11
-
-    ArrayResolve resolve(GetChunkedArray(alt), i);
-    auto array = resolve.array_;
-    auto j = resolve.index_;
-
-    RStringViewer r_string_viewer;
-    r_string_viewer.SetArray(array);
-
-    // r_string_viewer.Convert() might jump so it's wrapped
-    // in cpp11::unwind_protect() so that string_viewer
-    // can be properly destructed before the unwinding continues
-    SEXP s = NA_STRING;
-    cpp11::unwind_protect([&]() {
-      s = r_string_viewer.Convert(j);
-      if (r_string_viewer.nul_was_stripped()) {
-        cpp11::warning("Stripping '\\0' (nul) from character vector");
-      }
-    });
-    return s;
-
-    END_CPP11
+    return STRING_ELT(Representation(alt), i);
   }
 
   static void* Dataptr(SEXP alt, Rboolean writeable) { return DATAPTR(Materialize(alt)); }