You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/06 15:02:57 UTC

[GitHub] [arrow] paleolimbot commented on a diff in pull request #14271: ARROW-17187: [R] Improve lazy ALTREP implementation for String

paleolimbot commented on code in PR #14271:
URL: https://github.com/apache/arrow/pull/14271#discussion_r989165031


##########
r/src/altrep.cpp:
##########
@@ -777,20 +822,39 @@ struct AltrepVectorString : public AltrepVectorBase<AltrepVectorString<Type>> {
     std::shared_ptr<Array> array_;
     const StringArrayType* string_array_;
     std::string stripped_string_;
-    const bool strip_out_nuls_;
+    bool strip_out_nuls_;
     bool nul_was_stripped_;
     std::string_view view_;
   };
 
-  // Get a single string, as a CHARSXP SEXP from data2.
-  // Materialize if not done so yet, given that it is
-  // likely that there will be another call from R if there is a call (e.g. unique()),
-  // and getting a string from Array is much more costly than from data2.
+  // Get a single string as a CHARSXP SEXP
+  // Note that a previous version of this function used BEGIN_CPP11
+  // and END_CPP11, which stack-allocates 8kb for the error message.
+  // This is fine for top-level calls, but here
   static SEXP Elt(SEXP alt, R_xlen_t i) {
-    if (!Base::IsMaterialized(alt)) {
-      Materialize(alt);
+    if (Base::IsMaterialized(alt)) {
+      return STRING_ELT(Representation(alt), i);
     }
-    return STRING_ELT(Representation(alt), i);
+
+    auto altrep_data = external_pointer<ArrowAltrepData>(R_altrep_data1(alt));
+    auto resolve = altrep_data->locate(i);
+    auto array = altrep_data->chunked_array->chunk(resolve.chunk_index);
+    auto j = resolve.index_in_chunk;
+
+    SEXP s = NA_STRING;
+    RStringViewer& r_string_viewer = string_viewer();
+    r_string_viewer.SetArray(array);
+    // Note: we don't check GetBoolOption("arrow.skip_nul", false) here
+    // because it is too expensive to do so. We do set this value whenever
+    // an altrep string; however, there is a chance that this value could
+    // be out of date by the time a value in the vector is accessed.
+    r_string_viewer.reset_null_was_stripped();
+    s = r_string_viewer.Convert(j);
+    if (r_string_viewer.nul_was_stripped()) {
+      Rf_warning("Stripping '\\0' (nul) from character vector");

Review Comment:
   I change these lines to `reinterpret_cast<ArrowAltrepData*>(R_ExternalPtrAddr(R_altrep_data1(alt)))`, since I forget exactly what the constructor for `external_pointer()` does.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org