You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 15:54:58 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

nealrichardson opened a new pull request #8365:
URL: https://github.com/apache/arrow/pull/8365


   @romainfrancois see the comment I added with the reprex: this fails differently now, presumably due to the cpp11 change


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551542022



##########
File path: r/src/array_to_vector.cpp
##########
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
     }
 
     StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
-    if (array->null_count()) {
-      // need to watch for nulls
-      arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-                                                array->offset(), n);
+
+    const bool all_valid = array->null_count() == 0;
+    const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+    bool nul_was_stripped = false;
+
+    if (all_valid) {
+      // no need to watch for missing strings
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++, null_reader.Next()) {
-          if (null_reader.IsSet()) {
-            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
-          } else {
-            SET_STRING_ELT(data, start + i, NA_STRING);
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++) {
+            SET_STRING_ELT(data, start + i,
+                           r_string_from_view_strip_nul(string_array->GetView(i),
+                                                        &nul_was_stripped));
           }
+          return;

Review comment:
       This `return` is breaking out of the lambda wrapped by `unwind_protect`; it doesn't make the parent scope return. If you prefer, we could use an `else` here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-726237202


   What you describe (including using GetView) is essentially what we now have on master: https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L290-L321
   
   The difference is that we moved back to `Rf_mkCharLenCE`, which is much faster because length is known and errors (correctly) when there are embedded nuls, instead of `Rf_mkCharCE`, which IIUC was the reason for the performance regression in #8356 and which was responsible for silently truncating at an embedded nul. 
   
   If `Rf_mkCharLenCE` is what is raising the "embedded nul in string" error, then we have the option of catching that and falling back to a slower `skipNul` path to strip the nulls. If the error comes from somewhere else later, we may need to consider other options.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-726577844


   It does look like `Rf_mkCharLenCE()` is generating the error: 
   
   ``` r
   cpp11::cpp_eval('Rf_mkCharLenCE("camer\\0a", 6, CE_UTF8)')
   #> Error in f(): embedded nul in string: 'camer\0'
   ```
   
   <sup>Created on 2020-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-725971159


   I think failing asap is better, either with the current code, or with an `unwind_protect()` outside the loop. 
   
   ```cpp
   StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
       auto unsafe_r_string = [](const std::string& s) {
         return Rf_mkCharCE(s.c_str(), CE_UTF8);
       };
       cpp11::unwind_protect([&] {
         if (array->null_count()) {
           // need to watch for nulls
           arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                     array->offset(), n);
           for (int i = 0; i < n; i++, null_reader.Next()) {
             if (null_reader.IsSet()) {
               SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
             } else {
               SET_STRING_ELT(data, start + i, NA_STRING);
             }
           }
   
         } else {
           for (int i = 0; i < n; i++) {
             SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
           }
         }
       });
   
       return Status::OK();
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-718021849


   Arguably failing is better than silently truncating, but that puts us back at the original user report. I see our options as:
   
   1. Do nothing in the code; document a workflow of casting the `utf8` column to `binary`, then stripping out the `00`s yourself in R
   2. In arrow_to_vector, if a nul is hit, downcast to binary and proceed with conversion (raise a warning instead of erroring)
   3. Add a `skipNul`-like option (cf. https://stat.ethz.ch/R-manual/R-devel/library/base/html/readLines.html and others) and strip nulls if they're hit, raising a warning but keeping the result as a character vector.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johncassil commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
johncassil commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-754215072


   @jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois 
   
   Just wanted to say thanks for working on this.  I reported it a long time ago and have just been periodically watching the developments slowly progress.   I'm excited to see that there will be a resolution!
   
   Cheers!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-717996007


   based on @bkietz comment from https://github.com/apache/arrow/pull/8536 Converter_String::Ingest_some_nulls() could end with: 
   
   ```cpp
   cpp11::unwind_protect([&] {
         if (array->null_count()) {
           // need to watch for nulls
           arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                     array->offset(), n);
           for (int i = 0; i < n; i++, null_reader.Next()) {
             if (null_reader.IsSet()) {
               SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
             } else {
               SET_STRING_ELT(data, start + i, NA_STRING);
             }
           }
   
         } else {
           for (int i = 0; i < n; i++) {
             SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
           }
         }
       });
   ```
   
   this builds on knowing that `GetString()` gives an utf8 string, and that we know its size, as opposed to what `cpp11::r_string(std::string)` does here: https://github.com/r-lib/cpp11/blob/master/inst/include/cpp11/r_string.hpp#L18 and uses unwind protection around the loop instead of the individual conversions. 
   
   i.e. it assumes utf-8 but since it does not use the known size, it searches for `nul`: r_string(const std::string& data) : data_(safe[Rf_mkCharCE](data.c_str(), CE_UTF8)) {}
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8365:
URL: https://github.com/apache/arrow/pull/8365


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551551461



##########
File path: r/src/array_to_vector.cpp
##########
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
     }
 
     StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
-    if (array->null_count()) {
-      // need to watch for nulls
-      arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-                                                array->offset(), n);
+
+    const bool all_valid = array->null_count() == 0;
+    const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+    bool nul_was_stripped = false;
+
+    if (all_valid) {
+      // no need to watch for missing strings
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++, null_reader.Next()) {
-          if (null_reader.IsSet()) {
-            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
-          } else {
-            SET_STRING_ELT(data, start + i, NA_STRING);
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++) {
+            SET_STRING_ELT(data, start + i,
+                           r_string_from_view_strip_nul(string_array->GetView(i),
+                                                        &nul_was_stripped));
           }
+          return;

Review comment:
       no preference, just trying to understand




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r513455884



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -616,6 +616,23 @@ test_that("Array$create() handles vector -> fixed size list arrays", {
   )
 })
 
+test_that("Handling string data with embedded nuls", {
+  raws <- structure(list(
+    as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e)),
+    as.raw(c(0x77, 0x6f, 0x6d, 0x61, 0x6e)),
+    as.raw(c(0x6d, 0x61, 0x00, 0x6e)), # <-- there's your nul, 0x00
+    as.raw(c(0x63, 0x61, 0x6d, 0x65, 0x72, 0x61)),
+    as.raw(c(0x74, 0x76))),
+    class = c("arrow_binary", "vctrs_vctr", "list"))
+  expect_error(rawToChar(raws[[3]]), "nul") # See?
+  array_with_nul <- Array$create(raws)$cast(utf8())
+  # In version 1.0.1, as.vector(array_with_nul) errors:
+  # Error in Array__as_vector(self) : embedded nul in string: 'ma\0n'
+  # On master (presumably this is a cpp11 thing) it does not error,
+  # but it terminates the string early:
+  # [1] "person" "woman"  "ma"     "camera" "tv"

Review comment:
       🙈




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jimhester commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
jimhester commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-718020238


   Oh hmm, it is not on purpose, I think it was just copy pasted from the `const char*` implementation above. We should use `mkCharLenCE` for `std::string()` since we know the length.
   
   That being said having an embedded `NULL` in any `CHARSXP` is not really going to work in R. Lots of places in the R codebase assumes strings are null terminated. I think the only way you can handle string data with embedded nulls in R is with raw vectors.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jimhester edited a comment on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
jimhester edited a comment on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-718020238


   Oh hmm, it is not on purpose, I think it was just copy pasted from the `const char*` implementation above. We should use `mkCharLenCE` for `std::string()` since we know the length.
   
   That being said having an embedded `NULL` in any `CHARSXP` is not really going to work in R. Lots of places in the R codebase assumes strings are null terminated. I think the only way you can handle string data with embedded nulls in R is with raw vectors.
   
   Which erroring maybe is the intent here, so perhaps switching this to `mkCharLenCE` will do what we want.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-718012902


   @nealrichardson is the intent that we do get to the `embedded nul in string: 'ma\0n'` error ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dianaclarke commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
dianaclarke commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r500667844



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -616,6 +616,23 @@ test_that("Array$create() handles vector -> fixed size list arrays", {
   )
 })
 
+test_that("Handling string data with embedded nuls", {
+  raws <- structure(list(
+    as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e)),
+    as.raw(c(0x77, 0x6f, 0x6d, 0x61, 0x6e)),
+    as.raw(c(0x6d, 0x61, 0x00, 0x6e)), # <-- there's your nul, 0x00
+    as.raw(c(0x63, 0x61, 0x6d, 0x65, 0x72, 0x61)),
+    as.raw(c(0x74, 0x76))),
+    class = c("arrow_binary", "vctrs_vctr", "list"))
+  expect_error(rawToChar(raws[[3]]), "nul") # See?
+  array_with_nul <- Array$create(raws)$cast(utf8())
+  # In version 1.0.1, as.vector(array_with_nul) errors:
+  # Error in Array__as_vector(self) : embedded nul in string: 'ma\0n'
+  # On master (presumably this is a cpp11 thing) it does not error,
+  # but it terminates the string early:
+  # [1] "person" "woman"  "ma"     "camera" "tv"

Review comment:
       > # [1] "person" "woman"  "ma"     "camera" "tv"
   
   😆




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois edited a comment on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois edited a comment on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-726577844


   It does look like `Rf_mkCharLenCE()` is generating the error: 
   
   ``` r
   cpp11::cpp_eval('Rf_mkCharLenCE("camer\\0a", 7, CE_UTF8)')
   #> Error in f(): embedded nul in string: 'camer\0a'
   ```
   
   <sup>Created on 2020-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551519433



##########
File path: r/src/array_to_vector.cpp
##########
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
     }
 
     StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
-    if (array->null_count()) {
-      // need to watch for nulls
-      arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-                                                array->offset(), n);
+
+    const bool all_valid = array->null_count() == 0;
+    const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+    bool nul_was_stripped = false;
+
+    if (all_valid) {
+      // no need to watch for missing strings
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++, null_reader.Next()) {
-          if (null_reader.IsSet()) {
-            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
-          } else {
-            SET_STRING_ELT(data, start + i, NA_STRING);
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++) {
+            SET_STRING_ELT(data, start + i,
+                           r_string_from_view_strip_nul(string_array->GetView(i),
+                                                        &nul_was_stripped));
           }
+          return;

Review comment:
       The main function return (L347) is `return Status::OK();` -- should these do the same? Or do I misunderstand what this return does?

##########
File path: r/src/array_to_vector.cpp
##########
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
     }
 
     StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
-    if (array->null_count()) {
-      // need to watch for nulls
-      arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-                                                array->offset(), n);
+
+    const bool all_valid = array->null_count() == 0;
+    const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+    bool nul_was_stripped = false;
+
+    if (all_valid) {
+      // no need to watch for missing strings
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++, null_reader.Next()) {
-          if (null_reader.IsSet()) {
-            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
-          } else {
-            SET_STRING_ELT(data, start + i, NA_STRING);
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++) {
+            SET_STRING_ELT(data, start + i,
+                           r_string_from_view_strip_nul(string_array->GetView(i),
+                                                        &nul_was_stripped));
           }
+          return;
+        }
+
+        for (int i = 0; i < n; i++) {
+          SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
         }
       });
     } else {
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++) {
-          SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
+        arrow::internal::BitmapReader validity_reader(array->null_bitmap_data(),
+                                                      array->offset(), n);
+
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++, validity_reader.Next()) {
+            if (validity_reader.IsSet()) {
+              SET_STRING_ELT(data, start + i,
+                             r_string_from_view_strip_nul(string_array->GetView(i),
+                                                          &nul_was_stripped));
+            } else {
+              SET_STRING_ELT(data, start + i, NA_STRING);
+            }
+          }
+          return;
+        }
+
+        for (int i = 0; i < n; i++, validity_reader.Next()) {
+          if (validity_reader.IsSet()) {
+            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
+          } else {
+            SET_STRING_ELT(data, start + i, NA_STRING);
+          }
         }
       });
     }
 
+    if (nul_was_stripped) {
+      cpp11::warning("Stripping '\\0' (nul) from character vector");
+    }
+
     return Status::OK();
   }
 
   bool Parallel() const { return false; }
 
  private:
-  static SEXP r_string_from_view(const arrow::util::string_view& view) {
+  static SEXP r_string_from_view(arrow::util::string_view view) {

Review comment:
       Why this change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551542391



##########
File path: r/src/array_to_vector.cpp
##########
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
     }
 
     StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
-    if (array->null_count()) {
-      // need to watch for nulls
-      arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-                                                array->offset(), n);
+
+    const bool all_valid = array->null_count() == 0;
+    const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+    bool nul_was_stripped = false;
+
+    if (all_valid) {
+      // no need to watch for missing strings
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++, null_reader.Next()) {
-          if (null_reader.IsSet()) {
-            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
-          } else {
-            SET_STRING_ELT(data, start + i, NA_STRING);
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++) {
+            SET_STRING_ELT(data, start + i,
+                           r_string_from_view_strip_nul(string_array->GetView(i),
+                                                        &nul_was_stripped));
           }
+          return;
+        }
+
+        for (int i = 0; i < n; i++) {
+          SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
         }
       });
     } else {
       cpp11::unwind_protect([&] {
-        for (int i = 0; i < n; i++) {
-          SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
+        arrow::internal::BitmapReader validity_reader(array->null_bitmap_data(),
+                                                      array->offset(), n);
+
+        if (strip_out_nuls) {
+          for (int i = 0; i < n; i++, validity_reader.Next()) {
+            if (validity_reader.IsSet()) {
+              SET_STRING_ELT(data, start + i,
+                             r_string_from_view_strip_nul(string_array->GetView(i),
+                                                          &nul_was_stripped));
+            } else {
+              SET_STRING_ELT(data, start + i, NA_STRING);
+            }
+          }
+          return;
+        }
+
+        for (int i = 0; i < n; i++, validity_reader.Next()) {
+          if (validity_reader.IsSet()) {
+            SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
+          } else {
+            SET_STRING_ELT(data, start + i, NA_STRING);
+          }
         }
       });
     }
 
+    if (nul_was_stripped) {
+      cpp11::warning("Stripping '\\0' (nul) from character vector");
+    }
+
     return Status::OK();
   }
 
   bool Parallel() const { return false; }
 
  private:
-  static SEXP r_string_from_view(const arrow::util::string_view& view) {
+  static SEXP r_string_from_view(arrow::util::string_view view) {

Review comment:
       string_view is a trivially copyable structure so there's no need to protect it from copies with `const&`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-752264998


   It would be good to get this resolved for 3.0. I pushed a naive fix: if `arrow.skip_nul = TRUE` (default FALSE, per base::readLines and base::scan), we go through a slow path and strip out nuls. A better solution (1) would check `arrow.skip_nul` outside of the loop (I could do this now but figure there's a smarter C/C++ way than I would have come up with); (2) would probably try the fast conversion path anyway and catch the exception and retry with the slow path, so even if arrow.skip_nul == true, we only do the slow path if there is a nul (I tried try/catch but didn't get it working right); and (3) it should raise a warning when it does strip a nul, as I believe readLines and scan do.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois edited a comment on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
romainfrancois edited a comment on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-717996007


   if what we want is that the nul is kept, based on @bkietz comment from https://github.com/apache/arrow/pull/8536 `Converter_String::Ingest_some_nulls()` could end with: 
   
   ```cpp
       cpp11::unwind_protect([&] {
         if (array->null_count()) {
           // need to watch for nulls
           arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                     array->offset(), n);
           for (int i = 0; i < n; i++, null_reader.Next()) {
             if (null_reader.IsSet()) {
               SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
             } else {
               SET_STRING_ELT(data, start + i, NA_STRING);
             }
           }
   
         } else {
           for (int i = 0; i < n; i++) {
             SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
           }
         }
       });
   ```
   
   with: 
   
   ```cpp
   private:
     SEXP unsafe_r_string(const std::string& s) const {
       return Rf_mkCharLenCE(s.c_str(), s.size(), CE_UTF8);
     }
   ```
   
   this builds on knowing that `GetString()` gives an utf8 string, and that we know its size, as opposed to what `cpp11::r_string(std::string)` does here: https://github.com/r-lib/cpp11/blob/master/inst/include/cpp11/r_string.hpp#L18 and uses unwind protection around the loop instead of the individual conversions. 
   
   i.e. it assumes utf-8 but since it does not use the known size, it searches for `nul`: 
   
   ```
   r_string(const std::string& data) : data_(safe[Rf_mkCharCE](data.c_str(), CE_UTF8)) {}
   ```
   
   cc @jimhester, is this on purpose that this constructor uses `Rf_mkCharCE` instead of `Rf_mkCharLenCE` ? 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-704396850


   https://issues.apache.org/jira/browse/ARROW-6582


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-726224318


   @romainfrancois that looks good to me. I'd recommend using `GetView` instead of `GetString` to avoid allocating an unnecessary temporary for non-short strings


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-754056855


   @nealrichardson 1) I'll push an implementation of this 2) unfortunately, unwind_exceptions can't *really* be caught. They are used by cpp11 to get c++ stack unwinding correct but if one is currently in flight then the R runtime has already been informed that `stop()` was called. If the exception is caught and not rethrown, the R stack will continue to unwind (leaking c++ resources). 3) wilco


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org