You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/28 15:10:16 UTC

[GitHub] [arrow] romainfrancois edited a comment on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

romainfrancois edited a comment on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-717996007


   if what we want is that the nul is kept, based on @bkietz comment from https://github.com/apache/arrow/pull/8536 `Converter_String::Ingest_some_nulls()` could end with: 
   
   ```cpp
       cpp11::unwind_protect([&] {
         if (array->null_count()) {
           // need to watch for nulls
           arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                     array->offset(), n);
           for (int i = 0; i < n; i++, null_reader.Next()) {
             if (null_reader.IsSet()) {
               SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
             } else {
               SET_STRING_ELT(data, start + i, NA_STRING);
             }
           }
   
         } else {
           for (int i = 0; i < n; i++) {
             SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
           }
         }
       });
   ```
   
   with: 
   
   ```cpp
   private:
     SEXP unsafe_r_string(const std::string& s) const {
       return Rf_mkCharLenCE(s.c_str(), s.size(), CE_UTF8);
     }
   ```
   
   this builds on knowing that `GetString()` gives an utf8 string, and that we know its size, as opposed to what `cpp11::r_string(std::string)` does here: https://github.com/r-lib/cpp11/blob/master/inst/include/cpp11/r_string.hpp#L18 and uses unwind protection around the loop instead of the individual conversions. 
   
   i.e. it assumes utf-8 but since it does not use the known size, it searches for `nul`: 
   
   ```
   r_string(const std::string& data) : data_(safe[Rf_mkCharCE](data.c_str(), CE_UTF8)) {}
   ```
   
   cc @jimhester, is this on purpose that this constructor uses `Rf_mkCharCE` instead of `Rf_mkCharLenCE` ? 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org