You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 22:14:43 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

nealrichardson opened a new pull request #8536:
URL: https://github.com/apache/arrow/pull/8536


   @bkietz and I identified the performance regression I observed today, and by more-or-less reverting https://github.com/apache/arrow/commit/55defbf7a14896dc1a166f04f62f576179b4a1d5#diff-090c5cff4eadd62a121e85babd186c9838055d2757670971204817eb2d96211aR297-R313 (Ben suggested renaming the function and calling GetView instead of GetString), the performance regression went away.
   
   I haven't investigated why the `cpp11::r_string` way is so significantly slower but it's worth exploring. We should also make sure we aren't inefficiently calling that elsewhere.
   
   cc @kszucs 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#issuecomment-717308032


   > `cpp11::r_string(const std::string&)` does not take advantage of knowing the length of the string (so the terminating null must be searched for),
   
   Side note: it's not always safe to assume nuls are terminating, and R and Rcpp don't assume that. cf. #8365


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #8536:
URL: https://github.com/apache/arrow/pull/8536


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#issuecomment-717333540


   > Side note: it's not always safe to assume nuls are terminating, and R and Rcpp don't assume that. cf. #8365
   
   Indeed, with this change, the test on that PR now errors again rather than silently truncating strings.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#issuecomment-717286906


   The only other usage of `cpp11::r_string` which I see that might be problematic is in [array_from_vector](https://github.com/apache/arrow/blob/3b461625505c621557c28226d5260b1d0c55e4fb/r/src/array_from_vector.cpp#L178) where it's used to translate each string element to utf8. It appears that code path is only taken for string (and binary) arrays which are nested inside other arrays as a dictionary, list item column, ... Repairing this can probably wait for a follow up in which we apply the framework introduced in #8088 to clean up conversion for R. @kszucs 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#issuecomment-716858889


   https://issues.apache.org/jira/browse/ARROW-10399


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#discussion_r512717837



##########
File path: r/src/array_to_vector.cpp
##########
@@ -294,22 +294,25 @@ struct Converter_String : public Converter {
                                                 array->offset(), n);
       for (int i = 0; i < n; i++, null_reader.Next()) {

Review comment:
       Please wrap these loops in a call to `unwind_protect`:
   ```suggestion
         cpp11::unwind_protect([&] {
           for (int i = 0; i < n; i++, null_reader.Next()) {
   ```

##########
File path: r/src/array_to_vector.cpp
##########
@@ -294,22 +294,25 @@ struct Converter_String : public Converter {
                                                 array->offset(), n);
       for (int i = 0; i < n; i++, null_reader.Next()) {
         if (null_reader.IsSet()) {
-          SET_STRING_ELT(data, start + i, cpp11::r_string(string_array->GetString(i)));
+          SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
         } else {
           SET_STRING_ELT(data, start + i, NA_STRING);
         }
       }
 
     } else {
       for (int i = 0; i < n; i++) {
-        SET_STRING_ELT(data, start + i, cpp11::r_string(string_array->GetString(i)));
+        SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i)));
       }
     }
 
     return Status::OK();
   }
 
   bool Parallel() const { return false; }
+  inline SEXP r_string_from_view(const arrow::util::string_view& view) const {

Review comment:
       This method doesn't need to be explicitly marked `inline` or exposed in the public interface, so let's make it private and static. `string_view`s are TriviallyCopyable so it's unnecessary to declare them const reference
   ```suggestion
    private:
     static SEXP r_string_from_view(arrow::util::string_view view) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8536: ARROW-10399: [R] Fix performance regression from cpp11::r_string

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8536:
URL: https://github.com/apache/arrow/pull/8536#issuecomment-717315905


   Confirmed that Ben's suggestions don't hurt the performance that was recovered.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org