Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/15 17:23:34 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #8470: ARROW-10313: [C++] Faster UTF8 validation for small strings

bkietz commented on a change in pull request #8470:
URL: https://github.com/apache/arrow/pull/8470#discussion_r505708835



##########
File path: cpp/src/arrow/util/utf8.h
##########
@@ -154,13 +156,52 @@ inline bool ValidateUTF8(const uint8_t* data, int64_t size) {
     return false;
   }
 
-  // Validate string tail one byte at a time
+  // Check if string tail is full ASCII (common case, fast)
+  if (size >= 4) {
+    uint32_t mask1, mask2;
+    memcpy(&mask2, data + size - 4, 4);
+    memcpy(&mask1, data, 4);
+    if (ARROW_PREDICT_TRUE(((mask1 | mask2) & high_bits_32) == 0)) {
+      return true;
+    }

Review comment:
       ```suggestion
       uint32_t head_mask = internal::SafeLoadAs<uint32_t>(data);
       uint32_t tail_mask = internal::SafeLoadAs<uint32_t>(data + size - 4);
       if (ARROW_PREDICT_TRUE(((head_mask | tail_mask) & high_bits_32) == 0)) {
         return true;
       }
       ```
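
For illustration, the head/tail check in the suggestion can be sketched as a standalone function. This is a minimal sketch, not the code in the PR: `LoadUnaligned` is a hypothetical stand-in for `internal::SafeLoadAs` (assumed here to be a memcpy-based unaligned load), `kHighBits32` stands in for `high_bits_32` (assumed to be `0x80808080`), and `IsAsciiShort` is a hypothetical name. It also assumes that at most 8 bytes remain at this point, which the hunk above does not show.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for internal::SafeLoadAs<T>: reads a T from a
// possibly unaligned address through memcpy, avoiding undefined behavior.
template <typename T>
T LoadUnaligned(const uint8_t* p) {
  T value;
  std::memcpy(&value, p, sizeof(T));
  return value;
}

// Assumed value of high_bits_32: the top bit of each of the four bytes.
constexpr uint32_t kHighBits32 = 0x80808080U;

// Returns true when a string of 4 to 8 bytes contains only ASCII bytes.
// The head and tail words overlap when size < 8; the overlap is harmless
// because the two words are only OR-ed together.
inline bool IsAsciiShort(const uint8_t* data, int64_t size) {
  assert(size >= 4 && size <= 8);  // precondition assumed by this sketch
  const uint32_t head_mask = LoadUnaligned<uint32_t>(data);
  const uint32_t tail_mask = LoadUnaligned<uint32_t>(data + size - 4);
  return ((head_mask | tail_mask) & kHighBits32) == 0;
}
```

Naming the two words after the bytes they cover (head, tail) and loading each in a single expression keeps the intent visible without separate declarations and `memcpy` calls.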



