You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/15 09:29:12 UTC

[GitHub] [arrow] maartenbreddels opened a new pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

maartenbreddels opened a new pull request #7434:
URL: https://github.com/apache/arrow/pull/7434


   Following up on #7418 I tried and benchmarked a different way for
    * ascii_lower
    * ascii_upper
   
   Before (lower is similar):
   ```
   --------------------------------------------------
   Benchmark           Time           CPU Iterations
   --------------------------------------------------
   AsciiUpper_median    4922843 ns      4918961 ns           10 bytes_per_second=3.1457G/s items_per_second=213.17M/s
   ```
   
   After:
   ```
   --------------------------------------------------
   Benchmark           Time           CPU Iterations
   --------------------------------------------------
   AsciiUpper_median    1391272 ns      1390014 ns           10 bytes_per_second=11.132G/s items_per_second=754.363M/s
   
   ```
   
   This is a 3.7x speedup (on a AMD machine).
   
   Using http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU I found 4.6x speedup for clang 9, 6.4x for GCC 9.2.
   
   Also, the test is expanded a bit to include a non-ascii codepoint, to make explicit it is fine to upper
   or lower case a utf8 string. The non-overlap encoding of utf8 make this ok (see section 2.5 of Unicode
   Standard Core Specification v13.0).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644149831


   > PS2: Are there alternative channels for quick/small questions, or is this fine?
   
   Can we use dev@arrow.apache.org so everyone can see the discussion and it's searchable via Google later?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644104830


   Travis-CI failure is unrelated.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644142377


   Thanks, let me know if my workflow is ok, or if I can make things go smoother.
   
   PS: I am looking for a document describing the kernel design. I see these two cases (`if (batch[0].kind() == Datum::ARRAY) {` and the else clause, but I am not sure I fully understand this. But I'm not sure where this is described, if it is.
   
   PS2: Are there alternative channels for quick/small questions, or is this fine?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644144990


   I think the current approach (implement kernels one-by-one) is reasonable and manageable for us (and for you as well I hope).
   
   I don't think there's much documentation for now around the kernel design. Basically, kernels should usually be able to process two kinds of inputs (represented as `Datum`s): arrays and scalars. That said, arrays are the dominant case, so if we leave scalars unimplemented in a given kernel, that's not an urgent problem.
   
   For quick development questions, we have a public chat instance at https://ursalabs.zulipchat.com/ - just register though and you can chat with the team. The main channel there is the "dev" channel, and Zulip allows you to create subtopics - don't hesitate to use those!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644075652


   I also thought that we could do a bit check instead of the range check, e.g. `code_unit & 0b11100000) == 0b01100000`, but that would also transform the backtick for instance (binary value 0b1100000).
   
   The generated code looks vectorized indeed. I didn't look into the details of the generated code by clang and GCC, it seems their performance is a bit different, so we might be able to squeeze out a bit more if we want. Happy to look into that later (create a new issue), but I rather spend my time on other functions now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644084984


   It's ok, there's no need to further optimize those functions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644146353


   > I think the current approach (implement kernels one-by-one) is reasonable and manageable for us (and for you as well I hope).
   
   No, this is fine.
   
   
   
   > That said, arrays are the dominant case, so if we leave scalars unimplemented in a given kernel, that's not an urgent problem.
   
   👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644158423


   It shouldn't be a problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644017246


   https://issues.apache.org/jira/browse/ARROW-9131


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644153016


   > Can we use [dev@arrow.apache.org](mailto:dev@arrow.apache.org) so everyone can see the discussion and it's searchable via Google later?
   
   I'm ok with that if not considered too noisy (like small cmake questions etc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm edited a comment on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644149831


   > PS2: Are there alternative channels for quick/small questions, or is this fine?
   
   Can we use dev@arrow.apache.org so everyone can see the discussion and it's searchable via Google later?
   
   > PS: I am looking for a document describing the kernel design.
   
   Keep in mind that the new kernels framework is only a few weeks old (https://github.com/apache/arrow/commit/7ad49eeca5215d9b2a56b6439f1bd6ea38888ea9), so developer documentation is a bit behind, so don't hesitate to ask questions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7434:
URL: https://github.com/apache/arrow/pull/7434


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#issuecomment-644106486


   Thank you @maartenbreddels !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7434: ARROW-9131: [C++] Faster ascii_lower and ascii_upper.

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7434:
URL: https://github.com/apache/arrow/pull/7434#discussion_r440095501



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -64,77 +64,30 @@ void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
   }
 }
 
-// Generated with
-//
-// print("static constexpr uint8_t kAsciiUpperTable[] = {")
-// for i in range(256):
-//     if i > 0: print(', ', end='')
-//     if i >= ord('a') and i <= ord('z'):
-//         print(i - 32, end='')
-//     else:
-//         print(i, end='')
-// print("};")
-
-static constexpr uint8_t kAsciiUpperTable[] = {
-    0,   1,   2,   3,   4,   5,   6,   7,   8,   9,   10,  11,  12,  13,  14,  15,
-    16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
-    32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,
-    48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
-    64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
-    80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,
-    96,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
-    80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  123, 124, 125, 126, 127,
-    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
-    144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
-    160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
-    176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
-    192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
-    208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
-    224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
-    240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255};
-
 void TransformAsciiUpper(const uint8_t* input, int64_t length, uint8_t* output) {
   for (int64_t i = 0; i < length; ++i) {
-    *output++ = kAsciiUpperTable[*input++];
+    const uint8_t utf8_code_unit = *input++;
+    // Code units in the range [a-z] can only be an encoding of an ascii
+    // character/codepoint, not the 2nd, 3rd or 4th code unit (byte) of an different
+    // codepoint. This guaranteed by non-overal design of the unicode standard. (see

Review comment:
       "non-overlap"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org