Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/11 10:37:28 UTC

[GitHub] [arrow] maartenbreddels commented on pull request #8621: ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

maartenbreddels commented on pull request #8621:
URL: https://github.com/apache/arrow/pull/8621#issuecomment-725345905


   The `std::vector<bool>` was a good idea. Because it packs its elements into bits, the memory usage for Unicode isn't that heavy: in the most extreme case, a contiguous array covering every codepoint needs `0x10FFFF` bits, which is about 139 kB (136 KiB).
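
   To make the idea concrete, here is a minimal sketch of a codepoint lookup table built on `std::vector<bool>`. It is not the code in this PR: `MakeTrimTable`, `ShouldTrim`, and `Trim` are hypothetical names, and the input is assumed to be already decoded into codepoints (`std::u32string`) so the sketch can skip UTF-8 decoding. The table is sized to the largest codepoint in the trim set, so it only approaches the 0x10FFFF-bit worst case when a very high codepoint is present.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <vector>

   // Highest valid Unicode codepoint.
   constexpr uint32_t kMaxCodepoint = 0x10FFFF;

   // Build a table with one entry per codepoint: table[cp] is true when cp
   // should be trimmed. Common implementations of std::vector<bool> pack the
   // entries into bits, so the table takes roughly (max codepoint + 1) / 8 bytes.
   std::vector<bool> MakeTrimTable(const std::u32string& trim_chars) {
     uint32_t max_cp = 0;
     for (char32_t cp : trim_chars) {
       assert(cp <= kMaxCodepoint);
       if (cp > max_cp) max_cp = cp;
     }
     std::vector<bool> table(static_cast<size_t>(max_cp) + 1, false);
     for (char32_t cp : trim_chars) table[cp] = true;
     return table;
   }

   // Codepoints beyond the end of the table are never trimmed.
   inline bool ShouldTrim(const std::vector<bool>& table, char32_t cp) {
     return cp < table.size() && table[cp];
   }

   // Strip trim characters from both ends of an already-decoded string.
   std::u32string Trim(const std::u32string& input,
                       const std::vector<bool>& table) {
     size_t begin = 0;
     size_t end = input.size();
     while (begin < end && ShouldTrim(table, input[begin])) ++begin;
     while (end > begin && ShouldTrim(table, input[end - 1])) --end;
     return input.substr(begin, end - begin);
   }
   ```

   Each lookup is a bounds check plus a single bit test, with no hashing or tree traversal involved.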
   
   Benchmarks:
   ```
   set:
   TrimManyAscii_median   28346892 ns   28345125 ns         25   558.956MB/s   35.2794M items/s
   TrimManyUtf8_median    28302644 ns   28294883 ns         25   559.949MB/s   35.3421M items/s
   
   unordered_set:
   TrimManyAscii_median   32017530 ns   32014024 ns         22   494.898MB/s   31.2363M items/s
   TrimManyUtf8_median (not run)
   
   vector<bool>:
   TrimManyAscii_median   14911543 ns   14910620 ns         47   1062.58MB/s   67.0663M items/s
   TrimManyUtf8_median    16148001 ns   16146053 ns         44   981.273MB/s   61.9346M items/s
   
   bitset<256>:
   TrimManyAscii_median   14304925 ns   14304010 ns         49   1107.64MB/s   69.9105M items/s
   ```
   
   
   I think `vector<bool>` is good enough. The bitset is consistently faster (about 5%), but I'd rather have similar code for both solutions.
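
   For illustration only, a byte-level trim can be written once and used with either container, since both `std::bitset<256>` and `std::vector<bool>` provide `size()` and `operator[]`. This is a sketch under that assumption, not the kernel code from this PR; `MarkBytes` and `TrimBytes` are hypothetical names.

   ```cpp
   #include <bitset>
   #include <cassert>
   #include <cstddef>
   #include <string>
   #include <vector>

   // Mark every byte in `chars` as trimmable. The table must already be large
   // enough: a bitset<256> always is, a vector<bool> has to be pre-sized.
   template <typename Table>
   void MarkBytes(Table& table, const std::string& chars) {
     for (char c : chars) {
       size_t idx = static_cast<unsigned char>(c);
       assert(idx < table.size());
       table[idx] = true;
     }
   }

   // Strip marked bytes from both ends. Works with any table type that
   // offers size() and operator[] returning something convertible to bool.
   template <typename Table>
   std::string TrimBytes(const std::string& input, const Table& table) {
     auto should_trim = [&table](char c) {
       size_t idx = static_cast<unsigned char>(c);
       return idx < table.size() && table[idx];
     };
     size_t begin = 0;
     size_t end = input.size();
     while (begin < end && should_trim(input[begin])) ++begin;
     while (end > begin && should_trim(input[end - 1])) --end;
     return input.substr(begin, end - begin);
   }
   ```

   With a fixed-size bitset the bounds check is against a compile-time constant, which may account for part of the small speed difference, though the numbers above don't establish that.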
   
   

