You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2022/11/15 03:39:23 UTC

[GitHub] [incubator-pegasus] empiredan commented on pull request #1240: refactor: use state machine to improve split_args for strings

empiredan commented on PR #1240:
URL: https://github.com/apache/incubator-pegasus/pull/1240#issuecomment-1314721615

   > I believe that we shouldn't invest too much effort in writing commonly-used utilities on our own. Instead, we should find if there's a trustworthy library providing this functionality, such as https://abseil.io/docs/cpp/guides/strings#abslstrsplit-for-splitting-strings or https://www.boost.org/doc/libs/1_80_0/doc/html/string_algo/usage.html.
   > 
   > I feel that C++ coders rarely consider relying on the community. Programmers using other languages would never think of writing their string split unless there are some special requirements. Therefore, if possible, please try out the above two libraries and see if they can address your problem.
   
   Good advice ! I strongly agree that we should use mature libraries for basic tools. And initially I was going to use [boost split](https://www.boost.org/doc/libs/1_69_0/doc/html/boost/algorithm/split.html) since we'd introduced boost 1.69.0 as our third-party package. However, according to the doc boost only supports outputting tokens to sequence containers such as `std::vector` or `std::list` even in newest version 1.80.0. Actually tokens are also extracted to associative containers such as `unordered_set` which has already been supported by old `split_args`.
   
   According to [str_split_test.cc](https://github.com/abseil/abseil-cpp/blob/master/absl/strings/str_split_test.cc), it seems that abseil has supported all types of containers. However, there is some special logic for `split_args`: for example, after split for each time, old `split_args` will strip the leading and trailing spaces; moreover, the characters belonging to leading spaces are not the same with those belonging to trailing spaces though I wonder why this happened and which code has depended on this.
   
   I think by [StripAsciiWhitespace](https://github.com/abseil/abseil-cpp/blob/master/absl/strings/ascii.h) each split can be stripped. However, this will erase the string while new `split_args` will just skip leading and trailing spaces of tokens after split without extra stripping operations. Also, the reason why inequality of characters of leading and trailing spaces should be checked. Abseil uses `absl::ascii_isspace` to check both leading and trailing spaces. Even if striping can be implemented by ourselves, it will have to do extra operations and another copy.
   
   To sum up, I think we can just `split_args` firstly. I believe the common library will function nevertheless. However, it will cost some time to verify. I'll create other new issues to track. This is an important job that will be supported for long term.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org