Posted to reviews@kudu.apache.org by "Attila Bukor (Code Review)" <ge...@cloudera.org> on 2019/11/11 19:40:52 UTC

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#8).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings. If the MSB of every byte
in a chunk of the string is 0, the chunk contains only ASCII characters,
so the character counter and the iterator both advance by the chunk size
and there is no need to count each character individually.
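
The idea above can be sketched roughly as follows. This is a minimal
illustration, not the code in char_util.cc: the function name
Utf8PrefixBytes, the 8-byte chunk size, and the assumption that the input
is valid UTF-8 are all hypothetical choices made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not Kudu's actual function): returns the number of
// bytes occupied by the first `max_chars` UTF-8 characters of `s`, or
// s.size() if `s` holds fewer characters. Assumes `s` is valid UTF-8.
size_t Utf8PrefixBytes(const std::string& s, size_t max_chars) {
  // One high bit per byte of a 64-bit word; the 8-byte chunk is an assumption.
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  constexpr size_t kChunk = sizeof(uint64_t);

  const char* p = s.data();
  const char* end = p + s.size();
  size_t chars = 0;

  while (p < end && chars < max_chars) {
    // Fast path: only taken when a full chunk of input remains and the
    // whole chunk still fits within the character limit.
    if (static_cast<size_t>(end - p) >= kChunk && max_chars - chars >= kChunk) {
      uint64_t word;
      std::memcpy(&word, p, kChunk);  // avoids unaligned-read issues
      if ((word & kHighBits) == 0) {
        // Every byte has a 0 MSB, so the chunk is kChunk ASCII characters.
        p += kChunk;
        chars += kChunk;
        continue;
      }
    }
    // Slow path: step over one character, i.e. its lead byte plus any
    // continuation bytes of the form 10xxxxxx.
    ++p;
    while (p < end && (static_cast<unsigned char>(*p) & 0xC0) == 0x80) {
      ++p;
    }
    ++chars;
  }
  return static_cast<size_t>(p - s.data());
}
```

The returned byte count can be used to cut the string at a character
boundary, e.g. s.substr(0, Utf8PrefixBytes(s, max_chars)). A chunk that
contains any non-ASCII byte falls through to the per-character loop, so
mixed input pays for the extra check without benefiting from it.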

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 425 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/8
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)