You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Attila Bukor (Code Review)" <ge...@cloudera.org> on 2019/10/02 23:24:39 UTC

[kudu-CR] [WIP] KUDU-1938 Performance tuning

Hello Adar Dembo, Grant Henke,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/14353

to review the following change.


Change subject: [WIP] KUDU-1938 Performance tuning
......................................................................

[WIP] KUDU-1938 Performance tuning

Before:

Attilas-MacBook-Pro:debug charger$ ./bin/char_util-test
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from CharUtilTest
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3469 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (3574 ms)
[ RUN      ] CharUtilTest.CorrectnessTest
[       OK ] CharUtilTest.CorrectnessTest (1 ms)
[----------] 3 tests from CharUtilTest (7044 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test case ran. (7044 ms total)
[  PASSED  ] 3 tests.

After:

Attilas-MacBook-Pro:debug charger$ ninja char_util-test && ./bin/char_util-test
[8/8] Linking CXX executable bin/char_util-test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3568 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (902 ms)
[----------] 4 tests from CharUtilTest (4472 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (4472 ms total)
[  PASSED  ] 4 tests.

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 410 insertions(+), 10 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#11).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 421 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/11
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Reviewed-on: http://gerrit.cloudera.org:8080/14353
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Kudu Jenkins
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 421 insertions(+), 11 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 12
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [WIP] KUDU-1938 Performance tuning

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#3).

Change subject: [WIP] KUDU-1938 Performance tuning
......................................................................

[WIP] KUDU-1938 Performance tuning

Before:

Attilas-MacBook-Pro:debug charger$ ./bin/char_util-test
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from CharUtilTest
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3469 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (3574 ms)
[ RUN      ] CharUtilTest.CorrectnessTest
[       OK ] CharUtilTest.CorrectnessTest (1 ms)
[----------] 3 tests from CharUtilTest (7044 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test case ran. (7044 ms total)
[  PASSED  ] 3 tests.

After:

Attilas-MacBook-Pro:debug charger$ ninja char_util-test && ./bin/char_util-test
[8/8] Linking CXX executable bin/char_util-test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3568 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (902 ms)
[----------] 4 tests from CharUtilTest (4472 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (4472 ms total)
[  PASSED  ] 4 tests.

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 410 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has removed a vote on this change.

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#8).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 425 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/8
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 6: Verified+1

second unrelated random failure broke the Jenkins build, not retriggering.


-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 04 Oct 2019 15:35:30 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#4).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3026 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (2888 ms)
[----------] 4 tests from CharUtilTest (5916 ms total)

After:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (2 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (0 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3235 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (290 ms)
[----------] 4 tests from CharUtilTest (3527 ms total)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 417 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/4
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [WIP] KUDU-1938 Performance tuning

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#2).

Change subject: [WIP] KUDU-1938 Performance tuning
......................................................................

[WIP] KUDU-1938 Performance tuning

Before:

Attilas-MacBook-Pro:debug charger$ ./bin/char_util-test
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from CharUtilTest
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3469 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (3574 ms)
[ RUN      ] CharUtilTest.CorrectnessTest
[       OK ] CharUtilTest.CorrectnessTest (1 ms)
[----------] 3 tests from CharUtilTest (7044 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test case ran. (7044 ms total)
[  PASSED  ] 3 tests.

After:

Attilas-MacBook-Pro:debug charger$ ninja char_util-test && ./bin/char_util-test
[8/8] Linking CXX executable bin/char_util-test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3568 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (902 ms)
[----------] 4 tests from CharUtilTest (4472 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (4472 ms total)
[  PASSED  ] 4 tests.

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 410 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 11:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@39
PS9, Line 39: 
> KuduTest has an env_ member.
Done


http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@82
PS9, Line 82:   ASSERT_EQ(9756, result.size());
> Nit: indentation. And wrapping.
Done


http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@87
PS9, Line 87: TEST_F(CharUtilTest, CorrectnessTestIncompleteUtf8) {
            :   Slice result;
            :   Slice test_data = "aaaa
> Combine?
Done


http://gerrit.cloudera.org:8080/#/c/14353/10/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/10/src/kudu/util/char_util-test.cc@50
PS10, Line 50: 
> using
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@32
PS7, Line 32:   while (num_bytes < size) {
> Oh so 8-byte alignment isn't important? Interesting.
yeah I thought it was, but based on the benchmarks it was actually faster without the alignment.



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 03 Dec 2019 22:31:17 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#5).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3026 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (2888 ms)
[----------] 4 tests from CharUtilTest (5916 ms total)

After:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (2 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (0 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3235 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (290 ms)
[----------] 4 tests from CharUtilTest (3527 ms total)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 419 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/5
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 11: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 03 Dec 2019 22:36:52 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 10:

(5 comments)

Sorry, I had reviewed this a week or two ago but forgot to publish.

http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@39
PS9, Line 39:     Env* env = Env::Default();
KuduTest has an env_ member.


http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@82
PS9, Line 82: ptr = Truncate(data_ascii_, 10549, &result); ASSERT_EQ(10549, result.size());
Nit: indentation. And wrapping.


http://gerrit.cloudera.org:8080/#/c/14353/9/src/kudu/util/char_util-test.cc@87
PS9, Line 87:   Slice test_data;
            : 
            :   test_data = "aaaa\xf3";
Combine?


http://gerrit.cloudera.org:8080/#/c/14353/10/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/10/src/kudu/util/char_util-test.cc@50
PS10, Line 50: std::
using


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@32
PS7, Line 32:   while (num_bytes < size) {
> Done
Oh so 8-byte alignment isn't important? Interesting.



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 03 Dec 2019 18:22:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] [WIP] KUDU-1938 Performance tuning

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: [WIP] KUDU-1938 Performance tuning
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14353/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14353/1//COMMIT_MSG@41
PS1, Line 41: [       OK ] CharUtilTest.StressTestAscii (902 ms)
It looks like this patch makes ascii handling ~3x faster and utf8 is the same?


http://gerrit.cloudera.org:8080/#/c/14353/1/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/1/src/kudu/util/char_util.cc@24
PS1, Line 24: Slice UTF8Truncate(Slice val, size_t max_utf8_length) {
Was this copied from a reference implementation somewhere or written from scratch?



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 03 Oct 2019 01:40:53 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#6).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (1 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (1 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3026 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (2888 ms)
[----------] 4 tests from CharUtilTest (5916 ms total)

After:

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CharUtilTest
[ RUN      ] CharUtilTest.CorrectnessTestUtf8
[       OK ] CharUtilTest.CorrectnessTestUtf8 (2 ms)
[ RUN      ] CharUtilTest.CorrectnessTestAscii
[       OK ] CharUtilTest.CorrectnessTestAscii (0 ms)
[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (3235 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (290 ms)
[----------] 4 tests from CharUtilTest (3527 ms total)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 431 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/6
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14353/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14353/7//COMMIT_MSG@33
PS7, Line 33: 
> I presume these measurements were taken from char_util.cc compiled in RELEA
They were DEBUG, changed them to RELEASE and rewritten the code a bit.



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 11 Nov 2019 19:41:29 +0000
Gerrit-HasComments: Yes

[kudu-CR] [WIP] KUDU-1938 Performance tuning

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: [WIP] KUDU-1938 Performance tuning
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14353/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14353/1//COMMIT_MSG@41
PS1, Line 41: [       OK ] CharUtilTest.StressTestAscii (902 ms)
> It looks like this patch makes ascii handling ~3x faster and utf8 is the sa
yep, basically it adds a fast-path for chunks that contain only ASCII characters in which case we don't need to examine each byte individually.


http://gerrit.cloudera.org:8080/#/c/14353/1/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/1/src/kudu/util/char_util.cc@24
PS1, Line 24: Slice UTF8Truncate(Slice val, size_t max_utf8_length) {
> Was this copied from a reference implementation somewhere or written from s
This was written from scratch. Todd's comment on https://gerrit.cloudera.org/c/13928/3//COMMIT_MSG#17 gave me the idea, then I brainstormed with two friends (Istvan Farmosi & Zoltan Chovan) and we came up with this optimization.



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 03 Oct 2019 06:59:00 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 7:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14353/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14353/7//COMMIT_MSG@33
PS7, Line 33: After:
I presume these measurements were taken from char_util.cc compiled in RELEASE mode?


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@37
PS7, Line 37:   const int kNumCycles = 100000;
This can be declared private.


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@44
PS7, Line 44:   void ReadFile(const std::string& file_name, Slice* slice, std::string* string) {
You can reuse ReadFileToString (defined in env.h).


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@68
PS7, Line 68:   ASSERT_EQ(10549, result.size());
If this fails, the delete won't be called. Instead of explicitly deleting, set up a unique_ptr around the memory you want freed right after the call to UTF8Truncate.

Same goes for the other tests here. Probably best to wrap the call to UTF8Truncate in something that calls it and also returns a unique_ptr.


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@32
PS7, Line 32:   while (num_bytes < size) {
Should the two > conditions below be >=?

I'd also doc the net effect of the four conditions: if the next 8 bytes (8-byte aligned) are all ASCII characters, we can fast path them.


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@33
PS7, Line 33:     if (num_bytes % 8 == 0
            :         && size - num_bytes > 8
            :         && max_utf8_length - num_utf8_chars > 8
            :         && (*(reinterpret_cast<const int64_t*>(str)) & 0x8080808080808080) == 0) {
Nit: move && to the end of the previous line.


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@43
PS7, Line 43:           num_bytes++;
Nit: add FALLTHROUGH_INTENDED to the end of each non-empty case where we fall through to the next one.



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Sat, 05 Oct 2019 00:55:39 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................


Patch Set 7:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc
File src/kudu/util/char_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@37
PS7, Line 37:   const int kNumCycles = 100000;
> This can be declared private.
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@44
PS7, Line 44:   void ReadFile(const std::string& file_name, Slice* slice, std::string* string) {
> You can reuse ReadFileToString (defined in env.h).
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util-test.cc@68
PS7, Line 68:   ASSERT_EQ(10549, result.size());
> If this fails, the delete won't be called. Instead of explicitly deleting, 
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@32
PS7, Line 32:   while (num_bytes < size) {
> Should the two > conditions below be >=?
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@33
PS7, Line 33:     if (num_bytes % 8 == 0
            :         && size - num_bytes > 8
            :         && max_utf8_length - num_utf8_chars > 8
            :         && (*(reinterpret_cast<const int64_t*>(str)) & 0x8080808080808080) == 0) {
> Nit: move && to the end of the previous line.
Done


http://gerrit.cloudera.org:8080/#/c/14353/7/src/kudu/util/char_util.cc@43
PS7, Line 43:           num_bytes++;
> Nit: add FALLTHROUGH_INTENDED to the end of each non-empty case where we fa
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 11 Nov 2019 19:41:03 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#9).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 420 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/9
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 9
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-1938 Make UTF-8 truncation faster pt 1

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14353

to look at the new patch set (#10).

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings where if the MSB is a
0-bit on each byte in a chunk of string it advances the counter and the
iterator by the chunk size. This way if a chunk contains only ASCII
characters there's no need to count each individual character.

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 421 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/14353/10
-- 
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)