Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2020/04/24 23:59:09 UTC

[kudu-CR] row serialization: dedupe string columns

Todd Lipcon has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15806 )


Change subject: row_serialization: dedupe string columns
......................................................................

row_serialization: dedupe string columns

This adds a small hashtable which avoids serializing multiple copies of
dictionary-encoded strings to the wire. This saves a lot of CPU on
memcpying the string values into the indirect data buffer, and saves
further CPU (and sometimes network) when the result is sent to the client.

For clients that will then perform aggregation or filtering based on
this column, it's likely to save even more downstream CPU by fitting the
values in a smaller amount of memory and thus being more
CPU-cache-efficient.

The hashtable is "lossy" -- on collision, the old entry is evicted and
the new entry is added. This means that it doesn't provide 100% dedupe,
but is very fast.

Along the same lines, it uses CRC32 as the hash function. CRC32 is not a
very good hash function, but it's very fast, so it's "good enough" when
the goal is mostly to dedupe low-cardinality strings.
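
The lossy table and the CRC32 hashing described above can be sketched
roughly as follows. This is an illustration under stated assumptions, not
the code from this change: it assumes the table is keyed on the source
pointer of each string (identical dictionary entries share one pointer)
and stores the offset already written to the indirect data buffer. The
names LossyOffsetCache, Cell and SerializeColumn, the 64-slot size, and
the portable bitwise CRC32C are all hypothetical; AddOrGetExisting borrows
the signature quoted in the review, with assumed semantics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Portable bitwise CRC32C over the 8 bytes of a 64-bit value.
// Assumption: stands in for whatever fast CRC32 routine the real code uses.
inline uint32_t Crc32cU64(uint64_t v) {
  uint32_t crc = 0xFFFFFFFFu;
  for (int i = 0; i < 8; ++i) {
    crc ^= static_cast<uint8_t>(v >> (8 * i));
    for (int b = 0; b < 8; ++b) {
      // Reflected CRC32C polynomial; the mask is all ones when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical lossy cache: source pointer -> offset in the output buffer.
// Each slot holds one entry; a colliding insert simply overwrites it.
class LossyOffsetCache {
 public:
  // Returns {true, off} if 'p' was not cached (the caller must copy the
  // string to offset 'off'), or {false, cached_offset} on a hit.
  // Assumes 'p' is never null, since null marks an empty slot.
  std::pair<bool, size_t> AddOrGetExisting(const uint8_t* p, size_t off) {
    const uint64_t key = reinterpret_cast<uintptr_t>(p);
    Entry& e = slots_[Crc32cU64(key) % kNumSlots];
    if (e.ptr == p) return {false, e.off};
    e.ptr = p;  // Evict whatever was in the slot.
    e.off = off;
    return {true, off};
  }

 private:
  static constexpr size_t kNumSlots = 64;
  struct Entry {
    const uint8_t* ptr = nullptr;
    size_t off = 0;
  };
  std::array<Entry, kNumSlots> slots_{};
};

// Hypothetical view of one string cell in the source data.
struct Cell {
  const uint8_t* data;
  size_t size;
};

// Copies each distinct (by pointer) cell into 'indirect' at most once per
// cache residency and returns an (offset, length) pair for every cell.
std::vector<std::pair<size_t, size_t>> SerializeColumn(
    const std::vector<Cell>& cells, std::string* indirect) {
  LossyOffsetCache cache;
  std::vector<std::pair<size_t, size_t>> out;
  out.reserve(cells.size());
  for (const Cell& c : cells) {
    const auto r = cache.AddOrGetExisting(c.data, indirect->size());
    if (r.first) {
      indirect->append(reinterpret_cast<const char*>(c.data), c.size);
    }
    out.emplace_back(r.second, c.size);
  }
  return out;
}
```

Because eviction only costs a redundant copy, a collision never produces
a wrong offset: every returned (offset, length) pair still points at the
right bytes, just not always at a shared copy.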

I tested this by scanning a lineitem table while running the tserver
under perf stat, as in other recent changes (e.g. KUDU-2844), and
measuring the number of cycles and CPU seconds.

The first test scans the l_shipmode column. This column has only 7
distinct values, so deduplication should be very effective:

Before:
   250,368,757,568      cycles
      78.681321000 seconds user
      15.131539000 seconds sys

After:
   141,539,844,532      cycles
      41.275169000 seconds user
      12.391870000 seconds sys

The results here show a 1.77x reduction in total cycles, a 1.9x
reduction in user CPU, and a 1.22x reduction in system CPU. System CPU
is reduced because less data has to be sent over the wire.

The second test scans the l_shipdate column. This column has ~2500
unique values, so it is unlikely to see any significant deduplication in
the context of serializing a 100-row scan batch. This models a rough
worst case for the overhead of checking for duplicates when none are
found:

Before:
   189,745,098,880      cycles
      51.212566000 seconds user
      21.410176000 seconds sys

After:
   206,022,470,215      cycles
      57.642555000 seconds user
      20.969642000 seconds sys

As expected, there is some overhead here: 1.09x more cycles, 1.13x more
user CPU.

I think this tradeoff is worth it -- if a query is aggregating over a
low-cardinality column, the scan is probably a larger percentage of the
overall query resource consumption, and we should try to make it as fast
as possible. If the query is aggregating over a high-cardinality column
(e.g. a count(distinct) or a big join), then the downstream operators are
likely to be expensive, and a 10% hit on the scan speed shouldn't be a
significant slowdown on the query.

Looking at the generated assembly, I think we can also cut down the
overhead a little more through some micro-optimization followup (e.g.
unrolling).

Change-Id: I056cd57b9eea1d6988b8ab053d86a3493361dc6b
---
M src/kudu/common/row_serialization.cc
1 file changed, 93 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/06/15806/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I056cd57b9eea1d6988b8ab053d86a3493361dc6b
Gerrit-Change-Number: 15806
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>

[kudu-CR] row serialization: dedupe string columns

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15806 )

Change subject: row_serialization: dedupe string columns
......................................................................


Patch Set 1: Code-Review+1

(2 comments)

Not sure what's up with the build failures though.

http://gerrit.cloudera.org:8080/#/c/15806/1/src/kudu/common/row_serialization.cc
File src/kudu/common/row_serialization.cc:

http://gerrit.cloudera.org:8080/#/c/15806/1/src/kudu/common/row_serialization.cc@199
PS1, Line 199:   pair<bool, size_t> AddOrGetExisting(const uint8_t* p, size_t off) {
nit: add docs. Mentioning the units of the keys and values would be nice too.


http://gerrit.cloudera.org:8080/#/c/15806/1/src/kudu/common/row_serialization.cc@272
PS1, Line 272: ins_pair
nit: `missed_and_offset` or somesuch would be more self-documenting




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I056cd57b9eea1d6988b8ab053d86a3493361dc6b
Gerrit-Change-Number: 15806
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 May 2020 06:28:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] row serialization: dedupe string columns

Posted by "Volodymyr Verovkin (Code Review)" <ge...@cloudera.org>.
Volodymyr Verovkin has posted comments on this change. ( http://gerrit.cloudera.org:8080/15806 )

Change subject: row_serialization: dedupe string columns
......................................................................


Patch Set 1: Code-Review+1



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I056cd57b9eea1d6988b8ab053d86a3493361dc6b
Gerrit-Change-Number: 15806
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>
Gerrit-Comment-Date: Sat, 25 Apr 2020 00:52:58 +0000
Gerrit-HasComments: No