Posted to reviews@kudu.apache.org by "Attila Bukor (Code Review)" <ge...@cloudera.org> on 2019/08/12 19:28:07 UTC

[kudu-CR] KUDU-1938 Add support for CHAR/VARCHAR pt 1

Hello Will Berkeley, Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13760

to look at the new patch set (#25).

Change subject: KUDU-1938 Add support for CHAR/VARCHAR pt 1
......................................................................

KUDU-1938 Add support for CHAR/VARCHAR pt 1

Introduces the CHAR and VARCHAR data types to the server. Follow-up
commits will add client integration. The CHAR and VARCHAR types are
parameterized with a length column type attribute, similar to
DECIMAL's scale and precision. Internally, both of them are stored as
BINARY.

The maximum length is 65,535 for VARCHAR and 255 for CHAR. Values of
both types are truncated to the length of the column; for CHAR fields,
any trailing spaces left after truncation are also removed.

Truncation happens on the server side before the data is persisted, both
to avoid wasting space and so that predicates are applied correctly.
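
A minimal sketch of the truncation rules described above, not the code
from this patch. The helper names (TruncateUtf8, NormalizeChar,
NormalizeVarchar) are hypothetical, and the sketch assumes that the
input is valid UTF-8 and that the column length counts characters
rather than bytes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns `value` cut down to at most `max_chars`
// UTF-8 characters. Assumes `value` is valid UTF-8.
std::string TruncateUtf8(const std::string& value, std::size_t max_chars) {
  std::size_t chars = 0;
  std::size_t i = 0;
  while (i < value.size()) {
    // Bytes of the form 10xxxxxx continue the previous character;
    // any other byte starts a new one.
    const bool is_lead =
        (static_cast<unsigned char>(value[i]) & 0xC0) != 0x80;
    if (is_lead) {
      if (chars == max_chars) break;  // the next character would not fit
      ++chars;
    }
    ++i;
  }
  return value.substr(0, i);
}

// CHAR: truncate to the column length, then drop any trailing spaces
// that remain after truncation.
std::string NormalizeChar(const std::string& value, std::size_t max_chars) {
  assert(max_chars <= 255);  // maximum CHAR length from the commit message
  std::string out = TruncateUtf8(value, max_chars);
  const std::size_t last = out.find_last_not_of(' ');
  out.erase(last == std::string::npos ? 0 : last + 1);
  return out;
}

// VARCHAR: truncate to the column length, keep trailing spaces.
std::string NormalizeVarchar(const std::string& value,
                             std::size_t max_chars) {
  assert(max_chars <= 65535);  // maximum VARCHAR length
  return TruncateUtf8(value, max_chars);
}
```

With these helpers, NormalizeChar("ab   cd", 4) first truncates to
"ab  " and then trims the result to "ab", while NormalizeVarchar on the
same input would keep "ab  ".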

The maximum lengths were chosen for compatibility reasons. Apache Impala
has a maximum length of 255 characters for CHAR and 65,535 for VARCHAR,
and the major RDBMSs I checked have similar or lower limits.

The 'standard' approach used by traditional RDBMSs, padding CHARs,
differs from Apache Impala's approach, which is now ours as well.

Originally I implemented the padding of CHARs *before* persisting, which
seems to be what other databases (e.g. MySQL[1], Oracle[2] and
PostgreSQL[3]) do. IIRC this was originally meant to give fixed-width
rows, but with UTF-8 they still wouldn't be fixed-width, as UTF-8 itself
is a variable-length encoding.

In MySQL's case the trailing spaces are even removed by default when
scanned:

> The length of a CHAR column is fixed to the length that you declare
> when you create the table. The length can be any value from 0 to 255.
> When CHAR values are stored, they are right-padded with spaces to the
> specified length. When CHAR values are retrieved, trailing spaces are
> removed unless the PAD_CHAR_TO_FULL_LENGTH SQL mode is enabled.

Impala[4], on the other hand, stores the data without trailing
whitespace and pads it upon retrieval:

> If you store a CHAR value containing trailing spaces in a table, those
> trailing spaces are not stored in the data file. When the value is
> retrieved by a query, the result could have a different number of
> trailing spaces. That is, the value includes however many spaces are
> needed to pad it to the specified length of the column.

Due to the variable-length nature of UTF-8 and the columnar format, I
believe it makes the most sense to implement it the same way Impala did.
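
As an illustration of the pad-on-retrieval behavior quoted from the
Impala documentation, here is a minimal sketch. The function names
(CountUtf8Chars, PadCharForRead) are hypothetical and not part of this
patch; the sketch assumes valid UTF-8 and a column length counted in
characters, and it takes no position on which component would do the
padding:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters by counting every byte
// that is not a continuation byte (10xxxxxx). Assumes valid UTF-8.
std::size_t CountUtf8Chars(const std::string& value) {
  std::size_t chars = 0;
  for (const char c : value) {
    if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++chars;
  }
  return chars;
}

// Pads a stored CHAR value, which has no trailing spaces, with spaces
// up to `column_length` characters when it is read back.
std::string PadCharForRead(const std::string& stored,
                           std::size_t column_length) {
  const std::size_t chars = CountUtf8Chars(stored);
  if (chars >= column_length) return stored;  // nothing to pad
  return stored + std::string(column_length - chars, ' ');
}
```

Because the padding is computed from the character count, a value with
multi-byte characters gets fewer padding bytes than its byte length
alone would suggest, which is why padded UTF-8 rows are still not
fixed-width in bytes.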

[1] https://docs.oracle.com/cd/E17952_01/mysql-5.1-en/char.html
[2] https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT1821
[3] https://www.postgresql.org/docs/9.0/datatype-character.html
[4] https://impala.apache.org/docs/build/html/topics/impala_char.html

Change-Id: I998982dba93831db91c43a97ce30d3e68c2a4a54
---
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/common.proto
M src/kudu/common/partial_row-test.cc
M src/kudu/common/partial_row.cc
M src/kudu/common/partial_row.h
M src/kudu/common/schema.cc
M src/kudu/common/schema.h
M src/kudu/common/types.cc
M src/kudu/common/types.h
M src/kudu/common/wire_protocol.cc
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util.cc
A src/kudu/util/char_util.h
13 files changed, 357 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/60/13760/25
-- 
To view, visit http://gerrit.cloudera.org:8080/13760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I998982dba93831db91c43a97ce30d3e68c2a4a54
Gerrit-Change-Number: 13760
Gerrit-PatchSet: 25
Gerrit-Owner: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>