You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/12/01 01:16:20 UTC

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Todd Lipcon has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5296

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................

KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Ensures that table names and identifiers are valid UTF8 with no embedded
null bytes which could cause problems.

KUDU-981 proposes more stringent restrictions such as limiting to only
ASCII. We should collect more info from users before making such a
limitation since it may affect Chinese users, etc. This more lenient
validation is meant more at catching bugs and identifiers that may cause
downstream software to crash.

Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
---
M src/kudu/client/client-test.cc
M src/kudu/master/catalog_manager.cc
2 files changed, 43 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/5296/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................


Patch Set 1: Code-Review+2

(1 comment)

I'm fond of the policy followed by the Linux kernel with respect to directory entries:
1. The kernel uses '\0' as string termination markers, so userspace must do the same (and can't use them mid-string).
2. The kernel will look for '/' (encoded as per US-ASCII) to split paths, so they take on special meaning in userspace.
3. The rest is fair game.

Forcing everyone to use UTF-8 is pragmatic, but it's also restrictive for users who want to use their own random encoding (granted, the Linux kernel policy is incompatible with many encodings). What do you think of adopting a similar policy for Kudu?

Separately, this change is backwards incompatible and should be documented, and perhaps even gated with a flag (not sure if opt-in or opt-out though).

http://gerrit.cloudera.org:8080/#/c/5296/1//COMMIT_MSG
Commit Message:

PS1, Line 12: limiting to only
            : ASCII
I'm pretty strongly opposed to this.


-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................


KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Ensures that table names and identifiers are valid UTF8 with no embedded
null bytes which could cause problems.

KUDU-981 proposes more stringent restrictions such as limiting to only
ASCII. We should collect more info from users before making such a
limitation since it may affect Chinese users, etc. This more lenient
validation is meant more at catching bugs and identifiers that may cause
downstream software to crash.

Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Reviewed-on: http://gerrit.cloudera.org:8080/5296
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/client/client-test.cc
M src/kudu/master/catalog_manager.cc
2 files changed, 43 insertions(+), 15 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................


Patch Set 1:

> Forcing everyone to use UTF-8 is pragmatic, but it's also restrictive for users who want to use their own random encoding (granted, the Linux kernel policy is incompatible with many encodings). What do you think of adopting a similar policy for Kudu?

I think the days of non-UTF8 international encodings are starting to wane, no?

More importantly, we currently use the 'string' type in our .protos for such identifiers, and protobuf prescribes that strings should be utf8. That has specific impact on the Java APIs which expose these things as 'String' and hard-code UTF8. So, I think in practice we are already enforcing UTF8 for any apps that use the Java client, and this is just adding server-side validation that prevents C++ users from creating tables that can't be read from Java.

-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change.

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5296/1//COMMIT_MSG
Commit Message:

PS1, Line 12: limiting to only
            : ASCII
> I'm pretty strongly opposed to this.
Me too. Though Impala does restrict identifiers to ASCII-only. See http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_identifiers.html


-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-981 (part 1): validate identifiers as UTF8 with no null bytes
......................................................................


Patch Set 2: Code-Review+2

Carrying +2s

-- 
To view, visit http://gerrit.cloudera.org:8080/5296
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3b05e2882c9c2ce9b47c16450d9d54a04d3e38b
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No