You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@kudu.apache.org by "Dan Burkert (Code Review)" <ge...@cloudera.org> on 2016/09/20 23:21:48 UTC

[kudu-CR] refactor schema-design guide

Hello Adar Dembo, Will Berkeley, Todd Lipcon, John Russell,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4485

to review the following change.

Change subject: refactor schema-design guide
......................................................................

refactor schema-design guide

Introduce some long-overdue changes to the schema design guide. In
particular, more content on adding and dropping range partitions, as
well as a better introduction to how to analyze partitioning strategies.

Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
---
A docs/images/hash-hash-partitioning-example.png
A docs/images/hash-partitioning-example.png
A docs/images/hash-range-partitioning-example.png
A docs/images/range-partitioning-example.png
A docs/media-src/schema_design.graffle
M docs/schema_design.adoc
6 files changed, 3,649 insertions(+), 228 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/85/4485/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] refactor schema-design guide

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.

Todd Lipcon has posted comments on this change.

Change subject: refactor schema-design guide
......................................................................


Patch Set 1:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/4485/1/docs/schema_design.adoc
File docs/schema_design.adoc:

PS1, Line 40:  traditional relational databases
even partitioning isn't unfamiliar if you've used MPP relational databases. Perhaps say something like "familiar with single-node relational databases"?


PS1, Line 123:  If the column values of
             : a given row set are unable to be compressed because the number of unique values
             : is too high, Kudu will transparently fall back to plain encoding for that row
             : set. This is evaluated during flush.
I'm not super keen on how internals-focused this bit of doc is, since we don't really expect users to understand rowsets and flushes, etc... but feel free to leave, since you're just reformatting this and didn't write it.


PS1, Line 144: not
             : typically beneficial
maybe we should be stronger here: it is redundant and actually hurts perf


PS1, Line 155: rows
typo: row


PS1, Line 155: failure
maybe "error" instead of "failure"


PS1, Line 161: ,
should be a ';' before 'however'


PS1, Line 167:  RDBMs
hrm, not sure how to pluralize "RDBMS" but I dont think this is it. maybe just expand "relational databases"


PS1, Line 167: most 
s/most/many/ (eg innodb is PK-clustered but MyISAM is not, other databases require you to explicitly say CLUSTERED when you create the index)


PS1, Line 168: rows within a tablet are kept in primary key sorted order
this isn't quite accurate, but maybe the inaccuracy is OK because it gets the point across? any way to improve accuracy without increasing complexity?


PS1, Line 180: traditional relational tables
I think MPP is "traditional"


PS1, Line 188: many short scans,
maybe worth clarifying this? essentially the crossover point is where the fixed costs of contacting a server are of a similar order of magnitude to the actual amount of work done, such that adding more servers consumes more resources but doesn't actually improve (or harms) performance.


PS1, Line 189: the
each scan


PS1, Line 197: strongly recommended that new tables have at least as many tablets as tablet
             : servers.
worth calling out as an exception the case of small tables? eg TPCH "nation" which has 5 rows probably doesn't need 100 buckets on a 100-node cluster. That's an extreme example, but in general I think the blanket "strong recommendation" here is a bit much.


PS1, Line 294:  partitioning a
dangling phrase


-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] Refactor schema-design guide

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.

Dan Burkert has posted comments on this change.

Change subject: Refactor schema-design guide
......................................................................


Patch Set 1:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/4485/1//COMMIT_MSG
Commit Message:

PS1, Line 7: r
> Nit capitalize
Done


http://gerrit.cloudera.org:8080/#/c/4485/1/docs/schema_design.adoc
File docs/schema_design.adoc:

PS1, Line 40:  traditional relational databases
> even partitioning isn't unfamiliar if you've used MPP relational databases.
I've added 'non-distributed'


PS1, Line 40: next
> doesn't the next section discuss the perfect schema and the concerns you me
Done


PS1, Line 49: the
> nit: remove "the"?
Done


Line 51:   remain steady over time. This is most impacted by partitioning.
> nit: say "table", like above, for consistency, or remove table above, same 
Done


PS1, Line 64: optionally
> nit: consider dropping 'optionally'
Done


PS1, Line 123:  If the column values of
             : a given row set are unable to be compressed because the number of unique values
             : is too high, Kudu will transparently fall back to plain encoding for that row
             : set. This is evaluated during flush.
> I'm not super keen on how internals-focused this bit of doc is, since we do
I think this is an important bit of info, since it effectively lets people know that the downside of dictionary encoding is not very big.


PS1, Line 144: not
             : typically beneficial
> maybe we should be stronger here: it is redundant and actually hurts perf
I've changed to "not recommended".


PS1, Line 161: ,
> should be a ';' before 'however'
Done


PS1, Line 167:  RDBMs
> hrm, not sure how to pluralize "RDBMS" but I dont think this is it. maybe j
Done


PS1, Line 167: most 
> s/most/many/ (eg innodb is PK-clustered but MyISAM is not, other databases 
Done


PS1, Line 168: rows within a tablet are kept in primary key sorted order
> this isn't quite accurate, but maybe the inaccuracy is OK because it gets t
Are you referring to overlapping rowsets? I think it's fine to brush over that.


PS1, Line 180: traditional relational tables
> I think MPP is "traditional"
Done


PS1, Line 188: many short scans,
> maybe worth clarifying this? essentially the crossover point is where the f
I added the reasoning for preferring short scans be over a single tablet.  I didn't add reasoning to prefer parallelism for long scans, since this isn't a fundamental tradeoff that Kudu makes (we will eventually make parallelizing a single scan over the same tablet easy).


PS1, Line 189: the
> each scan
Done


PS1, Line 197: strongly recommended that new tables have at least as many tablets as tablet
             : servers.
> worth calling out as an exception the case of small tables? eg TPCH "nation
Done


PS1, Line 221: Attempting to insert a row which falls in a 'non-covered'
             : portion of the range key space will fail.
> For users familar with HBase and such this will seem weird without more con
It's not really that simple any more, the number of tablets is equal to the number of range partitions specified, plus the number of split points.  I tried to convey this in the first few sentences of this paragraph.


PS1, Line 255:  
> add randomly
Done


PS1, Line 256: write skew
> What is write skew here?  If it is as defined at https://en.wikipedia.org/w
It was meant as a synonym of hot spotting, I'll change it.


PS1, Line 256: which helps
> Does it make sense to add that it also allows to achieve higher write/read 
Done


PS1, Line 294:  partitioning a
> dangling phrase
Done


PS1, Line 302: STRING
> nit: if the syntax is targeted to be very close to SQL, consider using VARC
That's a good point.  However, since we introduced the types using the Kudu names (string, int32, etc.), I can see it being confusing having to define the mapping.


PS1, Line 305: DOUBLE
> nit: DOUBLE PRECISION or just use FLOAT instead
same as above


PS1, Line 306: ,
> nit: strayed comma; in some DB engines it might cause syntax error.
This isn't meant to be run.


PS1, Line 306: timestamp
> nit: the column name in columns definition and in primary key constraint do
Done


Line 310: [[range-partitioning-example]]
> any specific reason why the examples aren't next to their partitioning type
I think having the reference together and the examples together will make it easier in the long run, although maybe it hurts readability for the first time through.  I don't feel too strongly either way.


PS1, Line 442: Known Limitations
> Is it worth adding that Kudu does not support altering column types?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] Refactor schema-design guide

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.

Alexey Serbin has posted comments on this change.

Change subject: Refactor schema-design guide
......................................................................


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] refactor schema-design guide

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.

Alexey Serbin has posted comments on this change.

Change subject: refactor schema-design guide
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4485/1/docs/schema_design.adoc
File docs/schema_design.adoc:

PS1, Line 64: optionally
nit: consider dropping 'optionally'


PS1, Line 256: which helps
Does it make sense to add that it also allows to achieve higher write/read throughput and avoid hotspotting?


PS1, Line 256: write skew
What is write skew here?  If it is as defined at https://en.wikipedia.org/wiki/Snapshot_isolation, then I don't quite understand how hash partitioning helps.


PS1, Line 302: STRING
nit: if the syntax is targeted to be very close to SQL, consider using VARCHAR instead of STRING.


PS1, Line 305: DOUBLE
nit: DOUBLE PRECISION or just use FLOAT instead


PS1, Line 306: ,
nit: strayed comma; in some DB engines it might cause syntax error.


PS1, Line 306: timestamp
nit: the column name in columns definition and in primary key constraint do not match; consider bringing them in sync.


PS1, Line 442: Known Limitations
Is it worth adding that Kudu does not support altering column types?


-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] Refactor schema-design guide

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4485

to look at the new patch set (#2).

Change subject: Refactor schema-design guide
......................................................................

Refactor schema-design guide

Introduce some long-overdue changes to the schema design guide. In
particular, more content on adding and dropping range partitions, as
well as a better introduction to how to analyze partitioning strategies.

Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
---
A docs/images/hash-hash-partitioning-example.png
A docs/images/hash-partitioning-example.png
A docs/images/hash-range-partitioning-example.png
A docs/images/range-partitioning-example.png
A docs/media-src/schema_design.graffle
M docs/schema_design.adoc
6 files changed, 3,656 insertions(+), 231 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/85/4485/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] refactor schema-design guide

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.

David Ribeiro Alves has posted comments on this change.

Change subject: refactor schema-design guide
......................................................................


Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4485/1//COMMIT_MSG
Commit Message:

PS1, Line 7: r
Nit capitalize


http://gerrit.cloudera.org:8080/#/c/4485/1/docs/schema_design.adoc
File docs/schema_design.adoc:

PS1, Line 40: next
doesn't the next section discuss the perfect schema and the concerns you mention before?


PS1, Line 49: the
nit: remove "the"?


Line 51:   remain steady over time. This is most impacted by partitioning.
nit: say "table", like above, for consistency, or remove table above, same below


PS1, Line 155: failure
s/failure/error


PS1, Line 155: rows 
s/rows/row


PS1, Line 221: Attempting to insert a row which falls in a 'non-covered'
             : portion of the range key space will fail.
For users familar with HBase and such this will seem weird without more context, i.e. How can a range be non-covered? Please elaborate a bit on that, jut point to the next section or remove this altogether and introduce it only in the next section.

Maybe also mention that the user would get NumSplitPoints + 1 tablets.


PS1, Line 255:  
add randomly


PS1, Line 255: is an effective tool
xxs nit: s/is an effective tool/is effective the former sounds repetitive with the previous sentence


Line 310: [[range-partitioning-example]]
any specific reason why the examples aren't next to their partitioning type? could they be? the explanation for multi level, for instance is abstract enough that my guess is most people will need the example to understand it


-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] Refactor schema-design guide

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.

Dan Burkert has submitted this change and it was merged.

Change subject: Refactor schema-design guide
......................................................................


Refactor schema-design guide

Introduce some long-overdue changes to the schema design guide. In
particular, more content on adding and dropping range partitions, as
well as a better introduction to how to analyze partitioning strategies.

Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Reviewed-on: http://gerrit.cloudera.org:8080/4485
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
A docs/images/hash-hash-partitioning-example.png
A docs/images/hash-partitioning-example.png
A docs/images/hash-range-partitioning-example.png
A docs/images/range-partitioning-example.png
A docs/media-src/schema_design.graffle
M docs/schema_design.adoc
6 files changed, 3,656 insertions(+), 231 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] refactor schema-design guide

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.

Dan Burkert has posted comments on this change.

Change subject: refactor schema-design guide
......................................................................


Patch Set 1:

rendered: https://github.com/danburkert/kudu/blob/schema-design/docs/schema_design.adoc

-- 
To view, visit http://gerrit.cloudera.org:8080/4485
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If4d91043d72e816a930d50927bf9441bd6b30b9b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: John Russell <jr...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No