Posted to dev@kudu.apache.org by Dan Burkert <da...@apache.org> on 2016/05/19 23:03:33 UTC

Proposal: remove default partitioning for new tables

Hi all,

One of the issues that trips up new Kudu users is the uncertainty about how
partitioning works, and how to use partitioning effectively.  Much of this
can be addressed with better documentation and explanatory materials, and
that should be an area of focus leading up to our 1.0 release. However, the
default partitioning behavior is suboptimal, and changing the default could
lead to significantly less user confusion and frustration. Currently, when
creating a new table, Kudu defaults to using only a single tablet, which is
a known anti-pattern.  This can be painful for users who create a table
assuming Kudu will have good defaults, and begin loading data only to find
out later that they will need to recreate the table with partitioning to
achieve good results.

A better default partitioning strategy might be hash partitioning over the
primary key columns, with a number of hash buckets based on the number of
tablet servers (perhaps something like 3x the number of tablet servers).
This would alleviate the worst scalability issues with the current default;
however, it has a few downsides of its own. Hash partitioning is not
appropriate for every use case, and any rule-of-thumb number of tablets we
could come up with will not always be optimal.
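
As a rough illustration of the rule of thumb above, here is a minimal, self-contained sketch. It is a toy model: HashBucketSketch, suggestedBuckets and bucketFor are hypothetical names, and Arrays.hashCode stands in for whatever hash function Kudu actually uses. It only shows how a "3x the number of tablet servers" bucket count could be derived and how rows would spread across buckets by hashing the primary key columns.

```java
import java.util.Arrays;

// Toy model only: hypothetical names, not Kudu's client API or its real hash function.
final class HashBucketSketch {
    // Rule of thumb floated in the proposal: roughly 3 buckets per tablet server.
    static int suggestedBuckets(int tabletServers) {
        if (tabletServers < 1) {
            throw new IllegalArgumentException("need at least one tablet server");
        }
        return 3 * tabletServers;
    }

    // Assign a row to a bucket by hashing its primary key column values.
    static int bucketFor(int numBuckets, Object... primaryKey) {
        return Math.floorMod(Arrays.hashCode(primaryKey), numBuckets);
    }

    public static void main(String[] args) {
        int buckets = suggestedBuckets(4);
        int[] counts = new int[buckets];
        // Spread some made-up two-column keys across the buckets.
        for (int id = 0; id < 12_000; id++) {
            counts[bucketFor(buckets, "host-" + id, id)]++;
        }
        System.out.println(buckets + " buckets: " + Arrays.toString(counts));
    }
}
```

The fixed multiplier is exactly the kind of number that will not suit every cluster or workload, which is the downside noted above.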

Given that there is no bullet-proof default, that changing the partitioning
strategy after table creation is impossible, and that changing the default
partitioning strategy is a backwards-incompatible change, I propose we
remove the default altogether.  Users would be required to explicitly
specify the table partitioning during creation, and failing to do so would
result in an illegal argument error.  Users who really do want only a
single tablet will still be able to do so by explicitly configuring range
partitioning with no split rows.
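
The proposed rule can be sketched as a small self-contained model. The names here (PartitioningSketch, tabletCount) are hypothetical and this is not the Kudu client API. It assumes that n split rows produce n + 1 range partitions and that the tablet count is the number of hash buckets multiplied by the number of range partitions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: hypothetical names, not Kudu's client API.
final class PartitioningSketch {
    // Number of tablets implied by an explicit partitioning choice.
    static int tabletCount(List<String> hashColumns, int hashBuckets,
                           List<String> rangeColumns, int splitRows) {
        boolean hasHash = !hashColumns.isEmpty();
        boolean hasRange = !rangeColumns.isEmpty();
        if (!hasHash && !hasRange) {
            // The proposed rule: no implicit single-tablet default.
            throw new IllegalArgumentException("table partitioning must be specified");
        }
        if (hasHash && hashBuckets < 1) {
            throw new IllegalArgumentException("hash partitioning needs at least one bucket");
        }
        if (splitRows < 0) {
            throw new IllegalArgumentException("split row count cannot be negative");
        }
        if (!hasRange && splitRows > 0) {
            throw new IllegalArgumentException("split rows require range partition columns");
        }
        int hashPartitions = hasHash ? hashBuckets : 1;
        int rangePartitions = hasRange ? splitRows + 1 : 1; // n split rows make n + 1 ranges
        return hashPartitions * rangePartitions;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<String> key = Arrays.asList("id");
        System.out.println(tabletCount(none, 0, key, 0));  // range on "id", no split rows
        System.out.println(tabletCount(key, 12, none, 0)); // 12 hash buckets on "id"
        try {
            tabletCount(none, 0, none, 0);                 // nothing specified
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model a single tablet is still reachable, but only by asking for it: range partitioning on some column with zero split rows.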

I'd like to get community feedback on whether this seems like a good
direction to take.  I have put together a patch; you can check out the
changes to test files to see what it looks like to add partitioning
explicitly in cases where the default was being relied on.
http://gerrit.cloudera.org:8080/#/c/3131/

- Dan

Re: Proposal: remove default partitioning for new tables

Posted by Dan Burkert <da...@cloudera.com>.
This change was intentional, since I wanted to thoroughly remove the
defaults in one go.  We can always add back implicit range columns when
range splits are specified later if we decide it's too much of a burden.  I
personally like the symmetry with hash partitioning, where setting the
columns explicitly is required.  It also makes it clear that you _do not_
get range partitioning if you do not set the range partitioning columns.  I
can see it being somewhat confusing if we implicitly set range
partitioning columns when there are range splits, but not when there are
hash partitions.
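
The behavior described above can be sketched with a small self-contained model. RangeOptionsSketch and its methods are hypothetical stand-ins, not Kudu's CreateTableOptions, and the sketch models only this one rule: split rows never imply range partition columns, so the columns have to be set explicitly before validation passes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, not Kudu's CreateTableOptions.
final class RangeOptionsSketch {
    private final List<String> rangeColumns = new ArrayList<>();
    private final List<String> splitRows = new ArrayList<>();

    RangeOptionsSketch setRangeColumns(List<String> columns) {
        rangeColumns.clear();
        rangeColumns.addAll(columns);
        return this;
    }

    RangeOptionsSketch addSplitRow(String row) {
        splitRows.add(row);
        return this;
    }

    // Split rows never imply range columns; the columns must be set explicitly.
    void validate() {
        if (!splitRows.isEmpty() && rangeColumns.isEmpty()) {
            throw new IllegalArgumentException("no range partition columns were specified");
        }
    }

    public static void main(String[] args) {
        RangeOptionsSketch options = new RangeOptionsSketch().addSplitRow("m");
        try {
            options.validate(); // split row without range columns
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // Accepted once the range columns are set explicitly.
        options.setRangeColumns(Arrays.asList("name")).validate();
        System.out.println("accepted");
    }
}
```

In the real Java client the corresponding workaround, as mentioned in the quoted message below, is to call setRangePartitionColumns on the CreateTableOptions.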

 - Dan

On Thu, Jun 2, 2016 at 8:40 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Dan,
>
> One quick thing I just stumbled upon... it seems like the old behavior was
> that you could do the following:
>
> CreateTableOptions builder = new CreateTableOptions();
> builder.addSplitRow(...);
> builder.addSplitRow(...);
> ...
> client.createTable("foo", schema, builder);
>
> and it would assume that this was range partitioning based on the whole
> primary key. The user in this case _is_ specifying split rows, so I figured
> this counted as an explicit partitioning choice and thus wouldn't be
> affected by the change mentioned above. Instead, I'm getting an error that
> no range partition columns were specified.
>
> Was this on purpose? Of course I can call setRangePartitionColumns to work
> around it, but didn't know if it was intentional.
>
> -Todd
>
> On Thu, May 26, 2016 at 11:53 PM, Dan Burkert <da...@cloudera.com> wrote:
>
> > Hi all,
> >
> > Thanks for the feedback!  We've made this change, and it will be part of
> > the upcoming 0.9 release.  Going forward, all create table calls must have
> > partitioning specified.  Existing tables will not be affected.
> >
> > - Dan
> >
> > On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
> > jordan.birdsell.kdvm@statefarm.com> wrote:
> >
> > > +1 ...this is a great recommendation
> > >
> > > -----Original Message-----
> > > From: Sand Stone [mailto:sand.m.stone@gmail.com]
> > > Sent: Thursday, May 19, 2016 10:39 PM
> > > To: user@kudu.incubator.apache.org
> > > Cc: dev@kudu.incubator.apache.org
> > > Subject: Re: Proposal: remove default partitioning for new tables
> > >
> > > Agreed that this is a sensible API change.
> > >
> > > On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:
> > >
> > > > > I think this is a very reasonable feature request. I have recently started
> > > > > working with Kudu and the "default" behavior has already tripped me up a
> > > > > couple times.
> > > >
> > > > Thanks,
> > > >
> > > > Abhi
> > > >
> > > > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > > >> One of the issues that trips up new Kudu users is the uncertainty about
> > > > >> how partitioning works, and how to use partitioning effectively.  Much of
> > > > >> this can be addressed with better documentation and explanatory materials,
> > > > >> and that should be an area of focus leading up to our 1.0 release. However,
> > > > >> the default partitioning behavior is suboptimal, and changing the default
> > > > >> could lead to significantly less user confusion and frustration. Currently,
> > > > >> when creating a new table, Kudu defaults to using only a single tablet,
> > > > >> which is a known anti-pattern.  This can be painful for users who create a
> > > > >> table assuming Kudu will have good defaults, and begin loading data only to
> > > > >> find out later that they will need to recreate the table with partitioning
> > > > >> to achieve good results.
> > > > >>
> > > > >> A better default partitioning strategy might be hash partitioning over
> > > > >> the primary key columns, with a number of hash buckets based on the number
> > > > >> of tablet servers (perhaps something like 3x the number of tablet
> > > > >> servers).  This would alleviate the worst scalability issues with the
> > > > >> current default; however, it has a few downsides of its own. Hash
> > > > >> partitioning is not appropriate for every use case, and any rule-of-thumb
> > > > >> number of tablets we could come up with will not always be optimal.
> > > > >>
> > > > >> Given that there is no bullet-proof default, that changing the
> > > > >> partitioning strategy after table creation is impossible, and that changing
> > > > >> the default partitioning strategy is a backwards-incompatible change, I
> > > > >> propose we remove the default altogether.  Users would be required to
> > > > >> explicitly specify the table partitioning during creation, and failing to
> > > > >> do so would result in an illegal argument error.  Users who really do want
> > > > >> only a single tablet will still be able to do so by explicitly configuring
> > > > >> range partitioning with no split rows.
> > > > >>
> > > > >> I'd like to get community feedback on whether this seems like a good
> > > > >> direction to take.  I have put together a patch; you can check out the
> > > > >> changes to test files to see what it looks like to add partitioning
> > > > >> explicitly in cases where the default was being relied on.
> > > > >> http://gerrit.cloudera.org:8080/#/c/3131/
> > > >>
> > > >> - Dan
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Abhi Basu
> > > >
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Proposal: remove default partitioning for new tables

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Dan,

One quick thing I just stumbled upon... it seems like the old behavior was
that you could do the following:

CreateTableOptions builder = new CreateTableOptions();
builder.addSplitRow(...);
builder.addSplitRow(...);
...
client.createTable("foo", schema, builder);

and it would assume that this was range partitioning based on the whole
primary key. The user in this case _is_ specifying split rows, so I figured
this counted as an explicit partitioning choice and thus wouldn't be
affected by the change mentioned above. Instead, I'm getting an error that
no range partition columns were specified.

Was this on purpose? Of course I can call setRangePartitionColumns to work
around it, but didn't know if it was intentional.

-Todd

On Thu, May 26, 2016 at 11:53 PM, Dan Burkert <da...@cloudera.com> wrote:

> Hi all,
>
> Thanks for the feedback!  We've made this change, and it will be part of
> the upcoming 0.9 release.  Going forward, all create table calls must have
> partitioning specified.  Existing tables will not be affected.
>
> - Dan
>
> On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
> jordan.birdsell.kdvm@statefarm.com> wrote:
>
> > +1 ...this is a great recommendation
> >
> > -----Original Message-----
> > From: Sand Stone [mailto:sand.m.stone@gmail.com]
> > Sent: Thursday, May 19, 2016 10:39 PM
> > To: user@kudu.incubator.apache.org
> > Cc: dev@kudu.incubator.apache.org
> > Subject: Re: Proposal: remove default partitioning for new tables
> >
> > Agreed that this is a sensible API change.
> >
> > On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:
> >
> > > I think this is a very reasonable feature request. I have recently started
> > > working with Kudu and the "default" behavior has already tripped me up a
> > > couple times.
> > >
> > > Thanks,
> > >
> > > Abhi
> > >
> > > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> One of the issues that trips up new Kudu users is the uncertainty about
> > >> how partitioning works, and how to use partitioning effectively.  Much of
> > >> this can be addressed with better documentation and explanatory materials,
> > >> and that should be an area of focus leading up to our 1.0 release. However,
> > >> the default partitioning behavior is suboptimal, and changing the default
> > >> could lead to significantly less user confusion and frustration. Currently,
> > >> when creating a new table, Kudu defaults to using only a single tablet,
> > >> which is a known anti-pattern.  This can be painful for users who create a
> > >> table assuming Kudu will have good defaults, and begin loading data only to
> > >> find out later that they will need to recreate the table with partitioning
> > >> to achieve good results.
> > >>
> > >> A better default partitioning strategy might be hash partitioning over
> > >> the primary key columns, with a number of hash buckets based on the number
> > >> of tablet servers (perhaps something like 3x the number of tablet
> > >> servers).  This would alleviate the worst scalability issues with the
> > >> current default; however, it has a few downsides of its own. Hash
> > >> partitioning is not appropriate for every use case, and any rule-of-thumb
> > >> number of tablets we could come up with will not always be optimal.
> > >>
> > >> Given that there is no bullet-proof default, that changing the
> > >> partitioning strategy after table creation is impossible, and that changing
> > >> the default partitioning strategy is a backwards-incompatible change, I
> > >> propose we remove the default altogether.  Users would be required to
> > >> explicitly specify the table partitioning during creation, and failing to
> > >> do so would result in an illegal argument error.  Users who really do want
> > >> only a single tablet will still be able to do so by explicitly configuring
> > >> range partitioning with no split rows.
> > >>
> > >> I'd like to get community feedback on whether this seems like a good
> > >> direction to take.  I have put together a patch; you can check out the
> > >> changes to test files to see what it looks like to add partitioning
> > >> explicitly in cases where the default was being relied on.
> > >> http://gerrit.cloudera.org:8080/#/c/3131/
> > >>
> > >> - Dan
> > >>
> > >
> > >
> > >
> > > --
> > > Abhi Basu
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Proposal: remove default partitioning for new tables

Posted by Dan Burkert <da...@cloudera.com>.
Hi all,

Thanks for the feedback!  We've made this change, and it will be part of
the upcoming 0.9 release.  Going forward, all create table calls must have
partitioning specified.  Existing tables will not be affected.

- Dan

On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
jordan.birdsell.kdvm@statefarm.com> wrote:

> +1 ...this is a great recommendation
>
> -----Original Message-----
> From: Sand Stone [mailto:sand.m.stone@gmail.com]
> Sent: Thursday, May 19, 2016 10:39 PM
> To: user@kudu.incubator.apache.org
> Cc: dev@kudu.incubator.apache.org
> Subject: Re: Proposal: remove default partitioning for new tables
>
> Agreed that this is a sensible API change.
>
> On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:
>
> > I think this is a very reasonable feature request. I have recently started
> > working with Kudu and the "default" behavior has already tripped me up a
> > couple times.
> >
> > Thanks,
> >
> > Abhi
> >
> > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> > wrote:
> >
> >> Hi all,
> >>
> >> One of the issues that trips up new Kudu users is the uncertainty about
> >> how partitioning works, and how to use partitioning effectively.  Much of
> >> this can be addressed with better documentation and explanatory materials,
> >> and that should be an area of focus leading up to our 1.0 release. However,
> >> the default partitioning behavior is suboptimal, and changing the default
> >> could lead to significantly less user confusion and frustration. Currently,
> >> when creating a new table, Kudu defaults to using only a single tablet,
> >> which is a known anti-pattern.  This can be painful for users who create a
> >> table assuming Kudu will have good defaults, and begin loading data only to
> >> find out later that they will need to recreate the table with partitioning
> >> to achieve good results.
> >>
> >> A better default partitioning strategy might be hash partitioning over
> >> the primary key columns, with a number of hash buckets based on the number
> >> of tablet servers (perhaps something like 3x the number of tablet
> >> servers).  This would alleviate the worst scalability issues with the
> >> current default; however, it has a few downsides of its own. Hash
> >> partitioning is not appropriate for every use case, and any rule-of-thumb
> >> number of tablets we could come up with will not always be optimal.
> >>
> >> Given that there is no bullet-proof default, that changing the
> >> partitioning strategy after table creation is impossible, and that changing
> >> the default partitioning strategy is a backwards-incompatible change, I
> >> propose we remove the default altogether.  Users would be required to
> >> explicitly specify the table partitioning during creation, and failing to
> >> do so would result in an illegal argument error.  Users who really do want
> >> only a single tablet will still be able to do so by explicitly configuring
> >> range partitioning with no split rows.
> >>
> >> I'd like to get community feedback on whether this seems like a good
> >> direction to take.  I have put together a patch; you can check out the
> >> changes to test files to see what it looks like to add partitioning
> >> explicitly in cases where the default was being relied on.
> >> http://gerrit.cloudera.org:8080/#/c/3131/
> >>
> >> - Dan
> >>
> >
> >
> >
> > --
> > Abhi Basu
> >
>

RE: Proposal: remove default partitioning for new tables

Posted by Jordan Birdsell <jo...@statefarm.com>.
+1 ...this is a great recommendation

-----Original Message-----
From: Sand Stone [mailto:sand.m.stone@gmail.com] 
Sent: Thursday, May 19, 2016 10:39 PM
To: user@kudu.incubator.apache.org
Cc: dev@kudu.incubator.apache.org
Subject: Re: Proposal: remove default partitioning for new tables

Agreed that this is a sensible API change.

On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:

> I think this is a very reasonable feature request. I have recently started
> working with Kudu and the "default" behavior has already tripped me up a
> couple times.
>
> Thanks,
>
> Abhi
>
> On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> wrote:
>
>> Hi all,
>>
>> One of the issues that trips up new Kudu users is the uncertainty about
>> how partitioning works, and how to use partitioning effectively.  Much of
>> this can be addressed with better documentation and explanatory materials,
>> and that should be an area of focus leading up to our 1.0 release. However,
>> the default partitioning behavior is suboptimal, and changing the default
>> could lead to significantly less user confusion and frustration. Currently,
>> when creating a new table, Kudu defaults to using only a single tablet,
>> which is a known anti-pattern.  This can be painful for users who create a
>> table assuming Kudu will have good defaults, and begin loading data only to
>> find out later that they will need to recreate the table with partitioning
>> to achieve good results.
>>
>> A better default partitioning strategy might be hash partitioning over
>> the primary key columns, with a number of hash buckets based on the number
>> of tablet servers (perhaps something like 3x the number of tablet
>> servers).  This would alleviate the worst scalability issues with the
>> current default; however, it has a few downsides of its own. Hash
>> partitioning is not appropriate for every use case, and any rule-of-thumb
>> number of tablets we could come up with will not always be optimal.
>>
>> Given that there is no bullet-proof default, that changing the
>> partitioning strategy after table creation is impossible, and that changing the
>> default partitioning strategy is a backwards-incompatible change, I propose
>> we remove the default altogether.  Users would be required to explicitly
>> specify the table partitioning during creation, and failing to do so would
>> result in an illegal argument error.  Users who really do want only a
>> single tablet will still be able to do so by explicitly configuring range
>> partitioning with no split rows.
>>
>> I'd like to get community feedback on whether this seems like a good
>> direction to take.  I have put together a patch, you can check out the
>> changes to test files to see what it looks like to add partitioning
>> explicitly in cases where the default was being relied on.
>> http://gerrit.cloudera.org:8080/#/c/3131/
>>
>> - Dan
>>
>
>
>
> --
> Abhi Basu
>

RE: Proposal: remove default partitioning for new tables

Posted by Jordan Birdsell <jo...@statefarm.com>.
+1 ...this is a great recommendation

-----Original Message-----
From: Sand Stone [mailto:sand.m.stone@gmail.com] 
Sent: Thursday, May 19, 2016 10:39 PM
To: user@kudu.incubator.apache.org
Cc: dev@kudu.incubator.apache.org
Subject: Re: Proposal: remove default partitioning for new tables

Agreed that this is a sensible API change.

On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:

> I think this is a very reasonable feature request. I have recently started
> working with Kudu and the "default" behavior has already tripped me up a
> couple times.
>
> Thanks,
>
> Abhi
>
> On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> wrote:
>
>> Hi all,
>>
>> One of the issues that trips up new Kudu users is the uncertainty about
>> how partitioning works, and how to use partitioning effectively.  Much of
>> this can be addressed with better documentation and explanatory materials,
>> and that should be an area of focus leading up to our 1.0 release. However,
>> the default partitioning behavior is suboptimal, and changing the default
>> could lead to significantly less user confusion and frustration. Currently,
>> when creating a new table, Kudu defaults to using only a single tablet,
>> which is a known anti-pattern.  This can be painful for users who create a
>> table assuming Kudu will have good defaults, and begin loading data only to
>> find out later that they will need to recreate the table with partitioning
>> to achieve good results.
>>
>> A better default partitioning strategy might be hash partitioning over
>> the primary key columns, with a number of hash buckets based on the number
>> of tablet servers (perhaps something like 3x the number of tablet
>> servers).  This would alleviate the worst scalability issues with the
>> current default, however it has a few downsides of its own. Hash
>> partitioning is not appropriate for every use case, and any rule-of-thumb
>> number of tablets we could come up with will not always be optimal.
>>
>> Given that there is no bullet-proof default, and that changing
>> partitioning strategy after table creation is impossible, and changing the
>> default partitioning strategy is a backwards incompatible change, I propose
>> we remove the default altogether.  Users would be required to explicitly
>> specify the table partitioning during creation, and failing to do so would
>> result in an illegal argument error.  Users who really do want only a
>> single tablet will still be able to do so by explicitly configuring range
>> partitioning with no split rows.
>>
>> I'd like to get community feedback on whether this seems like a good
>> direction to take.  I have put together a patch, you can check out the
>> changes to test files to see what it looks like to add partitioning
>> explicitly in cases where the default was being relied on.
>> http://gerrit.cloudera.org:8080/#/c/3131/
>>
>> - Dan
>>
>
>
>
> --
> Abhi Basu
>

Re: Proposal: remove default partitioning for new tables

Posted by Abhi Basu <90...@gmail.com>.
I think this is a very reasonable feature request. I have recently started
working with Kudu and the "default" behavior has already tripped me up a
couple times.

Thanks,

Abhi

On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org> wrote:

> Hi all,
>
> One of the issues that trips up new Kudu users is the uncertainty about
> how partitioning works, and how to use partitioning effectively.  Much of
> this can be addressed with better documentation and explanatory materials,
> and that should be an area of focus leading up to our 1.0 release. However,
> the default partitioning behavior is suboptimal, and changing the default
> could lead to significantly less user confusion and frustration. Currently,
> when creating a new table, Kudu defaults to using only a single tablet,
> which is a known anti-pattern.  This can be painful for users who create a
> table assuming Kudu will have good defaults, and begin loading data only to
> find out later that they will need to recreate the table with partitioning
> to achieve good results.
>
> A better default partitioning strategy might be hash partitioning over the
> primary key columns, with a number of hash buckets based on the number of
> tablet servers (perhaps something like 3x the number of tablet servers).
> This would alleviate the worst scalability issues with the current default,
> however it has a few downsides of its own. Hash partitioning is not
> appropriate for every use case, and any rule-of-thumb number of tablets we
> could come up with will not always be optimal.
>
> Given that there is no bullet-proof default, and that changing
> partitioning strategy after table creation is impossible, and changing the
> default partitioning strategy is a backwards incompatible change, I propose
> we remove the default altogether.  Users would be required to explicitly
> specify the table partitioning during creation, and failing to do so would
> result in an illegal argument error.  Users who really do want only a
> single tablet will still be able to do so by explicitly configuring range
> partitioning with no split rows.
>
> I'd like to get community feedback on whether this seems like a good
> direction to take.  I have put together a patch, you can check out the
> changes to test files to see what it looks like to add partitioning
> explicitly in cases where the default was being relied on.
> http://gerrit.cloudera.org:8080/#/c/3131/
>
> - Dan
>



-- 
Abhi Basu
