Posted to dev@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2016/06/02 15:40:42 UTC

Re: Proposal: remove default partitioning for new tables

Hey Dan,

One quick thing I just stumbled upon... it seems like the old behavior was
that you could do the following:

CreateTableOptions builder = new CreateTableOptions();
builder.addSplitRow(...);
builder.addSplitRow(...);
...
client.createTable("foo", schema, builder);

and it would assume that this was range partitioning based on the whole
primary key. The user in this case _is_ specifying split rows, so I figured
this counted as an explicit partitioning choice and thus wouldn't be
affected by the change mentioned above. Instead, I'm getting an error that
no range partition columns were specified.

Was this on purpose? Of course I can call setRangePartitionColumns to work
around it, but didn't know if it was intentional.

-Todd

On Thu, May 26, 2016 at 11:53 PM, Dan Burkert <da...@cloudera.com> wrote:

> Hi all,
>
> Thanks for the feedback!  We've made this change, and it will be part of
> the upcoming 0.9 release.  Going forward, all create table calls must have
> partitioning specified.  Existing tables will not be affected.
>
> - Dan
>
> On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
> jordan.birdsell.kdvm@statefarm.com> wrote:
>
> > +1 ...this is a great recommendation
> >
> > -----Original Message-----
> > From: Sand Stone [mailto:sand.m.stone@gmail.com]
> > Sent: Thursday, May 19, 2016 10:39 PM
> > To: user@kudu.incubator.apache.org
> > Cc: dev@kudu.incubator.apache.org
> > Subject: Re: Proposal: remove default partitioning for new tables
> >
> > Agreed that this is a sensible API change.
> >
> > On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:
> >
> > > I think this is a very reasonable feature request. I have recently
> > > started working with Kudu and the "default" behavior has already
> > > tripped me up a couple times.
> > >
> > > Thanks,
> > >
> > > Abhi
> > >
> > > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> One of the issues that trips up new Kudu users is the uncertainty
> > >> about how partitioning works, and how to use partitioning
> > >> effectively.  Much of this can be addressed with better
> > >> documentation and explanatory materials, and that should be an area
> > >> of focus leading up to our 1.0 release.  However, the default
> > >> partitioning behavior is suboptimal, and changing the default could
> > >> lead to significantly less user confusion and frustration.
> > >> Currently, when creating a new table, Kudu defaults to using only a
> > >> single tablet, which is a known anti-pattern.  This can be painful
> > >> for users who create a table assuming Kudu will have good defaults,
> > >> and begin loading data only to find out later that they will need
> > >> to recreate the table with partitioning to achieve good results.
> > >>
> > >> A better default partitioning strategy might be hash partitioning
> > >> over the primary key columns, with a number of hash buckets based
> > >> on the number of tablet servers (perhaps something like 3x the
> > >> number of tablet servers).  This would alleviate the worst
> > >> scalability issues with the current default, however it has a few
> > >> downsides of its own.  Hash partitioning is not appropriate for
> > >> every use case, and any rule-of-thumb number of tablets we could
> > >> come up with will not always be optimal.
> > >>
> > >> Given that there is no bullet-proof default, and that changing
> > >> partitioning strategy after table creation is impossible, and
> > >> changing the default partitioning strategy is a backwards
> > >> incompatible change, I propose we remove the default altogether.
> > >> Users would be required to explicitly specify the table
> > >> partitioning during creation, and failing to do so would result in
> > >> an illegal argument error.  Users who really do want only a single
> > >> tablet will still be able to do so by explicitly configuring range
> > >> partitioning with no split rows.
> > >>
> > >> I'd like to get community feedback on whether this seems like a good
> > >> direction to take.  I have put together a patch, you can check out the
> > >> changes to test files to see what it looks like to add partitioning
> > >> explicitly in cases where the default was being relied on.
> > >> http://gerrit.cloudera.org:8080/#/c/3131/
> > >>
> > >> - Dan
> > >>
> > >
> > >
> > >
> > > --
> > > Abhi Basu
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Proposal: remove default partitioning for new tables

Posted by Dan Burkert <da...@cloudera.com>.
This change was intentional, since I wanted to thoroughly remove the
defaults in one go.  We can always add back implicit range columns when
range splits are specified later if we decide it's too much of a burden.  I
personally like the symmetry with hash partitioning, where setting the
columns explicitly is required.  It also makes it clear that you _do not_
get range partitioning if you do not set the range partitioning columns.  I
can see it being somewhat confusing if we implicitly set range
partitioning columns when there are range splits, but not when there are
hash partitions.
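
As a rough illustration of the rule described above, here is a toy model in
plain Java. It is not the Kudu client code: the class ToyTableOptions, its
string-based split rows, and the error messages are all made up for this
sketch. Only the method names setRangePartitionColumns and addSplitRow come
from this thread; addHashPartitions is assumed to mirror the client's hash
partitioning call. The model rejects a table with no partitioning at all, and
also rejects split rows given without range partition columns.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only. This is NOT Kudu's CreateTableOptions;
// the validation logic and messages below are assumptions made for the sketch.
final class ToyTableOptions {
    private List<String> rangeColumns = null;              // null = never set
    private final List<String> splitRows = new ArrayList<>();
    private int hashBuckets = 0;                           // 0 = no hash partitioning

    ToyTableOptions setRangePartitionColumns(List<String> columns) {
        rangeColumns = new ArrayList<>(columns);
        return this;
    }

    // A real split row is a partial row of key values; a String stands in here.
    ToyTableOptions addSplitRow(String splitKey) {
        splitRows.add(splitKey);
        return this;
    }

    ToyTableOptions addHashPartitions(List<String> columns, int buckets) {
        if (columns.isEmpty() || buckets < 2) {
            throw new IllegalArgumentException(
                "hash partitioning needs columns and at least 2 buckets");
        }
        hashBuckets = buckets;
        return this;
    }

    // Validates the options and returns how many tablets the table would have.
    int tabletCount() {
        boolean hasRangeColumns = rangeColumns != null && !rangeColumns.isEmpty();
        if (!splitRows.isEmpty() && !hasRangeColumns) {
            // Split rows alone do not imply range partitioning.
            throw new IllegalArgumentException(
                "no range partition columns were specified");
        }
        if (rangeColumns == null && hashBuckets == 0) {
            // No default: the caller must choose a partitioning explicitly.
            throw new IllegalArgumentException(
                "table partitioning must be specified");
        }
        int rangePartitions = splitRows.size() + 1;   // n split rows -> n + 1 ranges
        int hashPartitions = hashBuckets == 0 ? 1 : hashBuckets;
        return rangePartitions * hashPartitions;
    }
}
```

In this model, calling addSplitRow twice without setRangePartitionColumns
fails in tabletCount(), while setting the range columns first and then adding
two split rows yields three tablets. Setting range columns with no split rows
gives the explicit single-tablet table mentioned in the original proposal.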

 - Dan

On Thu, Jun 2, 2016 at 8:40 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Dan,
>
> One quick thing I just stumbled upon... it seems like the old behavior was
> that you could do the following:
>
> CreateTableOptions builder = new CreateTableOptions();
> builder.addSplitRow(...);
> builder.addSplitRow(...);
> ...
> client.createTable("foo", schema, builder);
>
> and it would assume that this was range partitioning based on the whole
> primary key. The user in this case _is_ specifying split rows, so I figured
> this counted as an explicit partitioning choice and thus wouldn't be
> affected by the change mentioned above. Instead, I'm getting an error that
> no range partition columns were specified.
>
> Was this on purpose? Of course I can call setRangePartitionColumns to work
> around it, but didn't know if it was intentional.
>
> -Todd
>
> On Thu, May 26, 2016 at 11:53 PM, Dan Burkert <da...@cloudera.com> wrote:
>
> > Hi all,
> >
> > Thanks for the feedback!  We've made this change, and it will be part of
> > the upcoming 0.9 release.  Going forward, all create table calls must
> > have partitioning specified.  Existing tables will not be affected.
> >
> > - Dan
> >
> > On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
> > jordan.birdsell.kdvm@statefarm.com> wrote:
> >
> > > +1 ...this is a great recommendation
> > >
> > > -----Original Message-----
> > > From: Sand Stone [mailto:sand.m.stone@gmail.com]
> > > Sent: Thursday, May 19, 2016 10:39 PM
> > > To: user@kudu.incubator.apache.org
> > > Cc: dev@kudu.incubator.apache.org
> > > Subject: Re: Proposal: remove default partitioning for new tables
> > >
> > > Agreed that this is a sensible API change.
> > >
> > > On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <90...@gmail.com> wrote:
> > >
> > > > I think this is a very reasonable feature request. I have recently
> > > > started working with Kudu and the "default" behavior has already
> > > > tripped me up a couple times.
> > > >
> > > > Thanks,
> > > >
> > > > Abhi
> > > >
> > > > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <da...@apache.org>
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> One of the issues that trips up new Kudu users is the uncertainty
> > > >> about how partitioning works, and how to use partitioning
> > > >> effectively.  Much of this can be addressed with better
> > > >> documentation and explanatory materials, and that should be an area
> > > >> of focus leading up to our 1.0 release.  However, the default
> > > >> partitioning behavior is suboptimal, and changing the default could
> > > >> lead to significantly less user confusion and frustration.
> > > >> Currently, when creating a new table, Kudu defaults to using only a
> > > >> single tablet, which is a known anti-pattern.  This can be painful
> > > >> for users who create a table assuming Kudu will have good defaults,
> > > >> and begin loading data only to find out later that they will need
> > > >> to recreate the table with partitioning to achieve good results.
> > > >>
> > > >> A better default partitioning strategy might be hash partitioning
> > > >> over the primary key columns, with a number of hash buckets based
> > > >> on the number of tablet servers (perhaps something like 3x the
> > > >> number of tablet servers).  This would alleviate the worst
> > > >> scalability issues with the current default, however it has a few
> > > >> downsides of its own.  Hash partitioning is not appropriate for
> > > >> every use case, and any rule-of-thumb number of tablets we could
> > > >> come up with will not always be optimal.
> > > >>
> > > >> Given that there is no bullet-proof default, and that changing
> > > >> partitioning strategy after table creation is impossible, and
> > > >> changing the default partitioning strategy is a backwards
> > > >> incompatible change, I propose we remove the default altogether.
> > > >> Users would be required to explicitly specify the table
> > > >> partitioning during creation, and failing to do so would result in
> > > >> an illegal argument error.  Users who really do want only a single
> > > >> tablet will still be able to do so by explicitly configuring range
> > > >> partitioning with no split rows.
> > > >>
> > > >> I'd like to get community feedback on whether this seems like a good
> > > >> direction to take.  I have put together a patch, you can check out
> the
> > > >> changes to test files to see what it looks like to add partitioning
> > > >> explicitly in cases where the default was being relied on.
> > > >> http://gerrit.cloudera.org:8080/#/c/3131/
> > > >>
> > > >> - Dan
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Abhi Basu
> > > >
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
