Posted to user@hbase.apache.org by Sachin Jain <sa...@gmail.com> on 2016/11/28 08:42:01 UTC

Creating HBase table with presplits

Hi,

I was going through the article on pre-splitting a table [0], which says
that it is generally best practice to presplit your table. But don't we
need to know the data in advance in order to presplit it?

Question: What is the best practice when we don't know what data is going
to be inserted into HBase? Essentially, I don't know the key range, so if I
specify the wrong splits, then either the first or the last split can
become a hot region in my system.

[0]: https://hbase.apache.org/book.html#rowkey.regionsplits

Thanks
-Sachin

Re: Creating HBase table with presplits

Posted by Sachin Jain <sa...@gmail.com>.

Thanks Saad!!

This is very similar to what I had planned to implement, i.e. to map your
keyspace to a known keyspace by using a hash algorithm like MD5, and then
split the table. Thanks once again!!


On Fri, Dec 2, 2016 at 7:18 PM, Saad Mufti <sa...@gmail.com> wrote:

> Forgot to mention in above example you would presplit into 1024 regions,
> starting from "0000" to "1023" (start keys).
>
> Cheers.
>
> ----
> Saad

Re: Creating HBase table with presplits

Posted by Saad Mufti <sa...@gmail.com>.

I forgot to mention: in the above example you would presplit into 1024
regions, with start keys running from "0000" to "1023".
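
A minimal sketch of how those split keys could be generated, assuming
4-digit zero-padded decimal prefixes as in the example. The function name
split_points is made up for illustration. Note that HBase gives the first
region an empty start key, so 1024 regions need only 1023 split points
("0001" to "1023"); the first region then covers everything below "0001",
which includes the "0000" prefix.

```python
def split_points(num_regions=1024, width=4):
    """Return the split keys for num_regions presplit regions.

    N regions need N - 1 split points, because the first region
    implicitly starts at the empty key.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    if num_regions > 10 ** width:
        raise ValueError("prefix width too small for this many regions")
    # Zero-padding keeps lexicographic order equal to numeric order.
    return [f"{i:0{width}d}".encode("ascii") for i in range(1, num_regions)]


splits = split_points()
print(len(splits), splits[0], splits[-1])
```

The resulting list is what you would hand to the table-creation call as
the split keys (for example, the splitKeys argument of Admin.createTable
in the Java client).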

Cheers.

----
Saad


On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti <sa...@gmail.com> wrote:

> One way to do this without knowing your data (you still need some idea of
> the size of the keyspace) is to prepend a fixed-width numeric prefix from
> a suitable range, derived from a good hash like MD5. For example, let us
> say you can predict your data will fit in about 1024 regions. You can
> decide to prepend a prefix from 0000 to 1023 to all your keys based on a
> suitable hash.
>
> The pros:
>
> 1. you get to pre-split without knowing your keyspace
> 2. very hard if not impossible for unknown data providers to send you data
> in some order that generates hotspots (unless of course the same key is
> repeated over and over, still have to watch out for that)
>
> The cons:
>
> 1. lose the ability to do scan in "natural" sorted order of your keyspace
> as that order is not preserved anymore in HBase
> 2. if you miscalculate your keyspace size by a lot, you are stuck with the
> hash function and range you selected even if you later get more regions
> unless you're willing to do complete migration to a new table
>
> Hope above helps.
>
> ----
> Saad

Re: Creating HBase table with presplits

Posted by Saad Mufti <sa...@gmail.com>.

One way to do this without knowing your data (you still need some idea of
the size of the keyspace) is to prepend a fixed-width numeric prefix from a
suitable range, derived from a good hash like MD5. For example, let us say
you can predict your data will fit in about 1024 regions. You can decide to
prepend a prefix from 0000 to 1023 to all your keys, computed from a
suitable hash of the key (for instance, the hash modulo 1024).
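
A minimal sketch of the prefixing step, assuming MD5 as the hash, 1024
buckets, and a 4-digit zero-padded decimal prefix. The name salted_key and
the choice of taking the first four digest bytes modulo the bucket count
are illustrative assumptions, not a prescribed scheme.

```python
import hashlib

NUM_BUCKETS = 1024   # assumed number of presplit regions
PREFIX_WIDTH = 4     # prefixes run from "0000" to "1023"


def salted_key(row_key, num_buckets=NUM_BUCKETS):
    """Prepend a bucket prefix derived from the MD5 of the original key."""
    digest = hashlib.md5(row_key).digest()
    # Use the first 4 digest bytes as an integer and fold into the buckets.
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    prefix = f"{bucket:0{PREFIX_WIDTH}d}".encode("ascii")
    return prefix + row_key


key = salted_key(b"user-42")
print(key[:PREFIX_WIDTH], key[PREFIX_WIDTH:])
```

The prefix depends only on the key itself, so a reader who knows the
original key can recompute the same prefix for a point lookup.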

The pros:

1. You get to pre-split without knowing your keyspace.
2. It is very hard, if not impossible, for unknown data providers to send
you data in some order that generates hotspots (unless, of course, the same
key is repeated over and over; you still have to watch out for that).

The cons:

1. You lose the ability to scan in the "natural" sorted order of your
keyspace, as that order is no longer preserved in HBase.
2. If you miscalculate your keyspace size by a lot, you are stuck with the
hash function and range you selected, even if you later get more regions,
unless you're willing to do a complete migration to a new table.

Hope above helps.

----
Saad

On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <sa...@gmail.com>
wrote:

> Thanks Dave for your suggestions!
> Will let you know if I find some approach to tackle this situation.
>
> Regards

Re: Creating HBase table with presplits

Posted by Sachin Jain <sa...@gmail.com>.

Thanks Dave for your suggestions!
Will let you know if I find some approach to tackle this situation.

Regards

On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <la...@davelink.net> wrote:

> If you truly have no way to predict anything about the distribution of your
> data across the row key space, then you are correct that there is no way to
> presplit your regions in an effective way.  Either you need to make some
> starting guess, such as a small number of uniform splits, or wait until you
> have some information about what the data will look like.
>
> Dave

Re: Creating HBase table with presplits

Posted by Dave Latham <la...@davelink.net>.

If you truly have no way to predict anything about the distribution of your
data across the row key space, then you are correct that there is no way to
presplit your regions in an effective way.  Either you need to make some
starting guess, such as a small number of uniform splits, or wait until you
have some information about what the data will look like.
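
As a rough illustration of the "small number of uniform splits" starting
guess, here is a sketch that divides the range of the first key byte
evenly. It assumes row keys are spread roughly uniformly over their first
byte, which is exactly the kind of guess that may turn out wrong; the
function name uniform_byte_splits is hypothetical.

```python
def uniform_byte_splits(num_regions):
    """Split the one-byte key prefix range 0x00-0xFF into equal parts."""
    if not 1 <= num_regions <= 256:
        raise ValueError("num_regions must be between 1 and 256")
    # N regions need N - 1 split points; the first region starts at
    # the empty key and the last region is open-ended.
    return [bytes([(i * 256) // num_regions]) for i in range(1, num_regions)]


print(uniform_byte_splits(4))
```

If the real keys turn out to be, say, ASCII digits only, most of these
regions would stay empty, which is why some knowledge of the data helps.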

Dave

On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <sa...@gmail.com>
wrote:

> Hi,
>
> I was going through the article on pre-splitting a table [0], which says
> that it is generally best practice to presplit your table. But don't we
> need to know the data in advance in order to presplit it?
>
> Question: What is the best practice when we don't know what data is going
> to be inserted into HBase? Essentially, I don't know the key range, so if
> I specify the wrong splits, then either the first or the last split can
> become a hot region in my system.
>
> [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
>
> Thanks
> -Sachin
>