Posted to user@hbase.apache.org by Pal Konyves <pa...@gmail.com> on 2013/04/20 19:52:08 UTC

default region splitting on which value?

Hi,

I am just reading about region splitting. By default, as I understand it,
HBase handles splitting the regions itself. I just don't know how to picture
which key it splits a region on.

1) For example, when I write MD5 hashes as row keys, they are most probably
evenly distributed from 000000... to FFFFF..., right? When HBase starts with
one region, all the writes go into that region, and when the HFile gets too
big, does it just take, for example, the median value of the stored keys and
split the region at that point?

2) I want to bulk load tons of data with the HBase Java client API Put
operations, and I want it to perform well. My keys are sequential numeric
values, which I know from this post I cannot load into HBase in sequential
order, because the HBase table is going to be sad:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
So I thought I would pre-split the table into regions and load the data in
randomized order. This way I should get a good distribution of network I/O
among the region servers from the beginning. Is that a good idea? (A rough
sketch of what I mean follows question 3 below.)

3) If my row keys are not evenly distributed in the keyspace but show some
peaks or bursts, e.g. the keyspace is 000-999 but most keys cluster around
the 020 and 060 values, is it a good idea to place the pre-split points at
those peaks?
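
To make questions 2 and 3 concrete, here is roughly what I have in mind for
the pre-splitting. This is only a minimal sketch against the 0.94-era Java
admin API; the table name, family name, and split points are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("routes"); // example table name
    desc.addFamily(new HColumnDescriptor("d"));             // single column family

    // Extra split points around the expected hot spots (020 and 060),
    // plus a few coarser ones for the rest of the 000-999 keyspace.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("010"), Bytes.toBytes("020"), Bytes.toBytes("030"),
        Bytes.toBytes("050"), Bytes.toBytes("060"), Bytes.toBytes("070"),
        Bytes.toBytes("250"), Bytes.toBytes("500"), Bytes.toBytes("750")
    };
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}

The idea is simply that no single region should take most of the initial write load.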

Thanks in advance,
Pal

Re: default region splitting on which value?

Posted by Pal Konyves <pa...@gmail.com>.
Hi, no, the time is not important, only the stops are.


On Sun, Apr 21, 2013 at 3:34 AM, Ted Yu <yu...@gmail.com> wrote:

> Thanks for sharing the information below.
>
> How do you plan to store the time (when the bus gets to each stop) in the row?
> Or maybe it is not important to you?

Re: default region splitting on which value?

Posted by Ted Yu <yu...@gmail.com>.
Thanks for sharing the information below.

How do you plan to store the time (when the bus gets to each stop) in the row?
Or maybe it is not important to you?

On Sat, Apr 20, 2013 at 2:24 PM, Pal Konyves <pa...@gmail.com> wrote:

> I am writing a paper for school about HBase, so the data I chose is not a
> real-world example. I am familiar with GTFS, which is a de facto standard
> for storing public transportation schedules: when a vehicle arrives at a
> stop and where it goes next.
>
> I chose to generate the rows on the fly, where each row represents a
> sequence of 'bus' stops that make up a route from the first stop to the
> last stop,
> e.g.: [first_stop_id,last_stop_id],string_sequence_of_stops
> where the part within the [...] is the row key.
>
> So long story short, I generate the data. I want to use the HBase Java
> client API to store the rows with Put. I plan to randomize the load by
> picking random first_stop_id-s and using multiple threads.
>
> The row keys will still be somewhat sequential, because the way I generate
> the rows outputs about 100-1000 rows starting with the same first_stop_id
> within the row key. The total number of rows will be in the billions and
> would take up about 1 TB.

Re: default region splitting on which value?

Posted by Pal Konyves <pa...@gmail.com>.
I am writing a paper for school about HBase, so the data I chose is not a
real-world example. I am familiar with GTFS, which is a de facto standard
for storing public transportation schedules: when a vehicle arrives at a
stop and where it goes next.

I chose to generate the rows on the fly, where each row represents a
sequence of 'bus' stops that make up a route from the first stop to the
last stop,
e.g.: [first_stop_id,last_stop_id],string_sequence_of_stops
where the part within the [...] is the row key.

So long story short, I generate the data. I want to use the HBase Java
client API to store the rows with Put. I plan to randomize the load by
picking random first_stop_id-s and using multiple threads.

The row keys will still be somewhat sequential, because the way I generate
the rows outputs about 100-1000 rows starting with the same first_stop_id
within the row key. The total number of rows will be in the billions and
would take up about 1 TB.
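
In code, the load I have in mind looks roughly like the sketch below. The
table and family names, the thread count, and the way last_stop_id and the
stop sequence are derived are all placeholders, and each thread opens its own
HTable because HTable is not thread-safe:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedLoader {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();

    // Shuffle the first_stop_ids so the writes do not arrive in key order.
    final List<Integer> firstStops = new ArrayList<Integer>();
    for (int i = 0; i < 1000; i++) {
      firstStops.add(i);
    }
    Collections.shuffle(firstStops);

    final int nThreads = 8;
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    for (int t = 0; t < nThreads; t++) {
      final int offset = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTable table = new HTable(conf, "routes"); // one HTable per thread
            for (int i = offset; i < firstStops.size(); i += nThreads) {
              int first = firstStops.get(i);
              int last = (first + 7) % 1000;           // made-up last_stop_id
              // Zero-padded, fixed-width row key: first_stop_id,last_stop_id
              byte[] row = Bytes.toBytes(String.format("%03d,%03d", first, last));
              Put put = new Put(row);
              put.add(Bytes.toBytes("d"), Bytes.toBytes("stops"),
                  Bytes.toBytes("s1>s2>s3"));          // placeholder stop sequence
              table.put(put);
            }
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }
}

With the table pre-split, the shuffled keys should spread the Puts across all
regions from the start.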


On Sat, Apr 20, 2013 at 10:54 PM, Ted Yu <yu...@gmail.com> wrote:

> The answer to your first question is yes: the midkey of the key range
> would be chosen as the split key.
>
> For #2, can you tell us how you plan to randomize the loading?
> Bulk load normally means preparing HFiles, which would then be loaded
> directly into your table.
>
> Cheers

Re: default region splitting on which value?

Posted by Ted Yu <yu...@gmail.com>.
The answer to your first question is yes: the midkey of the key range would be chosen as the split key.

For #2, can you tell us how you plan to randomize the loading?
Bulk load normally means preparing HFiles, which would then be loaded directly into your table.
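
For reference, preparing the HFiles with the MapReduce tooling looks roughly
like the sketch below. It uses 0.94-era class names, which differ in newer
HBase versions, and the input/output paths, table name, and mapper are
placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PrepareHFiles {

  // Hypothetical mapper: parses "first,last<TAB>stop_sequence" text lines into KeyValues.
  public static class RouteMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable pos, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("d"),
          Bytes.toBytes("stops"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "prepare-hfiles");
    job.setJarByClass(PrepareHFiles.class);
    job.setMapperClass(RouteMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path("/staging/routes-input"));
    FileOutputFormat.setOutputPath(job, new Path("/staging/routes-hfiles"));

    // Sorts the map output and partitions it along the table's current region
    // boundaries, so one HFile is produced per region.
    HTable table = new HTable(conf, "routes");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The generated HFiles are then moved into the table with the completebulkload
tool (LoadIncrementalHFiles), which bypasses the normal write path.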

Cheers

On Apr 20, 2013, at 1:11 PM, Pal Konyves <pa...@gmail.com> wrote:

> Hi Ted,
> Only one family; my data is very simple key-value. However, I want to do
> sequential scans, so hashing the key is not an option.

Re: default region splitting on which value?

Posted by Pal Konyves <pa...@gmail.com>.
Hi Ted,
Only one family; my data is very simple key-value. However, I want to do
sequential scans, so hashing the key is not an option.
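
Just to illustrate the kind of sequential scan I mean, a sketch (the table and
family names and the 020 prefix are made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanRoutesFromStop {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "routes");

    // All routes that start at stop 020: scan the contiguous key range [020, 021).
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("020"));
    scan.setStopRow(Bytes.toBytes("021"));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Hashing the row keys would scatter the routes for a given first_stop_id across
the whole keyspace, which is why a hash is not an option for me.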



On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <yu...@gmail.com> wrote:

> How many column families do you have?
>
> For #3, pre-splitting the table at the row keys corresponding to the peaks
> makes sense.

Re: default region splitting on which value?

Posted by Ted Yu <yu...@gmail.com>.
How many column families do you have?

For #3, pre-splitting the table at the row keys corresponding to the peaks makes sense.
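
One way to come up with those split keys, just as a sketch and not something
from this thread: sort a sample of the keys you intend to load and take evenly
spaced quantiles, so that dense areas such as 020 and 060 automatically get
more, narrower regions.

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitKeyChooser {

  /**
   * Picks (numRegions - 1) split keys as evenly spaced quantiles of a sample
   * of row keys. Dense parts of the keyspace end up with more regions.
   * Duplicate split keys must be removed before creating the table.
   */
  public static byte[][] chooseSplits(List<String> sampleKeys, int numRegions) {
    Collections.sort(sampleKeys);
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] = Bytes.toBytes(sampleKeys.get(i * sampleKeys.size() / numRegions));
    }
    return splits;
  }
}

The resulting array can then be passed to createTable(tableDescriptor, splits).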

On Apr 20, 2013, at 10:52 AM, Pal Konyves <pa...@gmail.com> wrote:

> Hi,
> 
> I am just reading about region splitting. By default, as I understand it,
> HBase handles splitting the regions itself. I just don't know how to picture
> which key it splits a region on.
>
> 1) For example, when I write MD5 hashes as row keys, they are most probably
> evenly distributed from 000000... to FFFFF..., right? When HBase starts with
> one region, all the writes go into that region, and when the HFile gets too
> big, does it just take, for example, the median value of the stored keys and
> split the region at that point?
>
> 2) I want to bulk load tons of data with the HBase Java client API Put
> operations, and I want it to perform well. My keys are sequential numeric
> values, which I know from this post I cannot load into HBase in sequential
> order, because the HBase table is going to be sad:
> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
> So I thought I would pre-split the table into regions and load the data in
> randomized order. This way I should get a good distribution of network I/O
> among the region servers from the beginning. Is that a good idea?
>
> 3) If my row keys are not evenly distributed in the keyspace but show some
> peaks or bursts, e.g. the keyspace is 000-999 but most keys cluster around
> the 020 and 060 values, is it a good idea to place the pre-split points at
> those peaks?
> 
> Thanks in advance,
> Pal