Posted to user@hbase.apache.org by Lior Schachter <li...@infolinks.com> on 2011/03/20 18:06:35 UTC

hash function per table

Hi,
What is the API or configuration for changing the default hash function for
a specific HTable?

thanks,
Lior

Re: hash function per table

Posted by Chris Tarnas <cf...@email.com>.

This question is fairly common on the list, for example:

http://search-hadoop.com/m/jusKg172GBC/timestamp+hash++key/v=threaded

-chris

On Mar 20, 2011, at 12:16 PM, Niels Nuyttens wrote:

> Hi guys,
> 
> this is an interesting discussion, please excuse me for hijacking it and
> posing an examplatory problem:
> 
> suppose one is getting data from monitoring devices. A composite key
> could be made using <date>_<monitoring_type>. Would this lead to
> hotspots? Could hashing then solve this problem, and won't I lose the
> advantage of being able to list my monitoring data chronologically?
> 
> Thanks in advance,
> 
> Niels
> 
> 
> On Sun, 2011-03-20 at 11:57 -0700, Chris Tarnas wrote:
>> There is none - HBase uses a total order partitioner. The straight key value itself determines which region a row is put into. This allows for very rapid scans of sequential data, among other things but does mean it is easier to hotspot regions. Key design is very important.
>> 
>> -chris
>> 
>> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
>> 
>>> the hash function that distributes the rows between the regions.
>>> 
>>> On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
>>> 
>>>> Hash?  Which hash are you referring to sir?
>>>> St.Ack
>>>> 
>>>> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
>>>> wrote:
>>>>> Hi,
>>>>> What is the API or configuration for changing the default hash function
>>>> for
>>>>> a specific htable.
>>>>> 
>>>>> thanks,
>>>>> Lior
>>>>> 
>>>> 
>> 
> 
>

Re: hash function per table

Posted by Niels Nuyttens <ni...@gmail.com>.

Hi guys,

this is an interesting discussion; please excuse me for hijacking it and
posing an example problem:

Suppose one is getting data from monitoring devices. A composite key
could be made using <date>_<monitoring_type>. Would this lead to
hotspots? Could hashing then solve this problem, and won't I lose the
advantage of being able to list my monitoring data chronologically?
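
One way to picture the trade-off in this question: if a short hash-derived bucket prefix is put in front of <date>_<monitoring_type>, a chronological listing is still possible, but it takes one range scan per bucket plus a merge. The sketch below is a toy model over a sorted Python list, not HBase client code; NUM_BUCKETS, the "NN_" prefix layout, and all function names are assumptions made for the illustration.

```python
import heapq

NUM_BUCKETS = 4   # assumed bucket count; must match what was used at write time
PREFIX_LEN = 3    # length of the assumed "NN_" salt prefix


def bucket_ranges(start: bytes, stop: bytes):
    """Yield one (start_row, stop_row) pair per salt bucket."""
    for bucket in range(NUM_BUCKETS):
        prefix = f"{bucket:02d}_".encode()
        yield prefix + start, prefix + stop


def scan(sorted_rows, start_row: bytes, stop_row: bytes):
    """Stand-in for a range scan over a sorted list of row keys."""
    return [row for row in sorted_rows if start_row <= row < stop_row]


def chronological_scan(sorted_rows, start: bytes, stop: bytes):
    """Scan every bucket, strip the salt, and merge back into key order."""
    per_bucket = [
        [row[PREFIX_LEN:] for row in scan(sorted_rows, lo, hi)]
        for lo, hi in bucket_ranges(start, stop)
    ]
    # Each per-bucket list is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_bucket))
```

The cost is that every read fans out to all buckets, so the bucket count is a balance between write spread and read overhead.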

Thanks in advance,

Niels


On Sun, 2011-03-20 at 11:57 -0700, Chris Tarnas wrote:
> There is none - HBase uses a total order partitioner. The straight key value itself determines which region a row is put into. This allows for very rapid scans of sequential data, among other things but does mean it is easier to hotspot regions. Key design is very important.
> 
> -chris
> 
> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
> 
> > the hash function that distributes the rows between the regions.
> > 
> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
> > 
> >> Hash?  Which hash are you referring to sir?
> >> St.Ack
> >> 
> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
> >> wrote:
> >>> Hi,
> >>> What is the API or configuration for changing the default hash function
> >> for
> >>> a specific htable.
> >>> 
> >>> thanks,
> >>> Lior
> >>> 
> >> 
>

Re: hash function per table

Posted by Oleg Ruchovets <or...@gmail.com>.

Can you share more information about your tests?



   I still have a couple of issues that I don't understand:
    1)     public Scan setTimeRange(long minStamp, long maxStamp) vs. the
startKey/endKey approach: which is the better approach, and does one have a
significant difference in execution time compared to the other?
    2)     Suppose I am inserting data and try to distribute it across the
regions, and I create an index at the same time. Will the index help me
improve the scan process?




On Sun, Mar 20, 2011 at 10:03 PM, Pete Haidinyak <ja...@cox.net> wrote:

> I went through this discussion a month or so ago and came away with the
> opinion that you can either have an efficient load with random key but then
> have an inefficient 'scan' not using start and end rows, or have an
> inefficient import with sequential key and then scan using start and end
> rows.
>
> -Pete
>
>
>
> On Sun, 20 Mar 2011 12:52:24 -0700, Oleg Ruchovets <or...@gmail.com>
> wrote:
>
>> Actually discussion started from this post:
>>
>>
>>
>> http://search-hadoop.com/m/XX3nW68JsY1/hbase+insertion+optimisation&subj=hbase+insertion+optimisation+
>>
>> Simply inserting the data in which row key <date>_<somedata> I noticed
>> that
>> only one node works (region to which data were writing). In case we have
>> 10-15 nodes I think it is inefficient to write data to only one region. I
>> want to get an effect that data will be inserted to  as much as possible
>> nodes  simultaneously. Correct me guys ,  but in this case  writing job
>> will take less time , am I write?
>>
>> Oleg.
>>
>> On Sun, Mar 20, 2011 at 8:57 PM, Chris Tarnas <cf...@email.com> wrote:
>>
>>> There is none - HBase uses a total order partitioner. The straight key
>>> value itself determines which region a row is put into. This allows for
>>> very
>>> rapid scans of sequential data, among other things but does mean it is
>>> easier to hotspot regions. Key design is very important.
>>>
>>> -chris
>>>
>>> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
>>>
>>> > the hash function that distributes the rows between the regions.
>>> >
>>> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
>>> >
>>> >> Hash?  Which hash are you referring to sir?
>>> >> St.Ack
>>> >>
>>> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <liors@infolinks.com
>>> >
>>> >> wrote:
>>> >>> Hi,
>>> >>> What is the API or configuration for changing the default hash
>>> function
>>> >> for
>>> >>> a specific htable.
>>> >>>
>>> >>> thanks,
>>> >>> Lior
>>> >>>
>>> >>
>>>
>>>
>

Re: hash function per table

Posted by Andrew Purtell <ap...@apache.org>.

Or use a bulk load process to import sequential data as new stores all in one shot.

   - Andy


--- On Sun, 3/20/11, Pete Haidinyak <ja...@cox.net> wrote:

> From: Pete Haidinyak <ja...@cox.net>
> Subject: Re: hash function per table
> To: user@hbase.apache.org
> Date: Sunday, March 20, 2011, 1:03 PM
> I went through this discussion a month or so ago and came away with the
> opinion that you can either have an efficient load with random key but
> then have an inefficient 'scan' not using start and end rows, or have an
> inefficient import with sequential key and then scan using start and end
> rows.
> 
> -Pete
> 
> 
> On Sun, 20 Mar 2011 12:52:24 -0700, Oleg Ruchovets <or...@gmail.com>
> wrote:
> 
> > Actually discussion started from this post:
> > 
> > 
> > http://search-hadoop.com/m/XX3nW68JsY1/hbase+insertion+optimisation&subj=hbase+insertion+optimisation+
> > 
> > Simply inserting the data in which row key <date>_<somedata> I noticed that
> > only one node works (region to which data were writing). In case we have
> > 10-15 nodes I think it is inefficient to write data to only one region. I
> > want to get an effect that data will be inserted to  as much as possible
> > nodes  simultaneously. Correct me guys ,  but in this case  writing job
> > will take less time , am I write?
> > 
> > Oleg.
> > 
> > On Sun, Mar 20, 2011 at 8:57 PM, Chris Tarnas <cf...@email.com> wrote:
> > 
> >> There is none - HBase uses a total order partitioner. The straight key
> >> value itself determines which region a row is put into. This allows for very
> >> rapid scans of sequential data, among other things but does mean it is
> >> easier to hotspot regions. Key design is very important.
> >> 
> >> -chris
> >> 
> >> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
> >> 
> >> > the hash function that distributes the rows between the regions.
> >> >
> >> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
> >> >
> >> >> Hash?  Which hash are you referring to sir?
> >> >> St.Ack
> >> >>
> >> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
> >> >> wrote:
> >> >>> Hi,
> >> >>> What is the API or configuration for changing the default hash function
> >> >> for
> >> >>> a specific htable.
> >> >>>
> >> >>> thanks,
> >> >>> Lior
> >> >>>
> >> >>
> >> 
> > 
>

Re: hash function per table

Posted by Lior Schachter <li...@infolinks.com>.

What's the performance penalty when scanning with a row prefix filter instead
of with a start/end key?
Can it still work (in reasonable processing time) when the table contains
billions of records?
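
A general observation that may help frame this question: when keys are kept in sorted order, a pure prefix match can be rewritten as a start/end key range, so the two approaches need not be alternatives. The start key is the prefix itself and the end key is the prefix with its last byte incremented. The helper below is a hypothetical sketch in plain Python, not an HBase API; the convention that an empty end key means "scan to the end of the table" is an assumption of the example.

```python
def prefix_to_range(prefix: bytes):
    """Return (start_row, stop_row) covering exactly the keys that start with prefix.

    An empty stop_row is used here to mean "no upper bound".
    """
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Prefix is empty or all 0xff: every key from the prefix onward matches.
        return prefix, b""
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop
```

With a range like this, a scan can skip directly to the first matching key and stop at the first non-matching one, whereas a filter applied on its own generally has to look at rows it then discards.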




On Sun, Mar 20, 2011 at 10:03 PM, Pete Haidinyak <ja...@cox.net> wrote:

> I went through this discussion a month or so ago and came away with the
> opinion that you can either have an efficient load with random key but then
> have an inefficient 'scan' not using start and end rows, or have an
> inefficient import with sequential key and then scan using start and end
> rows.
>
> -Pete
>
>
>
> On Sun, 20 Mar 2011 12:52:24 -0700, Oleg Ruchovets <or...@gmail.com>
> wrote:
>
>> Actually discussion started from this post:
>>
>>
>>
>> http://search-hadoop.com/m/XX3nW68JsY1/hbase+insertion+optimisation&subj=hbase+insertion+optimisation+
>>
>> Simply inserting the data in which row key <date>_<somedata> I noticed
>> that
>> only one node works (region to which data were writing). In case we have
>> 10-15 nodes I think it is inefficient to write data to only one region. I
>> want to get an effect that data will be inserted to  as much as possible
>> nodes  simultaneously. Correct me guys ,  but in this case  writing job
>> will take less time , am I write?
>>
>> Oleg.
>>
>> On Sun, Mar 20, 2011 at 8:57 PM, Chris Tarnas <cf...@email.com> wrote:
>>
>>> There is none - HBase uses a total order partitioner. The straight key
>>> value itself determines which region a row is put into. This allows for
>>> very
>>> rapid scans of sequential data, among other things but does mean it is
>>> easier to hotspot regions. Key design is very important.
>>>
>>> -chris
>>>
>>> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
>>>
>>> > the hash function that distributes the rows between the regions.
>>> >
>>> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
>>> >
>>> >> Hash?  Which hash are you referring to sir?
>>> >> St.Ack
>>> >>
>>> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <liors@infolinks.com
>>> >
>>> >> wrote:
>>> >>> Hi,
>>> >>> What is the API or configuration for changing the default hash
>>> function
>>> >> for
>>> >>> a specific htable.
>>> >>>
>>> >>> thanks,
>>> >>> Lior
>>> >>>
>>> >>
>>>
>>>
>

Re: hash function per table

Posted by Pete Haidinyak <ja...@cox.net>.

I went through this discussion a month or so ago and came away with the
opinion that you can either have an efficient load with random keys but
then an inefficient 'scan' that does not use start and end rows, or have an
inefficient import with sequential keys and then scan using start and end
rows.

-Pete


On Sun, 20 Mar 2011 12:52:24 -0700, Oleg Ruchovets <or...@gmail.com>  
wrote:

> Actually discussion started from this post:
>
>
> http://search-hadoop.com/m/XX3nW68JsY1/hbase+insertion+optimisation&subj=hbase+insertion+optimisation+
>
> Simply inserting the data in which row key <date>_<somedata> I noticed  
> that
> only one node works (region to which data were writing). In case we have
> 10-15 nodes I think it is inefficient to write data to only one region. I
> want to get an effect that data will be inserted to  as much as possible
> nodes  simultaneously. Correct me guys ,  but in this case  writing job
> will take less time , am I write?
>
> Oleg.
>
> On Sun, Mar 20, 2011 at 8:57 PM, Chris Tarnas <cf...@email.com> wrote:
>
>> There is none - HBase uses a total order partitioner. The straight key
>> value itself determines which region a row is put into. This allows for  
>> very
>> rapid scans of sequential data, among other things but does mean it is
>> easier to hotspot regions. Key design is very important.
>>
>> -chris
>>
>> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
>>
>> > the hash function that distributes the rows between the regions.
>> >
>> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
>> >
>> >> Hash?  Which hash are you referring to sir?
>> >> St.Ack
>> >>
>> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter  
>> <li...@infolinks.com>
>> >> wrote:
>> >>> Hi,
>> >>> What is the API or configuration for changing the default hash  
>> function
>> >> for
>> >>> a specific htable.
>> >>>
>> >>> thanks,
>> >>> Lior
>> >>>
>> >>
>>

Re: hash function per table

Posted by Oleg Ruchovets <or...@gmail.com>.

Actually, the discussion started from this post:

http://search-hadoop.com/m/XX3nW68JsY1/hbase+insertion+optimisation&subj=hbase+insertion+optimisation+

When simply inserting data with the row key <date>_<somedata>, I noticed that
only one node works (the one hosting the region the data is written to). With
10-15 nodes, I think it is inefficient to write data to only one region. I
want the data to be inserted into as many nodes as possible
simultaneously. Correct me, guys, but in that case the write job
will take less time, am I right?
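
A common way to get the effect described here is to "salt" the row key: derive a small bucket number from the key and put it in front, so rows with the same date spread over several key ranges. The sketch below only builds the key bytes and does not talk to HBase; NUM_BUCKETS, the two-digit prefix layout, and the use of CRC32 as the hash are all assumptions chosen for the example.

```python
import zlib

NUM_BUCKETS = 16  # assumed; often chosen near the number of region servers


def salted_key(date: str, data: str) -> bytes:
    """Build '<bucket>_<date>_<data>' where bucket is derived from the natural key."""
    natural = f"{date}_{data}".encode()
    # Deterministic hash, so a reader can recompute the bucket for a known key.
    bucket = zlib.crc32(natural) % NUM_BUCKETS
    return f"{bucket:02d}_".encode() + natural
```

Because the bucket is computed from the key itself, a point lookup can rebuild the full salted key, but a scan over a date range has to visit every bucket.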

Oleg.

On Sun, Mar 20, 2011 at 8:57 PM, Chris Tarnas <cf...@email.com> wrote:

> There is none - HBase uses a total order partitioner. The straight key
> value itself determines which region a row is put into. This allows for very
> rapid scans of sequential data, among other things but does mean it is
> easier to hotspot regions. Key design is very important.
>
> -chris
>
> On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:
>
> > the hash function that distributes the rows between the regions.
> >
> > On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
> >
> >> Hash?  Which hash are you referring to sir?
> >> St.Ack
> >>
> >> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
> >> wrote:
> >>> Hi,
> >>> What is the API or configuration for changing the default hash function
> >> for
> >>> a specific htable.
> >>>
> >>> thanks,
> >>> Lior
> >>>
> >>
>
>

Re: hash function per table

Posted by Chris Tarnas <cf...@email.com>.

There is none - HBase uses a total order partitioner. The straight key value itself determines which region a row is put into. This allows for very rapid scans of sequential data, among other things, but it does mean it is easier to hotspot regions. Key design is very important.
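
As a rough illustration of what total ordering implies (a toy model, not HBase code): each region can be thought of as a contiguous range of the sorted key space, so locating the region for a row is a binary search over the region boundaries rather than a hash. The split points and the region_for helper below are made up for the example.

```python
import bisect

# Hypothetical split points. Region 0 holds keys below SPLITS[0]; region i holds
# keys in [SPLITS[i-1], SPLITS[i]); the last region holds keys >= SPLITS[-1].
SPLITS = [b"2011-01", b"2011-02", b"2011-03"]


def region_for(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(SPLITS, row_key)
```

In this model every key that starts with today's date falls into the same last region, which is the hotspot pattern for date-prefixed keys.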

-chris

On Mar 20, 2011, at 11:41 AM, Lior Schachter wrote:

> the hash function that distributes the rows between the regions.
> 
> On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:
> 
>> Hash?  Which hash are you referring to sir?
>> St.Ack
>> 
>> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
>> wrote:
>>> Hi,
>>> What is the API or configuration for changing the default hash function
>> for
>>> a specific htable.
>>> 
>>> thanks,
>>> Lior
>>> 
>>

Re: hash function per table

Posted by Lior Schachter <li...@infolinks.com>.

The hash function that distributes the rows between the regions.

On Sun, Mar 20, 2011 at 8:36 PM, Stack <st...@duboce.net> wrote:

> Hash?  Which hash are you referring to sir?
> St.Ack
>
> On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com>
> wrote:
> > Hi,
> > What is the API or configuration for changing the default hash function
> for
> > a specific htable.
> >
> > thanks,
> > Lior
> >
>

Re: hash function per table

Posted by Stack <st...@duboce.net>.

Hash? Which hash are you referring to, sir?
St.Ack

On Sun, Mar 20, 2011 at 10:06 AM, Lior Schachter <li...@infolinks.com> wrote:
> Hi,
> What is the API or configuration for changing the default hash function for
> a specific htable.
>
> thanks,
> Lior
>