Posted to user@hbase.apache.org by Asaf Mesika <as...@gmail.com> on 2013/11/01 08:25:25 UTC

Re: row filter - binary comparator at certain range

Bucket seems like a rather good name for it. The method for generating it
could be a hash, a running sequence modded by the bucket count, etc. So
HashBucket, RoundRobinBucket, etc.
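
For illustration, a minimal sketch of what those two generators might look
like (the class and method names are just placeholders, not an existing API):

    import java.util.concurrent.atomic.AtomicLong;

    public final class Buckets {

        // HashBucket: derive the bucket from a hash of the row key, modded by
        // the bucket count, so the same key always lands in the same bucket.
        public static int hashBucket(byte[] rowKey, int numBuckets) {
            int h = 1;
            for (byte b : rowKey) {
                h = 31 * h + b;                    // simple rolling hash
            }
            return (h & 0x7FFFFFFF) % numBuckets;  // drop the sign bit, then mod
        }

        // RoundRobinBucket: a running sequence modded by the bucket count.
        // Spreads writes evenly, but a key's bucket is not derivable from the
        // key alone, so reads have to check every bucket.
        private static final AtomicLong SEQ = new AtomicLong();

        public static int roundRobinBucket(int numBuckets) {
            return (int) (SEQ.getAndIncrement() % numBuckets);
        }
    }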

On Tuesday, October 22, 2013, James Taylor wrote:

> One thing I neglected to mention is that the table is pre-split at the
> "prepending-row-key-with-single-hashed-byte" boundaries, so the expectation
> is that you'd allocate enough buckets that you don't end up needing to
> split the regions. But if you under-allocate (i.e. allocate too small a
> SALT_BUCKETS value), then I see your point.
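>
> For concreteness, a rough sketch of that kind of pre-split using the plain
> HBase admin API (an illustration only, not Phoenix's actual code; the table
> name, column family, and bucket count below are made-up placeholders):
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HColumnDescriptor;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.TableName;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>     public class PreSplitSaltedTable {
>         public static void main(String[] args) throws Exception {
>             int saltBuckets = 16;  // declared upper bound, sized to the cluster
>             byte[][] splits = new byte[saltBuckets - 1][];
>             for (int i = 1; i < saltBuckets; i++) {
>                 // region boundary at each possible leading bucket byte
>                 splits[i - 1] = new byte[] { (byte) i };
>             }
>             HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>             HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("EVENTS"));
>             desc.addFamily(new HColumnDescriptor("d"));
>             admin.createTable(desc, splits);  // one region per bucket from the start
>             admin.close();
>         }
>     }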
>
> Thanks,
> James
>
>
> On Mon, Oct 21, 2013 at 5:58 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
> > James,
> >
> > It's evenly distributed, however... because it's a timestamp, it's a
> > 'tail-end Charlie' addition: each bucket only ever gets appended to at its
> > end. So when you split a region, the top half is never added to again, so
> > you end up with all regions half filled except for the last region for
> > each 'modded' value.
> >
> > I wouldn't say it's a bad thing if you plan for it.
> >
> > On Oct 21, 2013, at 5:07 PM, James Taylor <jt...@salesforce.com>
> wrote:
> >
> > > We don't truncate the hash, we mod it. Why would you expect that data
> > > wouldn't be evenly distributed? We've not seen this to be the case.
> > >
> > >
> > >
> > > On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> > >
> > >> What do you call hashing the row key?
> > >> Or hashing the row key and then appending the row key to the hash?
> > >> Or hashing the row key, truncating the hash value to some subset and
> > >> then appending the row key to the value?
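> > >>
> > >> For concreteness, a rough sketch of those three variants (Bytes is the
> > >> HBase utility class; everything else here is made up for illustration):
> > >>
> > >>     import java.util.Arrays;
> > >>     import org.apache.hadoop.hbase.util.Bytes;
> > >>
> > >>     public class KeyVariants {
> > >>         static final int NUM_BUCKETS = 16;  // allowed prefix values
> > >>
> > >>         // 1. Hash the row key: the key is replaced entirely by its hash.
> > >>         static byte[] hashOnly(byte[] rowKey) {
> > >>             return Bytes.toBytes(Arrays.hashCode(rowKey));
> > >>         }
> > >>
> > >>         // 2. Hash the row key, then append the original key to the hash.
> > >>         static byte[] hashThenKey(byte[] rowKey) {
> > >>             return Bytes.add(hashOnly(rowKey), rowKey);
> > >>         }
> > >>
> > >>         // 3. Truncate the hash down to one of NUM_BUCKETS values and
> > >>         //    prepend that single byte to the key.
> > >>         static byte[] truncatedHashThenKey(byte[] rowKey) {
> > >>             int bucket = (Arrays.hashCode(rowKey) & 0x7FFFFFFF) % NUM_BUCKETS;
> > >>             return Bytes.add(new byte[] { (byte) bucket }, rowKey);
> > >>         }
> > >>     }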
> > >>
> > >> The problem is that there is a specific meaning to the term 'salt'.
> > >> Re-using it here will cause confusion because you're implying something
> > >> you don't mean to imply.
> > >>
> > >> You could say "prepend a truncated hash of the key", however… is
> > >> 'prepend' a real word? ;-) (I am sorry, I am not a grammar nazi, nor an
> > >> English major.)
> > >>
> > >> So even outside of Phoenix, the concept is the same.
> > >> Even with a truncated hash, you will find that over time, all but the
> > >> tail N regions will only be half full.
> > >> This could be both good and bad.
> > >>
> > >> (Where N is your number of allowable hash values, e.g. 8 or 16.)
> > >>
> > >> You've potentially solved one problem… but you still have other issues
> > >> that you need to address.
> > >> I guess the simple answer is to double the region size and not care that
> > >> most of your regions will be 1/2 the max size… that's the size you really
> > >> want, and the 8-16 tail regions will be up to twice as big.
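> > >>
> > >> (In code that could be a per-table max file size via the table
> > >> descriptor; a rough sketch only, the table name and target size below
> > >> are made up, and cluster-wide you'd tune hbase.hregion.max.filesize
> > >> instead:)
> > >>
> > >>     import org.apache.hadoop.hbase.HTableDescriptor;
> > >>     import org.apache.hadoop.hbase.TableName;
> > >>
> > >>     public class RegionSizing {
> > >>         public static void main(String[] args) {
> > >>             // The size you really want a region to be once it settles
> > >>             // at roughly half full.
> > >>             long desiredRegionBytes = 10L * 1024 * 1024 * 1024;
> > >>             HTableDescriptor desc =
> > >>                 new HTableDescriptor(TableName.valueOf("EVENTS"));
> > >>             // Double the split threshold; only the N tail regions will
> > >>             // actually grow toward this max.
> > >>             desc.setMaxFileSize(2 * desiredRegionBytes);
> > >>         }
> > >>     }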
> > >>
> > >>
> > >>
> > >> On Oct 21, 2013, at 3:26 PM, James Taylor <jt...@salesforce.com>
> > wrote:
> > >>
> > >>> What do you think it should be called, because
> > >>> "prepending-row-key-with-single-hashed-byte" doesn't have a very good
> > >>> ring to it. :-)
> > >>>
> > >>> Agree that getting the row key design right is crucial.
> > >>>
> > >>> The range of "prepending-row-key-with-single-hashed-byte" is
> > >>> declarative when you create your table in Phoenix, so you typically
> > >>> declare an upper bound based on your cluster size (not 255, but maybe
> > >>> 8 or 16). We've run the numbers and it's typically faster, but as with
> > >>> most things, not always.
> > >>>
> > >>> HTH,
> > >>> James
> > >>>
> > >>>
> > >>> On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> > >>>
> > >>>> Then it's not a SALT. And please don't use the term 'salt' because it
> > >>>> has a specific meaning beyond what you want it to mean. Just like
> > >>>> saying HBase has ACID because you write the entire row as an atomic
> > >>>> element. But I digress….
> > >>>>
> > >>>> Ok so to your point…
> > >>>>
> > >>>> 1 byte == 255 possible values.
> > >>>>
> > >>>> So which will be faster:
> > >>>>
> > >>>> creating a list of the 1-byte truncated hash of each possible timestamp
> > >>>> in your range, or doing 255 separate range scans with the start and stop
> > >>>> range keys set?
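> > >>>>
> > >>>> (For the second option, roughly: one client-side Scan per possible
> > >>>> prefix byte, each with its start/stop row set to that byte plus the
> > >>>> serialized timestamp bound. A sketch only; the bucket count and the
> > >>>> timestamp encoding are whatever your key design actually uses.)
> > >>>>
> > >>>>     import java.util.ArrayList;
> > >>>>     import java.util.List;
> > >>>>     import org.apache.hadoop.hbase.client.Scan;
> > >>>>     import org.apache.hadoop.hbase.util.Bytes;
> > >>>>
> > >>>>     public class BucketScans {
> > >>>>         static List<Scan> build(int numBuckets, long startTs, long stopTs) {
> > >>>>             List<Scan> scans = new ArrayList<Scan>();
> > >>>>             for (int b = 0; b < numBuckets; b++) {
> > >>>>                 byte[] prefix = new byte[] { (byte) b };
> > >>>>                 byte[] startRow = Bytes.add(prefix, Bytes.toBytes(startTs));
> > >>>>                 byte[] stopRow  = Bytes.add(prefix, Bytes.toBytes(stopTs));
> > >>>>                 scans.add(new Scan(startRow, stopRow));  // [start, stop) per bucket
> > >>>>             }
> > >>>>             return scans;
> > >>>>         }
> > >>>>     }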
> > >>>>
> > >>>> Th