Posted to user@hbase.apache.org by Software Dev <st...@gmail.com> on 2014/05/02 22:09:17 UTC

Questions on FuzzyRowFilter

I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
time series data (20140501, 20140502...).  We can prefix all of the
keys with 4 random bytes and then just skip these during scanning. Is
that correct? This *seems* like it will work, but I'm questioning the
performance of this even if it does work.

Also, is this available via the rest client, shell and/or thrift client?

Also, is there a FuzzyColumn equivalent of this feature?
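For context, FuzzyRowFilter matches each row key against a pattern plus a per-byte mask, where masked positions match any byte. A rough pure-Python sketch of that matching logic and of the 4-random-byte key layout described above (the key layout and helper names are illustrative, not the HBase client API):

```python
import os

def make_key(date: bytes) -> bytes:
    # 4 random prefix bytes spread writes across regions (avoids hot
    # spotting), at the cost of scattering logically adjacent dates.
    return os.urandom(4) + date

def fuzzy_match(row: bytes, pattern: bytes, mask: bytes) -> bool:
    # mask byte 1 = "any byte is fine here", 0 = "must equal pattern byte"
    # (this mirrors FuzzyRowFilter's fixed/non-fixed positions).
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))

pattern = b"\x00\x00\x00\x00" + b"20140501"   # prefix bytes are placeholders
mask = bytes([1, 1, 1, 1] + [0] * 8)          # ignore the 4 random bytes

assert fuzzy_match(make_key(b"20140501"), pattern, mask)
assert not fuzzy_match(make_key(b"20140502"), pattern, mask)
```

Note that nothing here narrows the scan range: every row still has to be checked against the mask, which is the performance concern raised above.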

Re: Questions on FuzzyRowFilter

Posted by James Taylor <jt...@salesforce.com>.
@Software Dev - if you use Phoenix, queries would leverage our Skip Scan
(which supports a superset of the FuzzyRowFilter perf improvements). Take a
look here:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Assuming a row key made up of a low cardinality first value (like a byte
representing an enum), followed by a high cardinality second value (like a
date/time value), you'd get a large benefit from the skip scan when you're
only looking at a small sliver of your time range.

Another option would be to create a secondary index over your date:
http://phoenix.incubator.apache.org/secondary_indexing.html

Thanks,
James
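As an illustration of why a skip scan helps with that key shape, here is a small Python sketch (the one-byte enum prefix and the helper are hypothetical, not Phoenix internals): a query on one date turns into one narrow sub-range per enum value instead of a full-table scan.

```python
# Hypothetical row key: 1 low-cardinality enum byte + yyyymmdd date bytes.
ENUM_VALUES = range(3)  # e.g. an event-type byte with 3 known values

def skip_scan_ranges(date: bytes):
    # One (start, stop) key pair per enum value; a scanner can seek
    # directly between these instead of reading the whole table.
    return [(bytes([e]) + date, bytes([e]) + date + b"\xff")
            for e in ENUM_VALUES]

ranges = skip_scan_ranges(b"20140429")
assert ranges[0][0] == b"\x0020140429"
assert len(ranges) == 3
```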


>> >> On May 16, 2014, at 7:50 PM, James Taylor <jt...@salesforce.com>
>> wrote:
>> >>
>> >>> Hi Mike,
>> >>> I agree with you - the way you've outlined is exactly the way Phoenix
>> has
>> >>> implemented it. It's a bit of a problem with terminology, though. We
>> call
>> >>> it salting: http://phoenix.incubator.apache.org/salted.html. We hash
>> the
>> >>> key, mod the hash with the SALT_BUCKET value you provide, and prepend
>> the
>> >>> row key with this single byte value. Maybe you can coin a good term
>> for
>> >>> this technique?
>> >>>
>> >>> FWIW, you don't lose the ability to do a range scan when you salt (or
>> >>> hash-the-key and mod by the number of "buckets"), but you do need to
>> run
>> >> a
>> >>> scan for each possible value of your salt byte (0 - SALT_BUCKET-1).
>> Then
>> >>> the client does a merge sort among these scans. It performs well.
>> >>>
>> >>> Thanks,
>> >>> James
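
The salting scheme James describes above can be sketched in a few lines of Python (the bucket count and helper names are illustrative): the prefix byte is a function of the key, so point lookups can recompute it, and a range scan becomes one scan per bucket merged client-side.

```python
import hashlib
from heapq import merge

SALT_BUCKETS = 8  # illustrative bucket count

def salt_byte(key: bytes) -> int:
    # Deterministic: hash the key, then mod by the bucket count.
    return hashlib.md5(key).digest()[0] % SALT_BUCKETS

def salted_key(key: bytes) -> bytes:
    return bytes([salt_byte(key)]) + key

# A get() still works because the prefix is recomputable from the key.
assert salted_key(b"20140501") == salted_key(b"20140501")

# A range scan: run one scan per bucket (each returns rows in key order),
# then merge-sort the streams on the client. Simulated here with lists.
dates = [b"2014050%d" % i for i in range(1, 8)]
buckets = {}
for d in dates:
    buckets.setdefault(salt_byte(d), []).append(d)
scans = [sorted(rows) for rows in buckets.values()]
assert list(merge(*scans)) == sorted(dates)
```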
>> >>>
>> >>>
>> >>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel <
>> >> michael_segel@hotmail.com>wrote:
>> >>>
>> >>>> 3+ Years on and a bad idea is being propagated again.
>> >>>>
>> >>>> Now repeat after me… DO NOT USE A SALT.
>> >>>>
>> >>>> Having a low sodium diet, especially for HBase is really good for
>> your
>> >>>> health and sanity.
>> >>>>
>> >>>> The salt is going to be orthogonal to the row key (Key).
>> >>>> There is no relationship to the specific Key.
>> >>>>
>> >>>> Using a salt means you gain the ability to randomly spread the
>> >>>> distribution of data to avoid HOT SPOTTING.
>> >>>> However, you lose the ability to seek for a specific row.
>> >>>>
>> >>>> YOU HASH THE KEY.
>> >>>>
>> >>>> The hash, whether you use SHA-1 or MD5, is going to yield the same
>> >>>> result each and every time you provide the key.
>> >>>>
>> >>>> But wait, the generated hash is 160 bits long. We don’t need that!
>> >>>> Absolutely true if you just want to randomize the key to avoid hot
>> >>>> spotting. There’s this concept called truncating the hash to the
>> desired
>> >>>> length.
>> >>>> So to Adrien’s point, you can truncate it to a single byte which
>> would
>> >> be
>> >>>> sufficient….
>> >>>> Now when you want to seek for a specific row, you can find it.
>> >>>>
>> >>>> The downside to any solution is that you lose the ability to do a
>> range
>> >>>> scan.
>> >>>> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO
>> FETCH A
>> >>>> SINGLE ROW VIA A get() CALL.
>> >>>>
>> >>>> <rant>
>> >>>> This simple fact has been pointed out several years ago, yet for some
>> >>>> reason, the use of a salt persists.
>> >>>> I’ve actually made that part of the HBase course I wrote and use it
>> in
>> >> my
>> >>>> presentation(s) on HBase.
>> >>>>
>> >>>> It amazes me that the committers and regulars who post here still
>> don’t
>> >>>> grok the fact that if you’re going to ‘SALT’ a row, you might as well
>> >> not
>> >>>> use HBase and stick with Hive.
>> >>>> I remember Ed C’s rant about how preferential treatment on Hive
>> patches
>> >>>> was given to vendors’ committers… that preferential treatment seems to
>> >>>> also be extended to speakers at conferences. It wouldn’t be a problem
>> >>>> if those said speakers actually knew the topic… ;-)
>> >>>>
>> >>>> Propagation of bad ideas means that you’re leaving a lot of
>> performance
>> >> on
>> >>>> the table and it can kill or cripple projects.
>> >>>>
>> >>>> </rant>
>> >>>>
>> >>>> Sorry for the rant…
>> >>>>
>> >>>> -Mike
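
The hash-the-key approach in this message can be sketched as follows (pure Python, illustrative helper names): the truncated hash gives the same spread as a random prefix, but because it is reproducible, a client can rebuild the stored key for a get().

```python
import hashlib
import os

def hashed_key(key: bytes) -> bytes:
    # Truncate SHA-1 (or MD5) to one byte: enough to spread writes
    # across regions, and recomputable from the key for point lookups.
    return hashlib.sha1(key).digest()[:1] + key

def random_salted_key(key: bytes) -> bytes:
    # A true random salt: the stored key cannot be reconstructed
    # from `key` alone, so a single get() is no longer possible.
    return os.urandom(1) + key

k = b"20140501"
assert hashed_key(k) == hashed_key(k)   # deterministic -> get() works
assert hashed_key(k)[1:] == k           # original key survives intact
```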
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On May 3, 2014, at 4:39 PM, Software Dev <st...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Ok so there is no way around the FuzzyRowFilter checking every
>> single
>> >>>>> row in the table correct? If so, what is a valid use case for that
>> >>>>> filter?
>> >>>>>
>> >>>>> Ok so salt to a low enough prefix that makes scanning reasonable.
>> Our
>> >>>>> client for accessing these tables is a Rails (not JRuby) application
>> >>>>> so we are stuck with either the Thrift or Rails client. Can either
>> of
>> >>>>> these perform multiple gets/scans?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <
>> >> adrien.mogenet@gmail.com>
>> >>>> wrote:
>> >>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data
>> can
>> >>>> be
>> >>>>>> split enough among all the possible regions, but you won't be able
>> to
>> >>>>>> easily benefit from distributed scans to gather what you want.
>> >>>>>>
>> >>>>>> Let's say you want to split (time+login) with a salted key and you
>> >> expect
>> >>>> to
>> >>>>>> be able to retrieve events from 20140429 pretty fast. Then I would
>> >> split
>> >>>>>> input data among 10 "spans", spread over 10 regions and 10 RS (ie:
>> >>>> `$random
>> >>>>>> % 10'). To retrieve ordered data, I would parallelize Scans over
>> the
>> >> 10
>> >>>>>> span groups (<00>-20140429, <01>-20140429...) and merge-sort
>> >> everything
>> >>>>>> until I've got all the expected results.
>> >>>>>>
>> >>>>>> So in terms of performance this looks "a little bit" faster than
>> your
>> >>>> 2^32
>> >>>>>> randomization.
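
Adrien's span-group scheme above might look like this in Python (the `<NN>-` key layout and span count follow his description; the helpers are illustrative): one scan per span group, with the per-group results merge-sorted.

```python
from heapq import merge

SPANS = 10  # the `$random % 10` span groups from the message above

def scan_bounds(date: bytes):
    # One (start, stop) pair per span group: <00>-20140429 ... <09>-20140429.
    return [(b"%02d-%s" % (s, date), b"%02d-%s\xff" % (s, date))
            for s in range(SPANS)]

bounds = scan_bounds(b"20140429")
assert bounds[0][0] == b"00-20140429"
assert bounds[9][0] == b"09-20140429"

# Each scan returns rows ordered within its span group; strip the 3-byte
# "<NN>-" prefix before merge-sorting so the final order is by date key.
per_span = [[b"00-20140429x1"], [b"01-20140429x0"]]
stripped = [sorted(row[3:] for row in rows) for rows in per_span]
assert list(merge(*stripped)) == [b"20140429x0", b"20140429x1"]
```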
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <
>> >>>> static.void.dev@gmail.com>wrote:
>> >>>>>>
>> >>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of
>> our
>> >>>>>>> time series data (20140501, 20140502...).  We can prefix all of
>> the
>> >>>>>>> keys with 4 random bytes and then just skip these during
>> scanning. Is
>> >>>>>>> that correct? This *seems* like it will work, but I'm questioning
>> >>>>>>> the performance of this even if it does work.
>> >>>>>>>
>> >>>>>>> Also, is this available via the rest client, shell and/or thrift
>> >>>> client?
>> >>>>>>>
>> >>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Adrien Mogenet
>> >>>>>> http://www.borntosegfault.com
>> >>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>

Re: Questions on FuzzyRowFilter

Posted by James Taylor <jt...@salesforce.com>.
The top two hits when you Google for HBase salt are
- Sematext blog describing "salting" as I described it in my email
- Phoenix blog again describing "salting" in this same way
I really don't understand what you're arguing about - the mechanism that
you're advocating for is exactly the way both these solutions have
implemented it. I believe we're all in agreement. It seems that you just
aren't happy with the fact that we've called this technique "salting".



Re: Questions on FuzzyRowFilter

Posted by Michael Segel <mi...@hotmail.com>.
@James…
You’re not listening. There is a special meaning when you say salt.



Re: Questions on FuzzyRowFilter

Posted by James Taylor <jt...@salesforce.com>.
@Mike,

The biggest problem is you're not listening. Please actually read my
response (and you'll understand that what we're calling "salting" is not a
random seed).

Phoenix already has secondary indexes in two flavors: one optimized for
write-once data and one more general for fully mutable data. Soon we'll
have a third for local indexing.

James



Re: Questions on FuzzyRowFilter

Posted by Michael Segel <mi...@hotmail.com>.
@James, 

I know and that’s the biggest problem. 
Salts by definition are random seeds. 

Now I have two new phrases. 

1) We want to remain on a sodium free diet. 
2) Learn to kick the bucket. 

When you have data that is coming in on a time series, is the data mutable or not? 

A better approach would be to redesign a second type of storage to handle serial data and how the regions are split and managed. 
Or just not use HBase to store the underlying data in the first place and just store the index… ;-)
(Yes, I thought about this too.)

-Mike

>>> filter?
>>> 
>>> Ok so salt to a low enough prefix that makes scanning reasonable. Our
>>> client for accessing these tables is a Rails (not JRuby) application
>>> so we are stuck with either the Thrift or Rails client. Can either of
>>> these perform multiple gets/scans?
>>> 
>>> 
>>> 
>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <ad...@gmail.com>
>> wrote:
>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can
>> be
>>>> split enough among all the possible regions, but you won't be able to
>>>> easily benefit from distributed scans to gather what you want.
>>>> 
>>>> Let say you want to split (time+login) with a salted key and you expect
>> to
>>>> be able to retrieve events from 20140429 pretty fast. Then I would split
>>>> input data among 10 "spans", spread over 10 regions and 10 RS (ie:
>> `$random
>>>> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
>>>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>>>> until I've got all the expected results.
>>>> 
>>>> So in term of performances this looks "a little bit" faster than your
>> 2^32
>>>> randomization.
>>>> 
>>>> 
>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <
>> static.void.dev@gmail.com>wrote:
>>>> 
>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>>>> time series data (20140501, 20140502...).  We can prefix all of the
>>>>> keys with 4 random bytes and then just skip these during scanning. Is
>>>>> that correct? These *seems* like it will work but Im questioning the
>>>>> performance of this even if it does work.
>>>>> 
>>>>> Also, is this available via the rest client, shell and/or thrift
>> client?
>>>>> 
>>>>> Also, is there a FuzzyColumn equivalent of this feature?
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Adrien Mogenet
>>>> http://www.borntosegfault.com
>>> 
>> 
>> 


Re: Questions on FuzzyRowFilter

Posted by James Taylor <jt...@salesforce.com>.
Hi Mike,
I agree with you - the way you've outlined is exactly the way Phoenix has
implemented it. It's a bit of a problem with terminology, though. We call
it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
key, mod the hash with the SALT_BUCKET value you provide, and prepend the
row key with this single byte value. Maybe you can coin a good term for
this technique?

FWIW, you don't lose the ability to do a range scan when you salt (or
hash-the-key and mod by the number of "buckets"), but you do need to run a
scan for each possible value of your salt byte (0 - SALT_BUCKET-1). Then
the client does a merge sort among these scans. It performs well.
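
A minimal standalone sketch of that scheme (the bucket count, hash choice, and key layout here are illustrative assumptions, not Phoenix's actual encoding):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKey {
    static final int SALT_BUCKETS = 8; // illustrative; set via SALT_BUCKETS in the Phoenix DDL

    // Deterministic: the same logical key always hashes to the same bucket,
    // so a single-row get() just recomputes the prefix byte.
    static byte bucketFor(byte[] logicalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(logicalKey);
            return (byte) ((digest[0] & 0xFF) % SALT_BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 ships with every JRE
        }
    }

    // Physical row key = 1 bucket byte + logical key.
    static byte[] saltedKey(byte[] logicalKey) {
        byte[] out = new byte[logicalKey.length + 1];
        out[0] = bucketFor(logicalKey);
        System.arraycopy(logicalKey, 0, out, 1, logicalKey.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] k = "20140429".getBytes(StandardCharsets.UTF_8);
        System.out.println("bucket=" + bucketFor(k)
                + ", physical key length=" + saltedKey(k).length);
    }
}
```

A range query then runs one scan per bucket value 0..SALT_BUCKETS-1, each with the bucket byte prepended to the start/stop keys, and merge-sorts the results client-side.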

Thanks,
James


On Fri, May 9, 2014 at 11:57 PM, Michael Segel <mi...@hotmail.com>wrote:

> 3+ Years on and a bad idea is being propagated again.
>
> Now repeat after me… DO NO USE A SALT.
>
> Having a low sodium diet, especially for HBase is really good for your
> health and sanity.
>
> The salt is going to be orthogonal to the row key (Key).
> There is no relationship to the specific Key.
>
> Using a salt means you now use the ability to randomly spread the
> distribution of data to avoid HOT SPOTTING.
> However you lose the ability to seek for a specific row.
>
> YOU HASH THE KEY.
>
> The hash whether you use SHA-1 or MD-5 is going to yield the same result
> each and every time you provide the key.
>
> But wait, the generated hash is 160 bits long. We don’t need that!
> Absolutely true if you just want to randomize the key to avoid hot
> spotting. There’s this concept called truncating the hash to the desired
> length.
> So to Adrien’s point, you can truncate it to a single byte which would be
> sufficient….
> Now when you want to seek for a specific row, you can find it.
>
> The downside to any solution is that you lose the ability to do a range
> scan.
> BUT BY USING A HASH AND NOT A SALT, YOU DONT LOSE THE ABILITY TO FETCH A
> SINGLE ROW VIA A get() CALL.
>
> <rant>
> This simple fact has been pointed out several years ago, yet for some
> reason, the use of a salt persists.
> I’ve actually made that part of the HBase course I wrote and use it in my
> presentation(s) on HBase.
>
> It amazes me that the committers and regulars who post here still don’t
> grok the fact that if you’re going to ‘SALT’ a row, you might as well not
> use HBase and stick with Hive.
> I remember Ed C’s rant about how preferential treatment on Hive patches
> was given to vendors’ committers… that preferential treatment seems to also
> be extended speakers at conferences. It wouldn’t be a problem if those said
> speakers actually knew the topic… ;-)
>
> Propagation of bad ideas means that you’re leaving a lot of performance on
> the table and it can kill or cripple projects.
>
> </rant>
>
> Sorry for the rant…
>
> -Mike
>
>
>
>
> On May 3, 2014, at 4:39 PM, Software Dev <st...@gmail.com>
> wrote:
>
> > Ok so there is no way around the FuzzyRowFilter checking every single
> > row in the table correct? If so, what is a valid use case for that
> > filter?
> >
> > Ok so salt to a low enough prefix that makes scanning reasonable. Our
> > client for accessing these tables is a Rails (not JRuby) application
> > so we are stuck with either the Thrift or Rails client. Can either of
> > these perform multiple gets/scans?
> >
> >
> >
> > On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <ad...@gmail.com>
> wrote:
> >> Using 4 random bytes you'll get 2^32 possibilities; thus your data can
> be
> >> split enough among all the possible regions, but you won't be able to
> >> easily benefit from distributed scans to gather what you want.
> >>
> >> Let say you want to split (time+login) with a salted key and you expect
> to
> >> be able to retrieve events from 20140429 pretty fast. Then I would split
> >> input data among 10 "spans", spread over 10 regions and 10 RS (ie:
> `$random
> >> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
> >> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
> >> until I've got all the expected results.
> >>
> >> So in term of performances this looks "a little bit" faster than your
> 2^32
> >> randomization.
> >>
> >>
> >> On Fri, May 2, 2014 at 10:09 PM, Software Dev <
> static.void.dev@gmail.com>wrote:
> >>
> >>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
> >>> time series data (20140501, 20140502...).  We can prefix all of the
> >>> keys with 4 random bytes and then just skip these during scanning. Is
> >>> that correct? These *seems* like it will work but Im questioning the
> >>> performance of this even if it does work.
> >>>
> >>> Also, is this available via the rest client, shell and/or thrift
> client?
> >>>
> >>> Also, is there a FuzzyColumn equivalent of this feature?
> >>>
> >>
> >>
> >>
> >> --
> >> Adrien Mogenet
> >> http://www.borntosegfault.com
> >
>
>

Re: Questions on FuzzyRowFilter

Posted by Michael Segel <mi...@hotmail.com>.
3+ Years on and a bad idea is being propagated again. 

Now repeat after me… DO NOT USE A SALT.

Having a low sodium diet, especially for HBase is really good for your health and sanity.

The salt is going to be orthogonal to the row key (Key). 
There is no relationship to the specific Key. 

Using a salt lets you randomly spread the data to avoid HOT SPOTTING. 
However, you lose the ability to seek to a specific row. 

YOU HASH THE KEY.

The hash, whether you use SHA-1 or MD5, is going to yield the same result each and every time you provide the key.

But wait, the generated hash is 160 bits long. We don’t need that!
Absolutely true if you just want to randomize the key to avoid hot spotting. There’s this concept called truncating the hash to the desired length. 
So to Adrien’s point, you can truncate it to a single byte which would be sufficient….
Now when you want to seek for a specific row, you can find it. 

The downside to any solution is that you lose the ability to do a range scan. 
BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO FETCH A SINGLE ROW VIA A get() CALL.
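
A sketch of that truncation (the hash algorithm and one-byte prefix length here are illustrative choices):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashPrefix {
    // SHA-1 yields 160 bits; keep only the first byte as the row-key prefix.
    // Because it is derived from the key itself (unlike a random salt), a
    // get() can recompute the full physical key from the logical key alone.
    static byte prefix(String logicalKey) {
        try {
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(logicalKey.getBytes(StandardCharsets.UTF_8));
            return sha1[0];
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-1 ships with every JRE
        }
    }

    public static void main(String[] args) {
        // Deterministic: the same date always produces the same prefix.
        System.out.println(prefix("20140501") == prefix("20140501")); // prints "true"
    }
}
```

Unlike a random salt, nothing needs to be stored or guessed: the prefix is recomputed from the key itself at read time.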

<rant>
This simple fact has been pointed out several years ago, yet for some reason, the use of a salt persists. 
I’ve actually made that part of the HBase course I wrote and use it in my presentation(s) on HBase. 

It amazes me that the committers and regulars who post here still don’t grok the fact that if you’re going to ‘SALT’ a row, you might as well not use HBase and stick with Hive. 
I remember Ed C’s rant about how preferential treatment on Hive patches was given to vendors’ committers… that preferential treatment seems to also be extended to speakers at conferences. It wouldn’t be a problem if those speakers actually knew the topic… ;-) 

Propagation of bad ideas means that you’re leaving a lot of performance on the table and it can kill or cripple projects.

</rant>

Sorry for the rant…

-Mike




On May 3, 2014, at 4:39 PM, Software Dev <st...@gmail.com> wrote:

> Ok so there is no way around the FuzzyRowFilter checking every single
> row in the table correct? If so, what is a valid use case for that
> filter?
> 
> Ok so salt to a low enough prefix that makes scanning reasonable. Our
> client for accessing these tables is a Rails (not JRuby) application
> so we are stuck with either the Thrift or Rails client. Can either of
> these perform multiple gets/scans?
> 
> 
> 
> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <ad...@gmail.com> wrote:
>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
>> split enough among all the possible regions, but you won't be able to
>> easily benefit from distributed scans to gather what you want.
>> 
>> Let say you want to split (time+login) with a salted key and you expect to
>> be able to retrieve events from 20140429 pretty fast. Then I would split
>> input data among 10 "spans", spread over 10 regions and 10 RS (ie: `$random
>> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>> until I've got all the expected results.
>> 
>> So in term of performances this looks "a little bit" faster than your 2^32
>> randomization.
>> 
>> 
>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <st...@gmail.com>wrote:
>> 
>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>> time series data (20140501, 20140502...).  We can prefix all of the
>>> keys with 4 random bytes and then just skip these during scanning. Is
>>> that correct? These *seems* like it will work but Im questioning the
>>> performance of this even if it does work.
>>> 
>>> Also, is this available via the rest client, shell and/or thrift client?
>>> 
>>> Also, is there a FuzzyColumn equivalent of this feature?
>>> 
>> 
>> 
>> 
>> --
>> Adrien Mogenet
>> http://www.borntosegfault.com
> 


Re: Questions on FuzzyRowFilter

Posted by Software Dev <st...@gmail.com>.
Edit. I should have mentioned that my access pattern is a bit
different. I'll need to scan between dates... 20140101 -> 20140501, not
an individual date. My table is actually a bunch of increments so as
of right now, there is only 1 row key per timeframe.

On Sat, May 3, 2014 at 8:39 AM, Software Dev <st...@gmail.com> wrote:
> Ok so there is no way around the FuzzyRowFilter checking every single
> row in the table correct? If so, what is a valid use case for that
> filter?
>
> Ok so salt to a low enough prefix that makes scanning reasonable. Our
> client for accessing these tables is a Rails (not JRuby) application
> so we are stuck with either the Thrift or Rails client. Can either of
> these perform multiple gets/scans?
>
>
>
> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <ad...@gmail.com> wrote:
>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
>> split enough among all the possible regions, but you won't be able to
>> easily benefit from distributed scans to gather what you want.
>>
>> Let say you want to split (time+login) with a salted key and you expect to
>> be able to retrieve events from 20140429 pretty fast. Then I would split
>> input data among 10 "spans", spread over 10 regions and 10 RS (ie: `$random
>> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>> until I've got all the expected results.
>>
>> So in term of performances this looks "a little bit" faster than your 2^32
>> randomization.
>>
>>
>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <st...@gmail.com>wrote:
>>
>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>> time series data (20140501, 20140502...).  We can prefix all of the
>>> keys with 4 random bytes and then just skip these during scanning. Is
>>> that correct? These *seems* like it will work but Im questioning the
>>> performance of this even if it does work.
>>>
>>> Also, is this available via the rest client, shell and/or thrift client?
>>>
>>> Also, is there a FuzzyColumn equivalent of this feature?
>>>
>>
>>
>>
>> --
>> Adrien Mogenet
>> http://www.borntosegfault.com

Re: Questions on FuzzyRowFilter

Posted by Software Dev <st...@gmail.com>.
Ok so there is no way around the FuzzyRowFilter checking every single
row in the table, correct? If so, what is a valid use case for that
filter?

Ok so salt to a low enough prefix that makes scanning reasonable. Our
client for accessing these tables is a Rails (not JRuby) application
so we are stuck with either the Thrift or Rails client. Can either of
these perform multiple gets/scans?



On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <ad...@gmail.com> wrote:
> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
> split enough among all the possible regions, but you won't be able to
> easily benefit from distributed scans to gather what you want.
>
> Let say you want to split (time+login) with a salted key and you expect to
> be able to retrieve events from 20140429 pretty fast. Then I would split
> input data among 10 "spans", spread over 10 regions and 10 RS (ie: `$random
> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
> until I've got all the expected results.
>
> So in term of performances this looks "a little bit" faster than your 2^32
> randomization.
>
>
> On Fri, May 2, 2014 at 10:09 PM, Software Dev <st...@gmail.com>wrote:
>
>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>> time series data (20140501, 20140502...).  We can prefix all of the
>> keys with 4 random bytes and then just skip these during scanning. Is
>> that correct? These *seems* like it will work but Im questioning the
>> performance of this even if it does work.
>>
>> Also, is this available via the rest client, shell and/or thrift client?
>>
>> Also, is there a FuzzyColumn equivalent of this feature?
>>
>
>
>
> --
> Adrien Mogenet
> http://www.borntosegfault.com

Re: Questions on FuzzyRowFilter

Posted by Adrien Mogenet <ad...@gmail.com>.
Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
split enough among all the possible regions, but you won't be able to
easily benefit from distributed scans to gather what you want.
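
For context on what the filter actually checks: FuzzyRowFilter takes a pattern plus a byte-per-position mask, and the real implementation also emits seek hints so the scan can jump between candidate keys rather than evaluate literally every row. A standalone sketch of just the mask semantics (here 0 = byte must match, 1 = any byte; the class and helper are illustrative, not the HBase API):

```java
import java.nio.charset.StandardCharsets;

public class FuzzyMatch {
    // mask[i] == 0 -> row[i] must equal pattern[i]; mask[i] == 1 -> any byte is fine.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 4 "don't care" positions for the random prefix, then a fixed date.
        byte[] pattern = "____20140501".getBytes(StandardCharsets.UTF_8);
        byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(matches("x7Qz20140501".getBytes(StandardCharsets.UTF_8),
                pattern, mask)); // prints "true"
        System.out.println(matches("x7Qz20140502".getBytes(StandardCharsets.UTF_8),
                pattern, mask)); // prints "false"
    }
}
```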

Let's say you want to split (time+login) with a salted key and you expect to
be able to retrieve events from 20140429 pretty fast. Then I would split
input data among 10 "spans", spread over 10 regions and 10 RS (i.e., `$random
% 10`). To retrieve ordered data, I would parallelize Scans over the 10
span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
until I've got all the expected results.

So in terms of performance, this looks "a little bit" faster than your 2^32
randomization.
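
That fan-out-and-merge step can be sketched in plain Java; the per-bucket lists below stand in for the per-span scan results, which HBase already returns in key order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class BucketMerge {

    // One entry per bucket stream: the head value plus the rest of the stream.
    private static final class Head {
        final String value;
        final Iterator<String> rest;
        Head(String value, Iterator<String> rest) { this.value = value; this.rest = rest; }
    }

    // Merge per-bucket scan results (each list already sorted, as an HBase
    // scan returns rows in key order) into one globally ordered list.
    static List<String> merge(List<List<String>> buckets) {
        PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing(h -> h.value));
        for (List<String> bucket : buckets) {
            Iterator<String> it = bucket.iterator();
            if (it.hasNext()) pq.add(new Head(it.next(), it));
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Head h = pq.poll();
            out.add(h.value);
            if (h.rest.hasNext()) pq.add(new Head(h.rest.next(), h.rest));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> buckets = List.of(
                List.of("20140429-a", "20140430-b"),
                List.of("20140429-c"),
                List.of("20140428-d", "20140501-e"));
        System.out.println(merge(buckets));
        // prints [20140428-d, 20140429-a, 20140429-c, 20140430-b, 20140501-e]
    }
}
```

The priority queue always holds at most one head per bucket, so the merge is O(n log k) for n total rows across k buckets.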


On Fri, May 2, 2014 at 10:09 PM, Software Dev <st...@gmail.com>wrote:

> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
> time series data (20140501, 20140502...).  We can prefix all of the
> keys with 4 random bytes and then just skip these during scanning. Is
> that correct? These *seems* like it will work but Im questioning the
> performance of this even if it does work.
>
> Also, is this available via the rest client, shell and/or thrift client?
>
> Also, is there a FuzzyColumn equivalent of this feature?
>



-- 
Adrien Mogenet
http://www.borntosegfault.com