
Posted to user@hbase.apache.org by Alex Baranau <al...@gmail.com> on 2012/08/17 22:42:08 UTC

Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

There was a question [1] in a JIRA comment on
https://issues.apache.org/jira/browse/HBASE-6509; it makes more sense to
answer it here.

With the current FuzzyRowFilter, I believe the only way to approach the
problem is to add one fuzzy rule per actionid to the filter: ??????00200,
??????00201, ..., ??????00350. That is 151 rules for the inclusive range
00200-00350.
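
To make the rule list concrete, here is a minimal sketch in plain Java that
generates the key/mask pairs. The class and method names are made up for
illustration. It assumes the mask convention described in the HBASE-6509
patch (0 = byte is fixed, 1 = byte is not fixed) and an 11-byte key with no
separator, as in the rules above; please verify both against the version of
the filter you copy. Each pair would then go into the list that the
FuzzyRowFilter constructor takes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of HBase: builds one fuzzy rule per actionid.
final class FuzzyRuleBuilder {
    static final int USER_ID_LEN = 6; // leading bytes that may hold any value

    /** One rule: key bytes plus a mask (assumed: 0 = fixed byte, 1 = any byte). */
    static final class Rule {
        final byte[] key;
        final byte[] mask;

        Rule(byte[] key, byte[] mask) {
            this.key = key;
            this.mask = mask;
        }
    }

    /** Builds rules ??????NNNNN for every actionid in [from, to], inclusive. */
    static List<Rule> buildRules(int from, int to) {
        List<Rule> rules = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            // Placeholder bytes for the userid, then the zero-padded actionid.
            String text = "??????" + String.format(Locale.ROOT, "%05d", id);
            byte[] key = text.getBytes(StandardCharsets.US_ASCII);
            byte[] mask = new byte[key.length]; // all 0 = all fixed
            for (int i = 0; i < USER_ID_LEN; i++) {
                mask[i] = 1; // userid bytes are not fixed
            }
            rules.add(new Rule(key, mask));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<Rule> rules = buildRules(200, 350);
        String first = new String(rules.get(0).key, StandardCharsets.US_ASCII);
        System.out.println(rules.size() + " rules, first = " + first);
        // prints "151 rules, first = ??????00200"
    }
}
```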

As for the performance of this approach, I can say the following:
* There are two "checks" for each processed row key (i.e. each row key we
don't skip).
* The first check tests whether the given row key satisfies a fuzzy rule
and, if it doesn't, determines whether there is a next row key to advance
to. It takes at most O(n), where n is the length of the fuzzy rule: it is
one simple loop, which can exit before all bytes are checked. For m rules
this is O(m*n).
* The second check calculates the next row key to provide as a hint for
fast-forwarding. We again go through all the rules and pick the smallest
hint. This is also done in one loop per rule, i.e. O(m*n) here as well.
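
The first check can be sketched roughly as follows. This is a simplified
illustration in plain Java, not the code from the patch: the names are
hypothetical, the mask convention (0 = fixed, 1 = not fixed) is assumed, and
the next-row-key hint calculation is left out entirely.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the per-row match check; not the HBase implementation.
final class FuzzyMatchSketch {
    /** O(n) check of one rule: every fixed byte (mask 0) must equal the row key byte. */
    static boolean satisfies(byte[] rowKey, byte[] ruleKey, byte[] mask) {
        if (rowKey.length < ruleKey.length) {
            return false; // simplification: shorter keys never match
        }
        for (int i = 0; i < ruleKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != ruleKey[i]) {
                return false; // the loop exits on the first mismatch
            }
        }
        return true;
    }

    /** O(m*n) check over m rules: the row passes if any rule is satisfied. */
    static boolean satisfiesAny(byte[] rowKey, byte[][] ruleKeys, byte[][] masks) {
        for (int r = 0; r < ruleKeys.length; r++) {
            if (satisfies(rowKey, ruleKeys[r], masks[r])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] rule = "??????00200".getBytes(StandardCharsets.US_ASCII);
        byte[] mask = {1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0};
        byte[] hit = "12345600200".getBytes(StandardCharsets.US_ASCII);
        byte[] miss = "12345600201".getBytes(StandardCharsets.US_ASCII);
        System.out.println(satisfies(hit, rule, mask));  // prints "true"
        System.out.println(satisfies(miss, rule, mask)); // prints "false"
    }
}
```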

With about 150 fuzzy rules of length 11, applying the filter to a row key is
equivalent to a loop of simple checks over about 150*11*2 = 3300 elements.
This might look like a lot, but it can work quite fast. So I'd just try it.

As for an extension that would be more efficient, it makes sense to consider
implementing it. Let me think more about it and get back to you with a JIRA
issue :). But I'd suggest you try the existing FuzzyRowFilter first. The
results (performance) would give us some food for thought, or may even
turn out to be acceptable for you (hopefully).
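
Just to illustrate what a range-aware check could look like (this is only a
sketch of the idea, not a design for the extension, and all names here are
hypothetical): for fixed-width, zero-padded digits, byte-wise comparison
gives the same order as numeric comparison, so a single check against a lower
and an upper bound can stand in for the ~150 per-value rules. It does not
cover the fast-forward hint, which is the harder part.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a range check on a fixed-width key segment.
final class ActionIdRangeSketch {
    /** True if rowKey[offset .. offset+lo.length) lies in [lo, hi], compared byte-wise. */
    static boolean inRange(byte[] rowKey, int offset, byte[] lo, byte[] hi) {
        if (lo.length != hi.length || rowKey.length < offset + lo.length) {
            return false;
        }
        return compare(rowKey, offset, lo) >= 0 && compare(rowKey, offset, hi) <= 0;
    }

    /** Unsigned lexicographic comparison of a key segment against a bound. */
    static int compare(byte[] rowKey, int offset, byte[] bound) {
        for (int i = 0; i < bound.length; i++) {
            int a = rowKey[offset + i] & 0xff;
            int b = bound[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] lo = "00200".getBytes(StandardCharsets.US_ASCII);
        byte[] hi = "00350".getBytes(StandardCharsets.US_ASCII);
        byte[] inside = "12345600275".getBytes(StandardCharsets.US_ASCII);
        byte[] outside = "12345600351".getBytes(StandardCharsets.US_ASCII);
        System.out.println(inRange(inside, 6, lo, hi));  // prints "true"
        System.out.println(inRange(outside, 6, lo, hi)); // prints "false"
    }
}
```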

> Can i run this kind of filter on HBase0.92 without doing any significant
> update to the cluster

Until the next release, you'll have to use FuzzyRowFilter like any other
custom filter. Just grab the patch from HBASE-6509 and copy the filter. No
need to patch & rebuild HBase.

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1]

Anil Gupta added a comment - 18/Aug/12 04:37
Hi Alex,
I have a question related to this filter. I have a similar filtering
requirement which would be an extension to FuzzyRowFilter.
Suppose I have the following rowkey structure: userid_actionid, where
userid is 6 digits and actionid is 5 digits. I would like to get all
the rows with an actionid between 00200 and 00350. With the current
FuzzyRowFilter I can search for all the rows with a particular actionid.
Instead of searching for a particular actionid, I would like to search for
a range of actionids.
Does this use case sound like an extension to the current FuzzyRowFilter?
Can I run this kind of filter on HBase 0.92 without doing any significant
update to the cluster? If I develop this kind of filter, what is needed to
run it on all the RSs?
Thanks,
Anil

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by Alex Baranau <al...@gmail.com>.

Anil,

It really depends on how many HFiles can be skipped entirely. In general,
given that this is like a full-table scan with a filter, your time is good.
Especially if it is satisfactory to you :). Glad that the idea of setting
the timestamp manually helped. This trick is overlooked too often :(
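
A toy model of why this works, for anyone reading along. It is plain Java,
not HBase code: the class and field names are made up, and the timestamps are
written as yyyymmdd numbers purely for readability (real cell timestamps are
normally milliseconds). The assumption is that each file carries the min and
max timestamp of its cells, and that a scan with a time range [startTs,
stopTs) only needs to open files whose range overlaps it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of time-range based file skipping; not the HBase implementation.
final class TimeRangeSkipSketch {
    /** Stand-in for a store file and the timestamp range of the cells it holds. */
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Returns names of files whose [minTs, maxTs] overlaps the scan range [startTs, stopTs). */
    static List<String> filesToRead(List<FileInfo> files, long startTs, long stopTs) {
        List<String> keep = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.maxTs >= startTs && f.minTs < stopTs) {
                keep.add(f.name);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Data written in timestamp order: each file covers a narrow slice of time.
        List<FileInfo> ordered = new ArrayList<>();
        ordered.add(new FileInfo("hfile-1", 20120701L, 20120715L));
        ordered.add(new FileInfo("hfile-2", 20120716L, 20120731L));
        ordered.add(new FileInfo("hfile-3", 20120801L, 20120805L));
        System.out.println(filesToRead(ordered, 20120803L, 20120806L)); // prints "[hfile-3]"

        // Timestamps spread across all files: nothing can be skipped.
        List<FileInfo> mixed = new ArrayList<>();
        mixed.add(new FileInfo("hfile-1", 20120701L, 20120805L));
        mixed.add(new FileInfo("hfile-2", 20120701L, 20120805L));
        System.out.println(filesToRead(mixed, 20120803L, 20120806L)); // prints "[hfile-1, hfile-2]"
    }
}
```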

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Wed, Aug 22, 2012 at 2:18 AM, anil gupta <an...@gmail.com> wrote:

> Hi Alex,
>
> Thanks for creating the JIRA.
> On Monday, I completed testing the time range filtering using timestamps,
> and IMO the results seem satisfactory (if not great). The table has 34
> million records (average row size is 1.21 KB), and in 136 seconds I get the
> entire result of a query which returned 225 rows.
> I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each
> node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my
> set-up hosts 2 slave instances (2 VMs running DataNode, NodeManager,
> RegionServer). I have only allocated 1200 MB to the RSs. I haven't
> modified the block size of HDFS or HBase. Considering the below-par
> hardware configuration of the cluster, does the performance sound OK for
> timestamp filtering?
>
> Thanks,
> Anil
>
> On Mon, Aug 20, 2012 at 1:07 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>
> > Created: https://issues.apache.org/jira/browse/HBASE-6618
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Sat, Aug 18, 2012 at 5:02 PM, anil gupta <an...@gmail.com> wrote:
> >
> > > Hi Alex,
> > >
> > > Apart from the query which I mentioned in my last email, I have so far
> > > implemented the following queries using filters and coprocessors:
> > >
> > > 1. Getting all the records for a customer.
> > > 2. Performing min, max, avg, sum aggregation for a customer using
> > > coprocessors. I am also storing some of the data as BigDecimal to do
> > > accurate floating point calculations.
> > > 3. Performing min, max, avg, sum aggregation for a customer within a
> > > given time range using coprocessors.
> > > 4. Filtering the data for a customer within a given time range on the
> > > basis of column values. The filtering on column values can be matching
> > > a string value or doing a range-based numerical comparison.
> > >
> > > Basically, as per our current requirements, all the queries have a
> > > customerid and most of the queries also have a time range. We are not
> > > in prod yet. All of this effort is part of a POC.
> > >
> > > 2. Can you set timestamp on Puts the same as timestamp "assigned" to
> > > your record by app logic?
> > > Anil: Wow! This sounds like an awesome idea. Actually, my data is
> > > immutable, so at present I was putting 0 as the timestamp for all the
> > > data. I will definitely try this. Currently, I run the bulk loader to
> > > load the data, so I think it's going to be a small change.
> > >
> > > Yes, I would love to try developing a range-based FuzzyRowFilter.
> > > However, first I am going to try putting in the timestamp.
> > >
> > > Thanks for a very helpful discussion. Let me know when you create the
> > > JIRA for the range-based FuzzyRowFilter.
> > >
> > > Thanks,
> > > Anil Gupta
> > >
> > > On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> > >
> > > > @Michael,
> > > >
> > > > This is not a simple partial key scan. Take this example of rows:
> > > >
> > > > aaaaa_100001_20120801
> > > > aaaaa_100001_20120802
> > > > aaaaa_100001_20120802
> > > > aaaaa_100001_20120803
> > > > aaaaa_100001_20120804
> > > > aaaaa_100001_20120805
> > > > aaaaa_100002_20120801
> > > > aaaaa_100002_20120802
> > > > aaaaa_100002_20120802
> > > > aaaaa_100002_20120803
> > > > aaaaa_100002_20120804
> > > > aaaaa_100002_20120805
> > > >
> > > > where aaaaa is the userId, 10000x is the actionId and 201208xx is a
> > > > timestamp. If the query is to select actions in the range
> > > > 20120803-20120805 (in this case, the last 3 days), then when the scan
> > > > encounters the row:
> > > >
> > > > aaaaa_100001_20120801
> > > >
> > > > it "knows" it can fast-forward to "aaaaa_100001_20120803" and skip
> > > > some records (in practice, this may mean skipping really a LOT of
> > > > records).
> > > >
> > > >
> > > > @Anil,
> > > >
> > > > > Sample Query: I want to get all the event which happened in last month.
> > > >
> > > > 1. What other queries do you do? Just trying to understand why this
> > > > row key format was chosen.
> > > >
> > > > 2. Can you set the timestamp on Puts to the same timestamp "assigned"
> > > > to your record by app logic? If you can, then the first thing to try
> > > > is performing the scan with scan.setTimeRange(startTs, stopTs).
> > > > Depending on how you write the data, this may help a lot with the
> > > > speed of reading by ts, because that way whole HFiles may be skipped
> > > > based on ts. I don't know enough about your data to judge, but:
> > > >   * if you have few users, most of whom have a long history of
> > > > interaction with your system (i.e. there are a lot of records for a
> > > > specific "userX_actionY"), and
> > > >   * if you write data with monotonically increasing timestamps, and
> > > >   * your regions are not too big,
> > > > then this might help you, as it increases the chance that some of
> > > > the HFiles will contain data *none of which* falls into the time
> > > > interval you select by. Otherwise, if data items with different
> > > > timestamps are spread evenly across the HFiles, the chance that some
> > > > HFiles are skipped is very small. I believe Lars George has
> > > > illustrated this in one of his presentations, but I couldn't find it
> > > > quickly.
> > > >
> > > > > something like FuzzyRowFilter with range
> > > >
> > > > Yes, something like this looks like it would be very valuable. It
> > > > would be interesting to implement too. Let's see if I find the time
> > > > for that in my work plan. If you want to try it yourself, go for it!
> > > > Let me know if you need help in that case ;)
> > > >
> > > > Alex Baranau
> > > > ------
> > > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > > >
> > > > On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> > > >
> > > > > What row keys are you skipping?
> > > > >
> > > > > Using your example...
> > > > > You have a start row of 00000000200, and an end key of
> > > > > xFFxFFxFFxFFxFFxFF00350.
> > > > > Note that you could also write that end key as xFF(1..6) 01 since it
> > > > > looks like you're trying to match the 00 in positions 7 and 8 of
> > > > > your numeric string.
> > > > >
> > > > > Assuming that when you say ? you mean that you expect to have a
> > > > > character in that spot and that your row key is exactly 11
> > > > > characters in length.
> > > > >
> > > > > While you may not return all the rows in that range, you do still
> > > > > have to check the row key, unless I am missing something.
> > > > >
> > > > > So what am I missing?
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by anil gupta <an...@gmail.com>.

Hi Alex,

Thanks for creating the JIRA.
On Monday, I completed testing the time range filtering using timestamps,
and IMO the results seem satisfactory (if not great). The table has 34
million records (average row size is 1.21 KB), and in 136 seconds I get the
entire result of a query which returned 225 rows.
I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each
node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my
set-up hosts 2 slave instances (2 VMs running DataNode, NodeManager,
RegionServer). I have only allocated 1200 MB to the RSs. I haven't
modified the block size of HDFS or HBase. Considering the below-par
hardware configuration of the cluster, does the performance sound OK for
timestamp filtering?

Thanks,
Anil

On Mon, Aug 20, 2012 at 1:07 PM, Alex Baranau <al...@gmail.com> wrote:

> Created: https://issues.apache.org/jira/browse/HBASE-6618
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>



-- 
Thanks & Regards,
Anil Gupta

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by Alex Baranau <al...@gmail.com>.

Created: https://issues.apache.org/jira/browse/HBASE-6618

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr


Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by anil gupta <an...@gmail.com>.

Hi Alex,

Apart from the query I mentioned in my last email, I have so far
implemented the following queries using filters and coprocessors:

1. Getting all the records for a customer.
2. Performing min, max, avg, and sum aggregations for a customer using
coprocessors. I also store some of the data as BigDecimal so that decimal
calculations stay accurate.
3. Performing min, max, avg, and sum aggregations for a customer within a
given time range using coprocessors.
4. Filtering the data for a customer within a given time range on the basis
of column values. The filter on column values can match a string value or
do a range-based numerical comparison.
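
As an illustration of why BigDecimal matters for item 2, here is a minimal
sketch in plain Java. It does not use the HBase coprocessor API at all; the
class name DecimalAggregates, its methods, and the HALF_UP rounding are
assumptions made for this example only. With doubles, 0.1 + 0.2 is not equal
to 0.3, while the BigDecimal sum below is exact.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

// Hypothetical helper, not part of HBase: exact decimal aggregates.
final class DecimalAggregates {
    static BigDecimal sum(List<BigDecimal> values) {
        BigDecimal total = BigDecimal.ZERO;
        for (BigDecimal v : values) {
            total = total.add(v);
        }
        return total;
    }

    // Assumes a non-empty list.
    static BigDecimal min(List<BigDecimal> values) {
        BigDecimal m = values.get(0);
        for (BigDecimal v : values) {
            if (v.compareTo(m) < 0) m = v;
        }
        return m;
    }

    // Assumes a non-empty list.
    static BigDecimal max(List<BigDecimal> values) {
        BigDecimal m = values.get(0);
        for (BigDecimal v : values) {
            if (v.compareTo(m) > 0) m = v;
        }
        return m;
    }

    // Average rounded to the given scale; scale and rounding mode are
    // arbitrary choices for this sketch.
    static BigDecimal avg(List<BigDecimal> values, int scale) {
        return sum(values).divide(
            BigDecimal.valueOf(values.size()), scale, RoundingMode.HALF_UP);
    }
}
```

For example, DecimalAggregates.sum over 0.10, 0.20 and 0.30 compares equal
to 0.60, and avg with scale 2 compares equal to 0.20.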

Basically, as per our current requirements, all the queries include a
customerid and most of them also include a time range. We are not in
production yet; all of this effort is part of a POC.

> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> record by app logic?

Anil: Wow! This sounds like an awesome idea. Actually, my data is
immutable, so at present I was putting 0 as the timestamp for all the
data. I will definitely try this. Currently, I run the bulk loader to load
the data, so I think it's going to be a small change.

Yes, I would love to try developing a range-based FuzzyRowFilter myself.
However, first I am going to try setting the timestamp.

Thanks for a very helpful discussion. Let me know when you create the JIRA
for range-based FuzzyRowFilter.

Thanks,
Anil Gupta

On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <al...@gmail.com>wrote:

> @Michael,
>
> This is not a simple partial key scan. Take this example of rows:
>
> aaaaa_100001_20120801
> aaaaa_100001_20120802
> aaaaa_100001_20120802
> aaaaa_100001_20120803
> aaaaa_100001_20120804
> aaaaa_100001_20120805
> aaaaa_100002_20120801
> aaaaa_100002_20120802
> aaaaa_100002_20120802
> aaaaa_100002_20120803
> aaaaa_100002_20120804
> aaaaa_100002_20120805
>
> where aaaaa is userId, 10000x is actionId and 201208xx is a timestamp. If
> the query is to select actions in the range 20120803-20120805 (in this case
> last 3 days), then when scan encounters row:
>
> aaaaa_100001_20120801
>
> it "knows" it can fast forward scanning to "aaaaa_100001_20120803", and
> skip some records (in practice, this may mean skipping really a LOT of
> recrods).
>
>
> @Anil,
>
> > Sample Query: I want to get all the event which happened in last month.
>
> 1. What other queries do you do? Just trying to understand why this row key
> format was chosen.
>
> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> record by app logic? If you can, then this is the first thing to try and
> perform scan with the help of scan.setTimeRange(startTs, stopTs). Depending
> on how you write the data this may help a lot with the reading speed by ts,
> because that way you may skip the whole HFiles from reading based on ts. I
> don't know about your data a lot to judge, but:
>   * in case you have not a lot of users most of which are with long history
> of interaction with you system (i.e. there are a lot of records for
> specific "userX_actionY") and
>   * if you write data with monotonically increasing timestamp
>   * your regions are not too big
> then this might help you, as it will increase the chance that some of the
> HFiles will contain data *all of which* doesn't fell into the time interval
> you select by. Otherwise, if written data items with different timestamps
> are very well spread across the HFiles the chance that some HFiles are
> skipped from reading is very small. I believe Lars George has illustrated
> this in one of his presentations, but couldn't find it quickly.
>
> > something like FuzzyRowFilter with range
>
> Yes, smth like this looks like would be very valuable. It would be
> interesting to implement too. Let's see if I find the time for that in my
> work plan. If you want to try it by yourself, go for it! Let me know if you
> need a help in that case ;)
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <michael_segel@hotmail.com
> >wrote:
>
> > What row keys are you skipping?
> >
> > Using your example...
> > You have a start row of 00000000200, and an end key of
> > xFFxFFxFFxFFxFFxFF00350.
> > Note that you could also write that end key as xFF(1..6) 01 since it
> looks
> > like you're trying to match the 00 in positons 7 and 8 of your numeric
> > string.
> >
> > Assuming that when you say ? you mean that you expect to have a character
> > in that spot and that your row key is exactly 11 characters in length.
> >
> > While you may not return all the rows in that range, you do have to still
> > check the row key, unless I am missing something.
> >
> > So what am I missing?
> >
> > On Aug 17, 2012, at 3:42 PM, Alex Baranau <al...@gmail.com>
> > wrote:
> >
> > > There was a question [1] in
> > > https://issues.apache.org/jira/browse/HBASE-6509JIRA comment, it makes
> > > more sense to answer it here.
> > >
> > > With the current FuzzyRowFilter I believe the only way to approach the
> > > problem is to add 150 fuzzy rules to the filter: ??????00200,
> > ??????00201,
> > > ..., ??????00350.
> > >
> > > As for performance of this approach I can say the following:
> > > * there are two "checks" happening for each processed row key (i.e.
> those
> > > row keys we don't skip)
> > > * first one performs simple check if the given row key satisfies the
> > fuzzy
> > > rule and also determines if there's next row key to advance to (if this
> > one
> > > doesn't satisfy). The check takes up at max O(n), where n is the length
> > of
> > > fuzzy rule. I.e. this is done in one simple loop, which can be broken
> > > before all bytes are checked. For m rules this will be O(m*n).
> > > * second piece calculates the next row key to provide it as a hint for
> > > fast-forwarding. We again check all rules and finding the smallest
> hint.
> > > Operation is also done in one loop, i.e. O(m*n) here as well.
> > >
> > > With 150 fuzzy rules of length 11, the applying filter is equivalent to
> > the
> > > loop with simple checks thru 150*11*2 ~ 3000 elements. This might look
> a
> > > lot, but can work quite fast. So I'd just try it.
> > >
> > > As for extension which will be more efficient, it makes sense to
> consider
> > > implementing it. Let me think more about it and get back with the JIRA
> > > Issue to you :). But I'd suggest you trying existing FuzzyRowFilter
> > first.
> > > The output (performance) would give us some food for thinking, or may
> be
> > > even turns out to be acceptable for you (hopefully).
> > >
> > >> Can i run this kind of filter on HBase0.92 without doing any
> significant
> > > update to the cluster
> > >
> > > Until the next release, you'll have to use the FuzzyRowFilter as any
> > other
> > > custom filter. Just grab the patch from HBASE-6509 and copy the filter.
> > No
> > > need to patch & rebuild HBase.
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> > >
> > > [1]
> > >
> > > Anil Gupta added a comment - 18/Aug/12 04:37
> > > Hi Alex,
> > > I have a question related to this filter. I have a similar filtering
> > > requirement which will be an extension to FuzzyFilterRow.
> > > Suppose, i have the following structure of rowkeys: userid_actionid,
> > where
> > > userid is of 6 digit and then actionid is 5 digit. I would like to get
> > all
> > > the rows with actionid between 00200 to 00350. With current
> > FuzzyRowFilter
> > > i can search for all the rows a particular actionid. Instead of
> searching
> > > for a particular actionid i would like to search for a range of
> actionid.
> > > Does this use case sounds like an extension to current FuzzyRowFilter?
> > Can
> > > i run this kind of filter on HBase0.92 without doing any significant
> > update
> > > to the cluster. If i develop this kind of filter then what is needed to
> > run
> > > it on all the RS's?
> > > Thanks,
> > > Anil
> >
> >
>



-- 
Thanks & Regards,
Anil Gupta

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by Alex Baranau <al...@gmail.com>.

@Michael,

This is not a simple partial key scan. Take this example of rows:

aaaaa_100001_20120801
aaaaa_100001_20120802
aaaaa_100001_20120802
aaaaa_100001_20120803
aaaaa_100001_20120804
aaaaa_100001_20120805
aaaaa_100002_20120801
aaaaa_100002_20120802
aaaaa_100002_20120802
aaaaa_100002_20120803
aaaaa_100002_20120804
aaaaa_100002_20120805

where aaaaa is the userId, 10000x is the actionId and 201208xx is a
timestamp. If the query is to select actions in the range 20120803-20120805
(in this case the last 3 days), then when the scan encounters the row:

aaaaa_100001_20120801

it "knows" it can fast-forward the scan to "aaaaa_100001_20120803" and
skip some records (in practice, this may mean skipping a LOT of
records).
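
A minimal sketch of this skipping logic, in plain Java with no HBase
dependency. It is not the FuzzyRowFilter code from HBASE-6509; the class
DateRangeSkipScan, its methods, and the use of a TreeSet to stand in for the
sorted table are all assumptions made for illustration. It assumes the key
layout from the example above: a 13-character "userId_actionId_" prefix
followed by an 8-character date.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of a scan that fast-forwards over out-of-range dates.
final class DateRangeSkipScan {
    static final int PREFIX_LEN = 13; // length of "aaaaa_100001_"

    // Returns rowKey itself if its date is in [startDate, endDate];
    // otherwise returns the smallest key the scan may jump to.
    static String seekHint(String rowKey, String startDate, String endDate) {
        String prefix = rowKey.substring(0, PREFIX_LEN);
        String date = rowKey.substring(PREFIX_LEN);
        if (date.compareTo(startDate) < 0) {
            return prefix + startDate; // jump forward within this prefix
        }
        if (date.compareTo(endDate) <= 0) {
            return rowKey; // in range, nothing to skip
        }
        // Past the range: bump the last prefix character so the hint sorts
        // after every key that shares the current prefix.
        char last = prefix.charAt(PREFIX_LEN - 1);
        return prefix.substring(0, PREFIX_LEN - 1) + (char) (last + 1);
    }

    // Walks the sorted keys, counting in examined[0] how many are looked at.
    static List<String> scan(TreeSet<String> rows, String startDate,
                             String endDate, int[] examined) {
        List<String> matched = new ArrayList<>();
        String current = rows.isEmpty() ? null : rows.first();
        while (current != null) {
            examined[0]++;
            String hint = seekHint(current, startDate, endDate);
            if (hint.equals(current)) {
                matched.add(current);
                current = rows.higher(current); // plain "next row"
            } else {
                current = rows.ceiling(hint);   // fast-forward to the hint
            }
        }
        return matched;
    }
}
```

With the rows listed above and the range 20120803-20120805, this model
looks at only the first row of each userId_actionId group before jumping to
the start of the range; the rows for 20120802 are never examined.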

@Anil,

> Sample Query: I want to get all the event which happened in last month.

1. What other queries do you do? Just trying to understand why this row key
format was chosen.

2. Can you set the timestamp on Puts to the same timestamp that is
"assigned" to your record by app logic? If you can, then the first thing to
try is to perform the scan with scan.setTimeRange(startTs, stopTs).
Depending on how you write the data, this may help a lot with the speed of
reading by ts, because that way whole HFiles may be skipped from reading
based on ts. I don't know enough about your data to judge, but:
  * if you have relatively few users, most of whom have a long history
of interaction with your system (i.e. there are a lot of records for a
specific "userX_actionY"), and
  * if you write data with monotonically increasing timestamps, and
  * if your regions are not too big,
then this might help you, as it will increase the chance that some of the
HFiles will contain data *all of which* doesn't fall into the time interval
you select by. Otherwise, if written data items with different timestamps
are spread evenly across the HFiles, the chance that some HFiles are
skipped from reading is very small. I believe Lars George has illustrated
this in one of his presentations, but I couldn't find it quickly.
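
The effect can be sketched with a simplified model in plain Java. This is
not HBase's actual code; the class StoreFileTimeRangeSketch and its members
are hypothetical, and the only assumptions are that each file knows the
minimum and maximum timestamp it contains and that the scan range is
[startTs, stopTs), as with scan.setTimeRange.

```java
import java.util.List;

// Hypothetical model: decide per file whether its timestamps can overlap
// the scan's time range at all.
final class StoreFileTimeRangeSketch {
    static final class FileRange {
        final long minTs; // smallest timestamp stored in the file (inclusive)
        final long maxTs; // largest timestamp stored in the file (inclusive)

        FileRange(long minTs, long maxTs) {
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        // True if the file may hold cells with startTs <= ts < stopTs.
        boolean mayContain(long startTs, long stopTs) {
            return maxTs >= startTs && minTs < stopTs;
        }
    }

    // Number of files that cannot be skipped for the given time range.
    static int filesToRead(List<FileRange> files, long startTs, long stopTs) {
        int count = 0;
        for (FileRange f : files) {
            if (f.mayContain(startTs, stopTs)) count++;
        }
        return count;
    }
}
```

If data is written in timestamp order, three files might cover [0, 99],
[100, 199] and [200, 299], and a scan over [150, 180) has to read one of
them. If every file covers [0, 299], the same scan has to read all three.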

> something like FuzzyRowFilter with range

Yes, something like this looks like it would be very valuable. It would be
interesting to implement too. Let's see if I can find the time for that in
my work plan. If you want to try it yourself, go for it! Let me know if you
need help in that case ;)

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <mi...@hotmail.com>wrote:

> What row keys are you skipping?
>
> Using your example...
> You have a start row of 00000000200, and an end key of
> xFFxFFxFFxFFxFFxFF00350.
> Note that you could also write that end key as xFF(1..6) 01 since it looks
> like you're trying to match the 00 in positons 7 and 8 of your numeric
> string.
>
> Assuming that when you say ? you mean that you expect to have a character
> in that spot and that your row key is exactly 11 characters in length.
>
> While you may not return all the rows in that range, you do have to still
> check the row key, unless I am missing something.
>
> So what am I missing?
>
> On Aug 17, 2012, at 3:42 PM, Alex Baranau <al...@gmail.com>
> wrote:
>
> > There was a question [1] in
> > https://issues.apache.org/jira/browse/HBASE-6509JIRA comment, it makes
> > more sense to answer it here.
> >
> > With the current FuzzyRowFilter I believe the only way to approach the
> > problem is to add 150 fuzzy rules to the filter: ??????00200,
> ??????00201,
> > ..., ??????00350.
> >
> > As for performance of this approach I can say the following:
> > * there are two "checks" happening for each processed row key (i.e. those
> > row keys we don't skip)
> > * first one performs simple check if the given row key satisfies the
> fuzzy
> > rule and also determines if there's next row key to advance to (if this
> one
> > doesn't satisfy). The check takes up at max O(n), where n is the length
> of
> > fuzzy rule. I.e. this is done in one simple loop, which can be broken
> > before all bytes are checked. For m rules this will be O(m*n).
> > * second piece calculates the next row key to provide it as a hint for
> > fast-forwarding. We again check all rules and finding the smallest hint.
> > Operation is also done in one loop, i.e. O(m*n) here as well.
> >
> > With 150 fuzzy rules of length 11, the applying filter is equivalent to
> the
> > loop with simple checks thru 150*11*2 ~ 3000 elements. This might look a
> > lot, but can work quite fast. So I'd just try it.
> >
> > As for extension which will be more efficient, it makes sense to consider
> > implementing it. Let me think more about it and get back with the JIRA
> > Issue to you :). But I'd suggest you trying existing FuzzyRowFilter
> first.
> > The output (performance) would give us some food for thinking, or may be
> > even turns out to be acceptable for you (hopefully).
> >
> >> Can i run this kind of filter on HBase0.92 without doing any significant
> > update to the cluster
> >
> > Until the next release, you'll have to use the FuzzyRowFilter as any
> other
> > custom filter. Just grab the patch from HBASE-6509 and copy the filter.
> No
> > need to patch & rebuild HBase.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> >
> > [1]
> >
> > Anil Gupta added a comment - 18/Aug/12 04:37
> > Hi Alex,
> > I have a question related to this filter. I have a similar filtering
> > requirement which will be an extension to FuzzyFilterRow.
> > Suppose, i have the following structure of rowkeys: userid_actionid,
> where
> > userid is of 6 digit and then actionid is 5 digit. I would like to get
> all
> > the rows with actionid between 00200 to 00350. With current
> FuzzyRowFilter
> > i can search for all the rows a particular actionid. Instead of searching
> > for a particular actionid i would like to search for a range of actionid.
> > Does this use case sounds like an extension to current FuzzyRowFilter?
> Can
> > i run this kind of filter on HBase0.92 without doing any significant
> update
> > to the cluster. If i develop this kind of filter then what is needed to
> run
> > it on all the RS's?
> > Thanks,
> > Anil
>
>

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by Michael Segel <mi...@hotmail.com>.

What row keys are you skipping?

Using your example... 
You have a start row of 00000000200, and an end key of xFFxFFxFFxFFxFFxFF00350.
Note that you could also write that end key as xFF(1..6) 01 since it looks like you're trying to match the 00 in positions 7 and 8 of your numeric string.

Assuming that when you say ? you mean that you expect to have a character in that spot and that your row key is exactly 11 characters in length. 

While you may not return all the rows in that range, you still have to check the row key, unless I am missing something.

So what am I missing? 
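
To make the start/end key scan concrete, here is a small sketch in plain
Java (no HBase classes). The class RangeScanVsActionRange and its methods
are hypothetical, and the end key is treated as inclusive purely for
illustration. It shows that a key can fall between the start and end keys
while its actionid is outside 00200-00350, so a plain range scan has to
read and check every key in between.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model of a start/stop row scan over 11-character row keys.
final class RangeScanVsActionRange {
    // Unsigned lexicographic byte comparison, the order HBase sorts keys in.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    // ISO-8859-1 maps chars 0..255 to single bytes, so '\u00FF' becomes 0xFF.
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    static final byte[] START = bytes("00000000200");
    static final byte[] STOP =
        bytes("\u00FF\u00FF\u00FF\u00FF\u00FF\u00FF00350");

    // True if the key sorts between START and STOP (both inclusive here).
    static boolean inScanRange(String rowKey) {
        byte[] k = bytes(rowKey);
        return compareBytes(START, k) <= 0 && compareBytes(k, STOP) <= 0;
    }

    // True if the last 5 digits (the actionid) are within 00200..00350.
    static boolean actionInRange(String rowKey) {
        int actionId = Integer.parseInt(rowKey.substring(6));
        return actionId >= 200 && actionId <= 350;
    }
}
```

For example, the key 12345600999 is inside the scan range but its actionid
00999 is not wanted, while 12345600250 satisfies both checks.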

On Aug 17, 2012, at 3:42 PM, Alex Baranau <al...@gmail.com> wrote:

> There was a question [1] in
> https://issues.apache.org/jira/browse/HBASE-6509JIRA comment, it makes
> more sense to answer it here.
> 
> With the current FuzzyRowFilter I believe the only way to approach the
> problem is to add 150 fuzzy rules to the filter: ??????00200, ??????00201,
> ..., ??????00350.
> 
> As for performance of this approach I can say the following:
> * there are two "checks" happening for each processed row key (i.e. those
> row keys we don't skip)
> * first one performs simple check if the given row key satisfies the fuzzy
> rule and also determines if there's next row key to advance to (if this one
> doesn't satisfy). The check takes up at max O(n), where n is the length of
> fuzzy rule. I.e. this is done in one simple loop, which can be broken
> before all bytes are checked. For m rules this will be O(m*n).
> * second piece calculates the next row key to provide it as a hint for
> fast-forwarding. We again check all rules and finding the smallest hint.
> Operation is also done in one loop, i.e. O(m*n) here as well.
> 
> With 150 fuzzy rules of length 11, the applying filter is equivalent to the
> loop with simple checks thru 150*11*2 ~ 3000 elements. This might look a
> lot, but can work quite fast. So I'd just try it.
> 
> As for extension which will be more efficient, it makes sense to consider
> implementing it. Let me think more about it and get back with the JIRA
> Issue to you :). But I'd suggest you trying existing FuzzyRowFilter first.
> The output (performance) would give us some food for thinking, or may be
> even turns out to be acceptable for you (hopefully).
> 
>> Can i run this kind of filter on HBase0.92 without doing any significant
> update to the cluster
> 
> Until the next release, you'll have to use the FuzzyRowFilter as any other
> custom filter. Just grab the patch from HBASE-6509 and copy the filter. No
> need to patch & rebuild HBase.
> 
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> 
> [1]
> 
> Anil Gupta added a comment - 18/Aug/12 04:37
> Hi Alex,
> I have a question related to this filter. I have a similar filtering
> requirement which will be an extension to FuzzyFilterRow.
> Suppose, i have the following structure of rowkeys: userid_actionid, where
> userid is of 6 digit and then actionid is 5 digit. I would like to get all
> the rows with actionid between 00200 to 00350. With current FuzzyRowFilter
> i can search for all the rows a particular actionid. Instead of searching
> for a particular actionid i would like to search for a range of actionid.
> Does this use case sounds like an extension to current FuzzyRowFilter? Can
> i run this kind of filter on HBase0.92 without doing any significant update
> to the cluster. If i develop this kind of filter then what is needed to run
> it on all the RS's?
> Thanks,
> Anil

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

Posted by anil gupta <an...@gmail.com>.

Hi Alex,

Thanks for the answer. I have successfully compiled the FuzzyRowFilter class
against HBase 0.92. To try out FuzzyRowFilter, I'll need to make some
changes to my row key. So I'll get back to you with performance numbers
after loading the data and trying out FuzzyRowFilter for a particular value.

The range example I gave in my original post is very small. In my real use
case the range can run from 0 to 31536000. So, in my opinion, using the
current FuzzyRowFilter might not be a good idea. I agree with you that an
extension is the right way to solve this.
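
A rough sketch of the arithmetic, in plain Java. It applies the estimate
from the quoted message below (two passes, each at most m*n byte checks for
m rules of length n); the class FuzzyRuleCost and its methods are
hypothetical, and reusing the rule length of 11 for the large range is an
assumption, since the real key length is not given here.

```java
// Hypothetical helper for the per-row cost estimate 2 * m * n.
final class FuzzyRuleCost {
    // One rule per value in the inclusive range [lo, hi].
    static long rulesForRange(long lo, long hi) {
        return hi - lo + 1;
    }

    // Match check plus hint calculation, each at most rules * ruleLength.
    static long worstCaseChecksPerRow(long rules, long ruleLength) {
        return 2L * rules * ruleLength;
    }
}
```

With 150 rules of length 11 this gives 3,300 checks per processed row (the
"~3000" in the quoted message). Enumerating 0 through 31536000 needs
31,536,001 rules, which at the same assumed length comes to 693,792,022
checks per processed row.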

Here is my real use case:
I have a table in which I store events from customers, using
customerid+timestamp as the row key.
Sample Query: I want to get all the events which happened in the last month.
Current Possible Solutions:
1. I can do this filtering with a filter that checks the value of the
"timestamp" column. I think this will be highly inefficient.
2. The other possible way I can think of is to use RegexStringComparator
with RowFilter to get all the rows within a certain numeric range of
timestamps. In this case, too, every rowkey of the table will be checked.
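
A sketch of option 2 using java.util.regex directly as a stand-in for
RowFilter with a regex comparator (no HBase classes are used, and the class
RegexRowKeyRange is hypothetical). For simplicity it uses the small
00200-00350 example from the original question, with 11-character keys,
rather than a real timestamp range. The counter shows that the pattern is
evaluated against every row key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model: regex-based selection of a numeric range in a row key.
final class RegexRowKeyRange {
    // 6 arbitrary characters, then an actionid between 00200 and 00350.
    static final Pattern ACTION_200_TO_350 =
        Pattern.compile("^.{6}00(2[0-9]{2}|3[0-4][0-9]|350)$");

    // Returns the matching keys; evaluated[0] counts the keys tested.
    static List<String> filter(List<String> allRowKeys, int[] evaluated) {
        List<String> matched = new ArrayList<>();
        for (String key : allRowKeys) {
            evaluated[0]++; // no skipping: each key is tested
            if (ACTION_200_TO_350.matcher(key).matches()) {
                matched.add(key);
            }
        }
        return matched;
    }
}
```

Writing such a pattern by hand for an arbitrary numeric range gets awkward
quickly, which is one more reason a range-aware filter is attractive.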

So the best way would be to use something like FuzzyRowFilter with a
range. Also, my range will always be numerical, and this could be really
handy for others who store a timestamp in the rowkey and want to do
time-based queries using the rowkey.

Thanks,
Anil Gupta

On Fri, Aug 17, 2012 at 1:42 PM, Alex Baranau <al...@gmail.com>wrote:

> There was a question [1] in
> https://issues.apache.org/jira/browse/HBASE-6509JIRA comment, it makes
> more sense to answer it here.
>
> With the current FuzzyRowFilter I believe the only way to approach the
> problem is to add 150 fuzzy rules to the filter: ??????00200, ??????00201,
> ..., ??????00350.
>
> As for performance of this approach I can say the following:
> * there are two "checks" happening for each processed row key (i.e. those
> row keys we don't skip)
> * first one performs simple check if the given row key satisfies the fuzzy
> rule and also determines if there's next row key to advance to (if this one
> doesn't satisfy). The check takes up at max O(n), where n is the length of
> fuzzy rule. I.e. this is done in one simple loop, which can be broken
> before all bytes are checked. For m rules this will be O(m*n).
> * second piece calculates the next row key to provide it as a hint for
> fast-forwarding. We again check all rules and finding the smallest hint.
> Operation is also done in one loop, i.e. O(m*n) here as well.
>
> With 150 fuzzy rules of length 11, the applying filter is equivalent to the
> loop with simple checks thru 150*11*2 ~ 3000 elements. This might look a
> lot, but can work quite fast. So I'd just try it.
>
> As for extension which will be more efficient, it makes sense to consider
> implementing it. Let me think more about it and get back with the JIRA
> Issue to you :). But I'd suggest you trying existing FuzzyRowFilter first.
> The output (performance) would give us some food for thinking, or may be
> even turns out to be acceptable for you (hopefully).
>
> > Can i run this kind of filter on HBase0.92 without doing any significant
> update to the cluster
>
> Until the next release, you'll have to use the FuzzyRowFilter as any other
> custom filter. Just grab the patch from HBASE-6509 and copy the filter. No
> need to patch & rebuild HBase.
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1]
>
> Anil Gupta added a comment - 18/Aug/12 04:37
> Hi Alex,
> I have a question related to this filter. I have a similar filtering
> requirement which will be an extension to FuzzyFilterRow.
> Suppose, i have the following structure of rowkeys: userid_actionid, where
> userid is of 6 digit and then actionid is 5 digit. I would like to get all
> the rows with actionid between 00200 to 00350. With current FuzzyRowFilter
> i can search for all the rows a particular actionid. Instead of searching
> for a particular actionid i would like to search for a range of actionid.
> Does this use case sounds like an extension to current FuzzyRowFilter? Can
> i run this kind of filter on HBase0.92 without doing any significant update
> to the cluster. If i develop this kind of filter then what is needed to run
> it on all the RS's?
> Thanks,
> Anil
>

-- 
Thanks & Regards,
Anil Gupta