Posted to user@hbase.apache.org by Christian Schäfer <sy...@yahoo.de> on 2012/07/31 17:27:40 UTC

How to query by rowKey-infix

Hello there,

I designed a row key for queries that need the best possible performance (~100 ms); it looks like this:

userId-date-sessionId

These queries (scans) are always based on a userId, and sometimes additionally on a date.
That's no problem with the key above.
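A minimal sketch of such a left-anchored prefix scan (assuming string-encoded keys and the 0.90-era client API; the table name "sessions" is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserSessionScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sessions");   // hypothetical table name
        String prefix = "user42-20120731";             // userId, or userId-date
        // Range scan from the prefix up to just past it; '~' (0x7E) sorts
        // after the '-' separator and the alphanumerics used in these keys.
        Scan scan = new Scan(Bytes.toBytes(prefix), Bytes.toBytes(prefix + "~"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}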

However, another kind of query must be based on a given time range, where the leading userId is not given or known.
In this case I need to get all rows covering the given time range, with their dates, to create a daily report.

As I can't set wildcards at the beginning of a left-anchored index for the scan,
the only possibility I see is to scan the whole table and collect the
rowKeys that fall inside the time range I'm interested in.

Is there a more elegant way to collect rows within time range X? 
(Unfortunately, the date attribute is not equal to the timestamp that is stored by hbase automatically.)

Could/should one maybe leverage some kind of row key caching to accelerate the collection process?
Is that covered by the block cache?

Thanks in advance for any advice.

regards
Chris

Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
The point is that we want to make reports for each session, which could be spread over many rows distributed across all regions.

As I expect scanning columns to be slower than scanning row keys, I chose the latter.

I guess I may not (yet) share the schema.

The userID and session stuff mentioned is just there to illustrate a comparable situation.


Thanks,
Chris


Re: How to query by rowKey-infix

Posted by Michael Segel <mi...@hotmail.com>.
Hi, 

What does your schema look like? 

Would it make sense to change the key to user_id '|' timestamp and then use the session_id in the column name?
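A minimal sketch of that suggested layout on the write path (assuming the 0.90-era Put.add API; the table handle and the column family "d" are hypothetical):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SessionAsColumn {
    // One row per user and time; the session id moves into the column qualifier.
    static void write(HTable table, String userId, long timestamp,
                      String sessionId, byte[] value) throws IOException {
        Put put = new Put(Bytes.toBytes(userId + "|" + timestamp));
        put.add(Bytes.toBytes("d"), Bytes.toBytes(sessionId), value);
        table.put(put);
    }
}

That way a scan over one user (and a time range) stays a pure key-range scan, and all sessions for a given user and time land in one row.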



On Aug 2, 2012, at 7:23 AM, Christian Schäfer <sy...@yahoo.de> wrote:

> OK,
> 
> at first I will try the scans.
> 
> If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
> 
> Currently I'm stuck at the scans because they require two steps (therefore some kind of filter chaining is required)
>
> The key:  userId-dateInMillis-sessionId
>
> At first I need to extract dateInMillis with a regex or substring (using special delimiters for the date)
>
> Second, the extracted value must be parsed to Long and set on a RowFilter comparator like this:
>
> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
> ----- Original Message -----
> From: Michael Segel <mi...@hotmail.com>
> To: user@hbase.apache.org
> CC: 
> Sent: 13:52 Wednesday, August 1, 2012
> Subject: Re: How to query by rowKey-infix
> 
> Actually, with coprocessors you can create a secondary index in short order. 
> Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive. 
> 
> On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
> 
>> When deciding between a table scan vs secondary index, you should try to
>> estimate what percent of the underlying data blocks will be used in the
>> query.  By default, each block is 64KB.
>> 
>> If each user's data is small and you are fitting multiple users per block,
>> then you're going to need all the blocks, so a tablescan is better because
>> it's simpler.  If each user has 1MB+ data then you will want to pick out
>> the individual blocks relevant to each date.  The secondary index will help
>> you go directly to those sparse blocks, but with a cost in complexity,
>> consistency, and extra denormalized data that knocks primary data out of
>> your block cache.
>> 
>> If latency is not a concern, I would start with the table scan.  If that's
>> too slow you add the secondary index, and if you still need it faster you
>> do the primary key lookups in parallel as Jerry mentions.
>> 
>> Matt
>> 
>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
>> 
>>> Hi Chris:
>>> 
>>> I'm thinking about building a secondary index for primary key lookup, then
>>> query using the primary keys in parallel.
>>> 
>>> I'm interested to see if there are other options too.
>>> 
>>> Best Regards,
>>> 
>>> Jerry
>>> 
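A sketch of the two-fetch pattern Michael and Jerry describe (a client-side dual write stands in for a coprocessor that would maintain the index server-side; the index key layout and all table/family names are hypothetical):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexSketch {
    static final byte[] FAM = Bytes.toBytes("d");

    // Zero-pad millis so the string form sorts in numeric order.
    static String pad(long ts) {
        return String.format("%013d", ts);
    }

    // Write the data row plus a time-leading index row that points back to it.
    static void write(HTable data, HTable index, String userId, long ts,
                      String sessionId, byte[] value) throws IOException {
        String dataKey = userId + "-" + pad(ts) + "-" + sessionId;
        Put p = new Put(Bytes.toBytes(dataKey));
        p.add(FAM, Bytes.toBytes("v"), value);
        data.put(p);
        Put ip = new Put(Bytes.toBytes(pad(ts) + "|" + dataKey));
        ip.add(FAM, Bytes.toBytes("k"), Bytes.toBytes(dataKey));
        index.put(ip);
    }

    // Fetch 1: range-scan the index for [from, to). Fetch 2: get the referenced
    // rows (these gets could be issued in parallel, as Jerry suggests).
    static List<Result> byTimeRange(HTable data, HTable index, long from, long to)
            throws IOException {
        List<Result> out = new ArrayList<Result>();
        ResultScanner rs = index.getScanner(
            new Scan(Bytes.toBytes(pad(from)), Bytes.toBytes(pad(to))));
        try {
            for (Result r : rs) {
                out.add(data.get(new Get(r.getValue(FAM, Bytes.toBytes("k")))));
            }
        } finally {
            rs.close();
        }
        return out;
    }
}

The cost Michael mentions is visible here: one scan over the (small) index plus one get per hit, traded against keeping two tables consistent.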


Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Hi Matt,

sure, I have this in mind as a last option (at least on a limited subset of the data).

Since we expect some billions of rows per week, selective filtering needs to take place on the server side.

But I agree that one could do fine-grained filtering on the client side, on a handy data subset, to avoid making the HBase schema & indexing (via coprocessors) too complicated.

regards
Chris





Re: How to query by rowKey-infix

Posted by Matt Corgan <mc...@hotpads.com>.
Yeah - just thought i'd point it out since people often have small tables
in their cluster alongside the big ones, and when generating reports,
sometimes you don't care if it finishes in 10 minutes vs an hour.



Re: How to query by rowKey-infix

Posted by Alex Baranau <al...@gmail.com>.
I think this is exactly what Christian is trying to (and should be trying
to) avoid ;).

I can't imagine a use-case where you need to filter something, can do it with
(at least) a server-side filter, and yet want to try to do it on the client
side... Doing filtering on the client side when you can do it on the server
side just feels wrong. Especially given that there's a lot of data in HBase
(otherwise why would you use it?).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr


Re: How to query by rowKey-infix

Posted by Matt Corgan <mc...@hotpads.com>.
Also Christian, don't forget you can read all the rows back to the client
and do the filtering there using whatever logic you like.  HBase Filters
can be thought of as an optimization (predicate push-down) over client-side
filtering.  Pulling all the rows over the network will be slower, but I
don't think we know enough about your data or speed requirements to rule it
out.


On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau <al...@gmail.com> wrote:

> Hi Christian!
>
> If we put secondary indexes aside and assume you are going with "heavy
> scans", you can try the following two things to make them much faster, if
> this is appropriate to your situation, of course.
>
> 1.
>
> > Is there a more elegant way to collect rows within time range X?
> > (Unfortunately, the date attribute is not equal to the timestamp that is
> > stored by hbase automatically.)
>
> Can you set timestamp of the Puts to the one you have in row key? Instead
> of relying on the one that HBase puts automatically (current ts). If you
> can, this will improve reading speed a lot by setting time range on
> scanner. Depending on how you are writing your data of course, but I assume
> that you mostly write data in "time-increasing" manner.
>
> 2.
>
> If your userId has a fixed length, or you can change it so that it has a
> fixed length, then you can actually use something like a "wildcard" in the
> row key. There's a way in a Filter implementation to fast-forward to a
> record with a specific row key and by doing this skip many records. This
> might be used as follows:
> * suppose your userId is 5 characters in length
> * suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> * when you scan records and you hit e.g. key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
> the scanner from your filter to fast-forward to key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" don't fall into
> the interval you need (the time for its records will be >= 2012-08-09).
>
> As of now, I believe you will have to implement your custom filter to do
> that.
> Pointer:
> org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> I believe I implemented a similar thing some time ago. If this idea works
> for you I could look for the implementation and share it if it helps. Or
> maybe even simply add it to the HBase codebase.
>
> Hope this helps,
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
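A sketch of that fast-forward idea as a custom filter (this is not the implementation Alex mentions, which later became the FuzzyRowFilter of HBASE-6509; it assumes a 0.92-era Filter API, a fixed 5-character userId, and keys shaped like "uuuuu_yyyy-MM-dd_session"). The Writable methods are needed so region servers, which must have the class on their classpath, can deserialize it:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.Filter.ReturnCode;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class DateInfixFilter extends FilterBase {
    private static final int USER_LEN = 5;
    private String from;  // e.g. "2012-08-01", inclusive
    private String to;    // e.g. "2012-08-08", inclusive
    private byte[] hint;  // row key to fast-forward to

    public DateInfixFilter() {}  // required for Writable deserialization

    public DateInfixFilter(String from, String to) {
        this.from = from;
        this.to = to;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        String row = Bytes.toString(kv.getRow());
        String user = row.substring(0, USER_LEN);
        String date = row.substring(USER_LEN + 1, USER_LEN + 1 + from.length());
        if (date.compareTo(from) < 0) {   // too early: jump ahead within this user
            hint = Bytes.toBytes(user + "_" + from);
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        if (date.compareTo(to) > 0) {     // past the range: jump to the next user
            byte[] next = Bytes.toBytes(user);
            next[next.length - 1]++;      // "aaaaa" -> "aaaab" (ignores 0xFF overflow)
            hint = Bytes.add(next, Bytes.toBytes("_" + from));
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        return ReturnCode.INCLUDE;
    }

    @Override
    public KeyValue getNextKeyHint(KeyValue currentKV) {
        return KeyValue.createFirstOnRow(hint);
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }
}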
>
>
> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
>
> >
> >
> > Excuse my double posting.
> > Here is the complete mail:
> >
> >
> > OK,
> >
> > at first I will try the scans.
> >
> > If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> > to be able to use coprocessors.
> >
> >
> > Currently I'm stuck at the scans because it requires two steps (therefore
> > maybe some kind of filter chaining is required)
> >
> >
> > The key:  userId-dateInMillis-sessionId
> >
> > At first I need to extract dateInMillis with regex or substring (using
> > special delimiters for date)
> >
> > Second, the extracted value must be parsed to Long and set to a RowFilter
> > Comparator like this:
> >
> > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> > BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >
> > How to chain that?
> > Do I have to write a custom filter?
> > (Would like to avoid that due to deployment)
> >
> > regards
> > Chris
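On the "how to chain that" question quoted above: stock filters can be combined with a FilterList, where Operator.MUST_PASS_ALL acts as a logical AND. A sketch using only 0.90-era classes (the second filter and the '-' delimiter convention are assumptions):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;

public class ChainedFilters {
    // ANDs two filters: the row key must contain the delimited date infix
    // and also satisfy whatever second condition is passed in.
    static Scan scanForDateInfix(String date, Filter other) {
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new RowFilter(CompareOp.EQUAL,
                new SubstringComparator("-" + date + "-")));
        filters.addFilter(other);
        Scan scan = new Scan();
        scan.setFilter(filters);
        return scan;
    }
}

A substring RowFilter still has to examine every row key, though, which is why the setTimeRange route discussed later in the thread tends to win.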

Re: How to query by rowKey-infix

Posted by anil gupta <an...@gmail.com>.
Christian: I'm slightly shocked about the processing time of more than 2
mins to return 225 rows. I would actually need a response in 5-10 sec.
Anil: I started getting responses within 1-2 sec of firing the query, but
I got all 225 results in 2 mins. My table had 34 million rows, and each row
had 25 columns on average. The average size of each row is around 1.21 KB,
and one replica is ~40 GB in HDFS.
I haven't compared timestamp-based filtering with column-value-based
filtering. However, I strongly believe that timestamp-based filtering will
be the winner, because it can skip blocks.
Regarding the concern that my query took 2 min: one reason is that the
hardware configuration is way below par, so I don't really look for blazing
fast performance on this cluster. On a really well tuned HBase setup the
performance can easily improve by 3-4x (the query would be done in 20-30
seconds). But I don't think you can get blazing fast results like the ones
we get when we scan based on the row key.

Christian: In your timestamp-based filtering, do you check the timestamp
as part of the row key or do you use the put timestamp (as I do)?
Anil: I use the timestamp via Scan.setTimeRange(long, long). In my use
case I am not using the row key at all. So it is roughly a full table scan,
but the timestamp is doing all the magic. It's a definite advantage if you
can use the row key in your query.

Christian: Is it a full table scan where each row's key is checked against
a given timestamp/timerange?
Anil: Essentially it's a full table scan, since I am not using any row key
or other filters.

Christian: How many rows are scanned/touched by your timestamp-based
filtering?
Anil: I don't know how to get these stats. Can anyone enlighten me? I am
also curious about this.

I'll also try to run the column-value-based filter so that we get some more
insight into the best options available. Let me know your thoughts on my
reply.

Thanks,
Anil Gupta
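A sketch of the approach Anil describes: write each Put with the business date as its explicit cell timestamp, then let Scan.setTimeRange skip store files whose time ranges don't overlap the query (family and qualifier names are hypothetical):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    static final byte[] FAM = Bytes.toBytes("d");

    // Use the business date, not the insert time, as the cell timestamp.
    static void write(HTable table, byte[] rowKey, long dateInMillis, byte[] value)
            throws IOException {
        Put put = new Put(rowKey);
        put.add(FAM, Bytes.toBytes("v"), dateInMillis, value);
        table.put(put);
    }

    // Return only cells whose timestamp falls in [from, to).
    static ResultScanner byTimeRange(HTable table, long from, long to)
            throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(from, to);
        return table.getScanner(scan);
    }
}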


On Thu, Aug 23, 2012 at 1:41 AM, Christian Schäfer <sy...@yahoo.de> wrote:

> Hi Anil,
>
> to restrict data to a certain time window I also set timerange for the
> scan.
>
>
>
> How many rows are scanned/touched  at your timestamp based filtering?
>
>
>
> My use case of obtaining data by substring comparator operates on the row
> key.
> It can't be replaced by setting the time range in my case, really.
>
> Btw. the scan is additionally restricted to a certain timerange to
> increase skipping of irrelevant files and thus improve performance.
>
>
> regards,
> Christian
>
>
>
> ----- Original Message -----
> From: anil gupta <an...@gmail.com>
> To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> CC:
> Sent: 20:42 Wednesday, August 22, 2012
> Subject: Re: How to query by rowKey-infix
>
> Hi Christian,
>
> I had similar requirements to yours. So far I have used
> timestamps for filtering the data, and I would say the performance is
> satisfactory. Here are the results of timestamp-based filtering:
> the table has 34 million records (average row size is 1.21 KB); in 136
> seconds I get the entire result of a query that returned 225 rows.
> I am running an HBase 0.92, 8-node cluster on VMware Hypervisor. Each node
> has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up
> hosts 2 slave instances (2 VMs running DataNode, NodeManager, RegionServer).
> I have only allocated 1200 MB for the RegionServers, and I haven't changed
> the block size of HDFS or HBase. Considering the below-par hardware
> configuration of the cluster I feel the performance is OK, and IMO it will
> be better than a substring comparator on column values, since with a
> substring comparator filter you are essentially doing a FULL TABLE scan,
> whereas a timerange-based scan can *Skip Store Files*.
>
> On a side note, Alex created a JIRA for enhancing the current
> FuzzyRowFilter to do range based filtering also. Here is the link:
> https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
> welcome if you would like to chime in.
>
> HTH,
> Anil Gupta
>
>
> On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <syrious3000@yahoo.de> wrote:
>
> > Nice. Thanks Alex for sharing your experiences with that custom filter
> > implementation.
> >
> >
> > Currently I'm still using key filter with substring comparator.
> > As soon as I have a good amount of test data I will measure the performance of
> > that naive substring filter in comparison to your fuzzy row filter.
> >
> > regards,
> > Christian
> >
> >
> >
> > ________________________________
> > From: Alex Baranau <al...@gmail.com>
> > To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> > Sent: 22:18 Thursday, August 9, 2012
> > Subject: Re: How to query by rowKey-infix
> >
> >
> > jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> > add documentation to HBase book very soon [1]
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6526
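For reference, a sketch of how the committed FuzzyRowFilter is used (assuming the fixed 5-character userId of Alex's earlier example; a mask byte of 1 marks a position that may vary, 0 one that must match; one fuzzy pair matches a single exact date, and range support is what HBASE-6618 proposes):

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScan {
    static Scan forDate(String date) {                // e.g. "2012-08-01"
        byte[] key = Bytes.toBytes("?????_" + date);  // '?' placeholders for userId
        byte[] mask = new byte[key.length];           // 0 = byte must match
        Arrays.fill(mask, 0, 5, (byte) 1);            // first 5 bytes (userId) may vary
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
                Collections.singletonList(new Pair<byte[], byte[]>(key, mask))));
        return scan;
    }
}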
> >
> > On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <al...@gmail.com>
> > wrote:
> >
> > Good!
> > >
> > >
> > >Submitted initial patch of fuzzy row key filter at
> > https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> > filter class and include it in your code and use it in your setup as any
> > other custom filter (no need to patch HBase).
> > >
> > >
> > >Please let me know if you try it out (or post your comments at
> > HBASE-6509).
> > >
> > >
> > >Alex Baranau
> > >------
> > >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > >
> > >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> > >
> > >Hi Alex,
> > >>
> > >>thanks a lot for the hint about setting the timestamp of the put.
> > >>I didn't know that this would be possible, but that's solving the
> > >>problem (first test was successful).
> > >>So I'm really glad that I don't need to apply a filter to extract the
> > >>time and so on for every row.
> > >>
> > >>Nevertheless I would like to see your custom filter implementation.
> > >>Would be nice if you could provide it helping me to get a bit into it.
> > >>
> > >>And yes that helped :)
> > >>
> > >>regards
> > >>Chris


-- 
Thanks & Regards,
Anil Gupta

Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Hi Anil,

to restrict data to a certain time window I also set a timerange for the scan.

I'm slightly shocked about a processing time of more than 2 minutes to return 225 rows.
I would actually need a response in 5-10 sec.
In your timestamp-based filtering, do you check the timestamp as part of the row key, or do you use the put timestamp (as I do)?
How many rows are scanned/touched by your timestamp-based filtering?

Is it a full table scan where each row's key is checked against a given timestamp/timerange?


My use case of obtaining data by a substring comparator operates on the row key.
It can't really be replaced by setting the time range in my case.

Btw. the scan is additionally restricted to a certain timerange to skip more irrelevant store files and thus improve performance.
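
Concretely, such a scan looks roughly like this (a sketch only; the class and parameter names are made up):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;

public class SessionReportScan {
    // Keep only rows whose key contains the given date infix, and
    // additionally restrict the put timestamps to [from, to) so that
    // store files outside the window can be skipped.
    public static Scan build(String dateInfix, long from, long to)
            throws IOException {
        Scan scan = new Scan();
        scan.setFilter(new RowFilter(CompareOp.EQUAL,
                new SubstringComparator(dateInfix)));  // e.g. "2012-08-01"
        scan.setTimeRange(from, to);
        return scan;
    }
}

(If more filters are needed, they can be combined with a FilterList.)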

 
regards,
Christian



----- Original Message -----
From: anil gupta <an...@gmail.com>
To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
CC:
Sent: 20:42 Wednesday, 22 August 2012
Subject: Re: How to query by rowKey-infix

Hi Christian,

I had similar requirements to yours. So, till now I have used
timestamps for filtering the data and I would say the performance is
satisfactory. Here are the results of timestamp-based filtering:
the table has 34 million records (average row size is 1.21 KB); in 136
seconds I get the entire result of a query which had 225 rows.
I am running an HBase 0.92, 8-node cluster on VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up
is hosting 2 slave instances (2 VMs running DataNode,
NodeManager, RegionServer). I have only allocated 1200 MB for the RS's. I haven't
done any modification to the block size of HDFS or HBase. Considering the
below-par hardware configuration of the cluster I feel the performance is OK,
and IMO it'll be better than a substring comparator on column values, since with
a substring comparator filter you are essentially doing a FULL TABLE scan,
whereas with a timerange-based scan you can *skip store files*.

On a side note, Alex created a JIRA for enhancing the current
FuzzyRowFilter to do range-based filtering also. Here is the link:
https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
welcome to chime in.

HTH,
Anil Gupta


On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <sy...@yahoo.de> wrote:

> Nice. Thanks Alex for sharing your experiences with that custom filter
> implementation.
>
>
> Currently I'm still using a row key filter with a substring comparator.
> As soon as I've got a good amount of test data I will measure the performance of
> that naive substring filter in comparison to your fuzzy row filter.
>
> regards,
> Christian
>
>
>
> ________________________________
> Von: Alex Baranau <al...@gmail.com>
> An: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> Gesendet: 22:18 Donnerstag, 9.August 2012
> Betreff: Re: How to query by rowKey-infix
>
>
> jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> add documentation to HBase book very soon [1]
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] https://issues.apache.org/jira/browse/HBASE-6526
>
> On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <al...@gmail.com>
> wrote:
>
> Good!
> >
> >
> >Submitted initial patch of fuzzy row key filter at
> https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> filter class and include it in your code and use it in your setup as any
> other custom filter (no need to patch HBase).
> >
> >
> >Please let me know if you try it out (or post your comments at
> HBASE-6509).
> >
> >
> >Alex Baranau
> >------
> >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> >
> >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <sy...@yahoo.de>
> wrote:
> >
> >Hi Alex,
> >>
> >>thanks a lot for the hint about setting the timestamp of the put.
> >>I didn't know that this would be possible but that's solving the problem
> (first test was successful).
> >>So I'm really glad that I don't need to apply a filter to extract the
> time and so on for every row.
> >>
> >>Nevertheless I would like to see your custom filter implementation.
> >>Would be nice if you could provide it to help me get into it a bit.
> >>
> >>And yes that helped :)
> >>
> >>regards
> >>Chris
> >>
> >>
> >>
> >>________________________________
> >>From: Alex Baranau <al...@gmail.com>
> >>To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> >>Sent: 0:57 Friday, 3 August 2012
> >>
> >>Subject: Re: How to query by rowKey-infix
> >>
> >>
> >>Hi Christian!
> >>If to put off secondary indexes and assume you are going with "heavy
> scans", you can try two following things to make it much faster. If this is
> appropriate to your situation, of course.
> >>
> >>1.
> >>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> is stored by hbase automatically.)
> >>
> >>Can you set timestamp of the Puts to the one you have in row key?
> Instead of relying on the one that HBase puts automatically (current ts).
> If you can, this will improve reading speed a lot by setting time range on
> scanner. Depending on how you are writing your data of course, but I assume
> that you mostly write data in "time-increasing" manner.
> >>
> >>
> >>2.
> >>
> >>If your userId has a fixed length, or you can change it so that it has a
> fixed length, then you can actually use something like a "wildcard" in the row key.
> There's a way in a Filter implementation to fast-forward to the record with a
> specific row key and by doing this skip many records. This might be used as
> follows:
> >>* suppose your userId is 5 characters in length
> >>* suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> >>* when you are scanning records and you encounter e.g. the key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
> the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" don't fall into
> the interval you need (as the time for its records will be >= 2012-08-09).
> >>
> >>As of now, I believe you will have to implement your custom filter to do
> that.
> Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> >>I believe I implemented a similar thing some time ago. If this idea works
> for you I could look for the implementation and share it if it helps. Or
> maybe even simply add it to the HBase codebase.
> >>
> >>Hope this helps,
> >>
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >>
> >>
> >>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de>
> wrote:
> >>
> >>
> >>>
> >>>Excuse my double posting.
> >>>Here is the complete mail:
> >>>
> >>>
> >>>
> >>>OK,
> >>>
> >>>at first I will try the scans.
> >>>
> >>>If that's too slow I will have to upgrade hbase (currently
> 0.90.4-cdh3u2) to be able to use coprocessors.
> >>>
> >>>
> >>>Currently I'm stuck at the scans because it requires two steps
> (therefore maybe some kind of filter chaining is required)
> >>>
> >>>
> >>>The key:  userId-dateInMillis-sessionId
> >>>
> >>>
> >>>At first I need to extract dateInMillis with regex or substring (using
> special delimiters for date)
> >>>
> >>>Second, the extracted value must be parsed to Long and set to a
> RowFilter Comparator like this:
> >>>
> >>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >>>
> >>>How to chain that?
> >>>Do I have to write a custom filter?
> >>>(Would like to avoid that due to deployment)
> >>>
> >>>regards
> >>>Chris
> >>>
> >>>
> >>>----- Original Message -----
> >>>From: Michael Segel <mi...@hotmail.com>
> >>>To: user@hbase.apache.org
> >>>CC:
> >>>Sent: 13:52 Wednesday, 1 August 2012
> >>>Subject: Re: How to query by rowKey-infix
> >>>
> >>>Actually w coprocessors you can create a secondary index in short order.
> >>>Then your cost is going to be 2 fetches. Trying to do a partial table
> scan will be more expensive.
> >>>
> >>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >>>
> >>>> When deciding between a table scan vs secondary index, you should try
> to
> >>>> estimate what percent of the underlying data blocks will be used in
> the
> >>>> query.  By default, each block is 64KB.
> >>>>
> >>>> If each user's data is small and you are fitting multiple users per
> block,
> >>>> then you're going to need all the blocks, so a tablescan is better
> because
> >>>> it's simpler.  If each user has 1MB+ data then you will want to pick
> out
> >>>> the individual blocks relevant to each date.  The secondary index
> will help
> >>>> you go directly to those sparse blocks, but with a cost in complexity,
> >>>> consistency, and extra denormalized data that knocks primary data out
> of
> >>>> your block cache.
> >>>>
> >>>> If latency is not a concern, I would start with the table scan.  If
> that's
> >>>> too slow you add the secondary index, and if you still need it faster
> you
> >>>> do the primary key lookups in parallel as Jerry mentions.
> >>>>
> >>>> Matt
> >>>>
> >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi Chris:
> >>>>>
> >>>>> I'm thinking about building a secondary index for primary key
> lookup, then
> >>>>> query using the primary keys in parallel.
> >>>>>
> >>>>> I'm interested to see if there is other option too.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Jerry
> >>>>>
> >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> syrious3000@yahoo.de
> >>>>>> wrote:
> >>>>>
> >>>>>> Hello there,
> >>>>>>
> >>>>>> I designed a row key for queries that need best performance (~100
> ms)
> >>>>>> which looks like this:
> >>>>>>
> >>>>>> userId-date-sessionId
> >>>>>>
> >>>>>> These queries(scans) are always based on a userId and sometimes
> >>>>>> additionally on a date, too.
> >>>>>> That's no problem with the key above.
> >>>>>>
> >>>>>> However, another kind of queries shall be based on a given time
> range
> >>>>>> whereas the outermost left userId is not given or known.
> >>>>>> In this case I need to get all rows covering the given time range
> with
> >>>>>> their date to create a daily reporting.
> >>>>>>
> >>>>>> As I can't set wildcards at the beginning of a left-based index for
> the
> >>>>>> scan,
> >>>>>> I only see the possibility to scan the index of the whole table to
> >>>>> collect
> >>>>>> the
> >>>>>> rowKeys that are inside the timerange I'm interested in.
> >>>>>>
> >>>>>> Is there a more elegant way to collect rows within time range X?
> >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> that is
> >>>>>> stored by hbase automatically.)
> >>>>>>
> >>>>>> Could/should one maybe leverage some kind of row key caching to
> >>>>> accelerate
> >>>>>> the collection process?
> >>>>>> Is that covered by the block cache?
> >>>>>>
> >>>>>> Thanks in advance for any advice.
> >>>>>>
> >>>>>> regards
> >>>>>> Chris
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >>--
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >
> >
> >
> >--
> >
> >Alex Baranau
> >------
> >Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >
>



-- 
Thanks & Regards,
Anil Gupta


Re: How to query by rowKey-infix

Posted by anil gupta <an...@gmail.com>.
Hi Christian,

I had similar requirements to yours. So, till now I have used
timestamps for filtering the data and I would say the performance is
satisfactory. Here are the results of timestamp-based filtering:
the table has 34 million records (average row size is 1.21 KB); in 136
seconds I get the entire result of a query which had 225 rows.
I am running an HBase 0.92, 8-node cluster on VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up
is hosting 2 slave instances (2 VMs running DataNode,
NodeManager, RegionServer). I have only allocated 1200 MB for the RS's. I haven't
done any modification to the block size of HDFS or HBase. Considering the
below-par hardware configuration of the cluster I feel the performance is OK,
and IMO it'll be better than a substring comparator on column values, since with
a substring comparator filter you are essentially doing a FULL TABLE scan,
whereas with a timerange-based scan you can *skip store files*.
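
In code, the timerange restriction for a one-day report boils down to something like this (a sketch; the class name is made up, and it only skips store files if the puts were written with the event time as their timestamp):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;

public class DailyReportScan {
    private static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Scan one calendar day's worth of puts by timestamp alone.
    public static Scan forDay(long dayStartMillis) throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(dayStartMillis, dayStartMillis + DAY_MS);
        return scan;
    }
}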

On a side note, Alex created a JIRA for enhancing the current
FuzzyRowFilter to do range-based filtering also. Here is the link:
https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
welcome to chime in.

HTH,
Anil Gupta


On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <sy...@yahoo.de> wrote:

> Nice. Thanks Alex for sharing your experiences with that custom filter
> implementation.
>
>
> Currently I'm still using a row key filter with a substring comparator.
> As soon as I've got a good amount of test data I will measure the performance of
> that naive substring filter in comparison to your fuzzy row filter.
>
> regards,
> Christian
>
>
>
> ________________________________
> From: Alex Baranau <al...@gmail.com>
> To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> Sent: 22:18 Thursday, 9 August 2012
> Subject: Re: How to query by rowKey-infix
>
>
> jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> add documentation to HBase book very soon [1]
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] https://issues.apache.org/jira/browse/HBASE-6526
>
> On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <al...@gmail.com>
> wrote:
>
> Good!
> >
> >
> >Submitted initial patch of fuzzy row key filter at
> https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> filter class and include it in your code and use it in your setup as any
> other custom filter (no need to patch HBase).
> >
> >
> >Please let me know if you try it out (or post your comments at
> HBASE-6509).
> >
> >
> >Alex Baranau
> >------
> >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> >
> >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <sy...@yahoo.de>
> wrote:
> >
> >Hi Alex,
> >>
> >>thanks a lot for the hint about setting the timestamp of the put.
> >>I didn't know that this would be possible but that's solving the problem
> (first test was successful).
> >>So I'm really glad that I don't need to apply a filter to extract the
> time and so on for every row.
> >>
> >>Nevertheless I would like to see your custom filter implementation.
> >>Would be nice if you could provide it to help me get into it a bit.
> >>
> >>And yes that helped :)
> >>
> >>regards
> >>Chris
> >>
> >>
> >>
> >>________________________________
> >>From: Alex Baranau <al...@gmail.com>
> >>To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> >>Sent: 0:57 Friday, 3 August 2012
> >>
> >>Subject: Re: How to query by rowKey-infix
> >>
> >>
> >>Hi Christian!
> >>If we set aside secondary indexes and assume you are going with "heavy
> scans", you can try the following two things to make it much faster. If this is
> appropriate to your situation, of course.
> >>
> >>1.
> >>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> is stored by hbase automatically.)
> >>
> >>Can you set timestamp of the Puts to the one you have in row key?
> Instead of relying on the one that HBase puts automatically (current ts).
> If you can, this will improve reading speed a lot by setting time range on
> scanner. Depending on how you are writing your data of course, but I assume
> that you mostly write data in "time-increasing" manner.
> >>
> >>
> >>2.
> >>
> >>If your userId has a fixed length, or you can change it so that it has a
> fixed length, then you can actually use something like a "wildcard" in the row key.
> There's a way in a Filter implementation to fast-forward to the record with a
> specific row key and by doing this skip many records. This might be used as
> follows:
> >>* suppose your userId is 5 characters in length
> >>* suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> >>* when you are scanning records and you encounter e.g. the key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
> the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" don't fall into
> the interval you need (as the time for its records will be >= 2012-08-09).
> >>
> >>As of now, I believe you will have to implement your custom filter to do
> that.
> Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> >>I believe I implemented a similar thing some time ago. If this idea works
> for you I could look for the implementation and share it if it helps. Or
> maybe even simply add it to the HBase codebase.
> >>
> >>Hope this helps,
> >>
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >>
> >>
> >>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de>
> wrote:
> >>
> >>
> >>>
> >>>Excuse my double posting.
> >>>Here is the complete mail:
> >>>
> >>>
> >>>
> >>>OK,
> >>>
> >>>at first I will try the scans.
> >>>
> >>>If that's too slow I will have to upgrade hbase (currently
> 0.90.4-cdh3u2) to be able to use coprocessors.
> >>>
> >>>
> >>>Currently I'm stuck at the scans because it requires two steps
> (therefore maybe some kind of filter chaining is required)
> >>>
> >>>
> >>>The key:  userId-dateInMillis-sessionId
> >>>
> >>>
> >>>At first I need to extract dateInMillis with regex or substring (using
> special delimiters for date)
> >>>
> >>>Second, the extracted value must be parsed to Long and set to a
> RowFilter Comparator like this:
> >>>
> >>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >>>
> >>>How to chain that?
> >>>Do I have to write a custom filter?
> >>>(Would like to avoid that due to deployment)
> >>>
> >>>regards
> >>>Chris
> >>>
> >>>
> >>>----- Original Message -----
> >>>From: Michael Segel <mi...@hotmail.com>
> >>>To: user@hbase.apache.org
> >>>CC:
> >>>Sent: 13:52 Wednesday, 1 August 2012
> >>>Subject: Re: How to query by rowKey-infix
> >>>
> >>>Actually w coprocessors you can create a secondary index in short order.
> >>>Then your cost is going to be 2 fetches. Trying to do a partial table
> scan will be more expensive.
> >>>
> >>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >>>
> >>>> When deciding between a table scan vs secondary index, you should try
> to
> >>>> estimate what percent of the underlying data blocks will be used in
> the
> >>>> query.  By default, each block is 64KB.
> >>>>
> >>>> If each user's data is small and you are fitting multiple users per
> block,
> >>>> then you're going to need all the blocks, so a tablescan is better
> because
> >>>> it's simpler.  If each user has 1MB+ data then you will want to pick
> out
> >>>> the individual blocks relevant to each date.  The secondary index
> will help
> >>>> you go directly to those sparse blocks, but with a cost in complexity,
> >>>> consistency, and extra denormalized data that knocks primary data out
> of
> >>>> your block cache.
> >>>>
> >>>> If latency is not a concern, I would start with the table scan.  If
> that's
> >>>> too slow you add the secondary index, and if you still need it faster
> you
> >>>> do the primary key lookups in parallel as Jerry mentions.
> >>>>
> >>>> Matt
> >>>>
> >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi Chris:
> >>>>>
> >>>>> I'm thinking about building a secondary index for primary key
> lookup, then
> >>>>> query using the primary keys in parallel.
> >>>>>
> >>>>> I'm interested to see if there is other option too.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Jerry
> >>>>>
> >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> syrious3000@yahoo.de
> >>>>>> wrote:
> >>>>>
> >>>>>> Hello there,
> >>>>>>
> >>>>>> I designed a row key for queries that need best performance (~100
> ms)
> >>>>>> which looks like this:
> >>>>>>
> >>>>>> userId-date-sessionId
> >>>>>>
> >>>>>> These queries(scans) are always based on a userId and sometimes
> >>>>>> additionally on a date, too.
> >>>>>> That's no problem with the key above.
> >>>>>>
> >>>>>> However, another kind of queries shall be based on a given time
> range
> >>>>>> whereas the outermost left userId is not given or known.
> >>>>>> In this case I need to get all rows covering the given time range
> with
> >>>>>> their date to create a daily reporting.
> >>>>>>
> >>>>>> As I can't set wildcards at the beginning of a left-based index for
> the
> >>>>>> scan,
> >>>>>> I only see the possibility to scan the index of the whole table to
> >>>>> collect
> >>>>>> the
> >>>>>> rowKeys that are inside the timerange I'm interested in.
> >>>>>>
> >>>>>> Is there a more elegant way to collect rows within time range X?
> >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> that is
> >>>>>> stored by hbase automatically.)
> >>>>>>
> >>>>>> Could/should one maybe leverage some kind of row key caching to
> >>>>> accelerate
> >>>>>> the collection process?
> >>>>>> Is that covered by the block cache?
> >>>>>>
> >>>>>> Thanks in advance for any advice.
> >>>>>>
> >>>>>> regards
> >>>>>> Chris
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >>--
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >
> >
> >
> >--
> >
> >Alex Baranau
> >------
> >Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >
>



-- 
Thanks & Regards,
Anil Gupta

Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Nice. Thanks Alex for sharing your experiences with that custom filter implementation.


Currently I'm still using a row key filter with a substring comparator.
As soon as I've got a good amount of test data I will measure the performance of that naive substring filter in comparison to your fuzzy row filter.

regards,
Christian



________________________________
From: Alex Baranau <al...@gmail.com>
To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
Sent: 22:18 Thursday, 9 August 2012
Subject: Re: How to query by rowKey-infix


jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will add documentation to HBase book very soon [1]

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] https://issues.apache.org/jira/browse/HBASE-6526

On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <al...@gmail.com> wrote:

Good!
>
>
>Submitted initial patch of fuzzy row key filter at https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the filter class and include it in your code and use it in your setup as any other custom filter (no need to patch HBase).
>
>
>Please let me know if you try it out (or post your comments at HBASE-6509).
>
>
>Alex Baranau
>------
>Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
>
>On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <sy...@yahoo.de> wrote:
>
>Hi Alex,
>>
>>thanks a lot for the hint about setting the timestamp of the put.
>>I didn't know that this would be possible but that's solving the problem (first test was successful).
>>So I'm really glad that I don't need to apply a filter to extract the time and so on for every row.
>>
>>Nevertheless I would like to see your custom filter implementation.
>>Would be nice if you could provide it to help me get into it a bit.
>>
>>And yes that helped :)
>>
>>regards
>>Chris
>>
>>
>>
>>________________________________
>>From: Alex Baranau <al...@gmail.com>
>>To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
>>Sent: 0:57 Friday, 3 August 2012
>>
>>Subject: Re: How to query by rowKey-infix
>>
>>
>>Hi Christian!
>>If we set aside secondary indexes and assume you are going with "heavy scans", you can try the following two things to make it much faster. If this is appropriate to your situation, of course.
>>
>>1.
>>
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that is stored by hbase automatically.)
>>
>>Can you set timestamp of the Puts to the one you have in row key? Instead of relying on the one that HBase puts automatically (current ts). If you can, this will improve reading speed a lot by setting time range on scanner. Depending on how you are writing your data of course, but I assume that you mostly write data in "time-increasing" manner.
>>
>>
>>2.
>>
>>If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way in a Filter implementation to fast-forward to the record with a specific row key and by doing this skip many records. This might be used as follows:
>>* suppose your userId is 5 characters in length
>>* suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
>>* when you are scanning records and you encounter e.g. the key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell the scanner from your filter to fast-forward to the key "aaaab_2012-08-01", because you know that all remaining records of user "aaaaa" don't fall into the interval you need (as the time for its records will be >= 2012-08-09).
>>
>>As of now, I believe you will have to implement your custom filter to do that. Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
>>I believe I implemented a similar thing some time ago. If this idea works for you I could look for the implementation and share it if it helps. Or maybe even simply add it to the HBase codebase.
>>
>>Hope this helps,
>>
>>
>>Alex Baranau
>>------
>>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>
>>
>>
>>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de> wrote:
>>
>>
>>>
>>>Excuse my double posting.
>>>Here is the complete mail:
>>>
>>>
>>>
>>>OK,
>>>
>>>at first I will try the scans.
>>>
>>>If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>>>
>>>
>>>Currently I'm stuck at the scans because it requires two steps (therefore maybe some kind of filter chaining is required)
>>>
>>>
>>>The key:  userId-dateInMillis-sessionId
>>>
>>>
>>>At first I need to extract dateInMillis with regex or substring (using special delimiters for date)
>>>
>>>Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this:
>>>
>>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>>>
>>>How to chain that?
>>>Do I have to write a custom filter?
>>>(Would like to avoid that due to deployment)
>>>
>>>regards
>>>Chris
>>>
>>>
>>>----- Original Message -----
>>>From: Michael Segel <mi...@hotmail.com>
>>>To: user@hbase.apache.org
>>>CC:
>>>Sent: 13:52 Wednesday, 1 August 2012
>>>Subject: Re: How to query by rowKey-infix
>>>
>>>Actually w coprocessors you can create a secondary index in short order.
>>>Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
>>>
>>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>>
>>>> When deciding between a table scan vs secondary index, you should try to
>>>> estimate what percent of the underlying data blocks will be used in the
>>>> query.  By default, each block is 64KB.
>>>>
>>>> If each user's data is small and you are fitting multiple users per block,
>>>> then you're going to need all the blocks, so a tablescan is better because
>>>> it's simpler.  If each user has 1MB+ data then you will want to pick out
>>>> the individual blocks relevant to each date.  The secondary index will help
>>>> you go directly to those sparse blocks, but with a cost in complexity,
>>>> consistency, and extra denormalized data that knocks primary data out of
>>>> your block cache.
>>>>
>>>> If latency is not a concern, I would start with the table scan.  If that's
>>>> too slow you add the secondary index, and if you still need it faster you
>>>> do the primary key lookups in parallel as Jerry mentions.
>>>>
>>>> Matt
>>>>
>>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
>>>>
>>>>> Hi Chris:
>>>>>
>>>>> I'm thinking about building a secondary index for primary key lookup, then
>>>>> query using the primary keys in parallel.
>>>>>
>>>>> I'm interested to see if there is other option too.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>>>>> wrote:
>>>>>
>>>>>> Hello there,
>>>>>>
>>>>>> I designed a row key for queries that need best performance (~100 ms)
>>>>>> which looks like this:
>>>>>>
>>>>>> userId-date-sessionId
>>>>>>
>>>>>> These queries(scans) are always based on a userId and sometimes
>>>>>> additionally on a date, too.
>>>>>> That's no problem with the key above.
>>>>>>
>>>>>> However, another kind of queries shall be based on a given time range
>>>>>> whereas the outermost left userId is not given or known.
>>>>>> In this case I need to get all rows covering the given time range with
>>>>>> their date to create a daily reporting.
>>>>>>
>>>>>> As I can't set wildcards at the beginning of a left-based index for the
>>>>>> scan,
>>>>>> I only see the possibility to scan the index of the whole table to
>>>>> collect
>>>>>> the
>>>>>> rowKeys that are inside the timerange I'm interested in.
>>>>>>
>>>>>> Is there a more elegant way to collect rows within time range X?
>>>>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>>>>> stored by hbase automatically.)
>>>>>>
>>>>>> Could/should one maybe leverage some kind of row key caching to
>>>>> accelerate
>>>>>> the collection process?
>>>>>> Is that covered by the block cache?
>>>>>>
>>>>>> Thanks in advance for any advice.
>>>>>>
>>>>>> regards
>>>>>> Chris
>>>>>>
>>>>>
>>>
>>
>>
>>--
>>
>>Alex Baranau
>>------
>>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr 
>>
>
>
>
>-- 
>
>Alex Baranau
>------
>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>

Re: How to query by rowKey-infix

Posted by Alex Baranau <al...@gmail.com>.
jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will add
documentation to HBase book very soon [1]

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] https://issues.apache.org/jira/browse/HBASE-6526
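
In short, the usage pattern documented there looks like this (a sketch assuming this thread's fixed-length key layout userId(5) + "_" + "yyyy-MM-dd" + "_" + sessionId(13); for every byte position the mask says whether the byte must match (0) or may be anything (1)):

import java.util.Arrays;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyDayScan {
    // Match any userId and sessionId, but a fixed date of 2012-08-01.
    public static Scan build() {
        byte[] template = Bytes.toBytes("?????_2012-08-01_?????????????");
        byte[] mask = {
            1, 1, 1, 1, 1,                          // userId: any bytes
            0,                                      // "_" must match
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0,           // date must match
            0,                                      // "_" must match
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1   // sessionId: any bytes
        };
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
                Arrays.asList(new Pair<byte[], byte[]>(template, mask))));
        return scan;
    }
}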

On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <al...@gmail.com> wrote:

> Good!
>
> Submitted initial patch of fuzzy row key filter at
> https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> filter class and include it in your code and use it in your setup as any
> other custom filter (no need to patch HBase).
>
> Please let me know if you try it out (or post your comments at HBASE-6509).
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <sy...@yahoo.de> wrote:
>
>> Hi Alex,
>>
>> thanks a lot for the hint about setting the timestamp of the put.
>> I didn't know that this would be possible but that's solving the problem
>> (first test was successful).
>> So I'm really glad that I don't need to apply a filter to extract the
>> time and so on for every row.
>>
>> Nevertheless I would like to see your custom filter implementation.
>> Would be nice if you could provide it to help me get into it a bit.
>>
>> And yes that helped :)
>>
>> regards
>> Chris
>>
>>
>> ________________________________
>> From: Alex Baranau <al...@gmail.com>
>> To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
>> Sent: 0:57 Friday, 3 August 2012
>> Subject: Re: How to query by rowKey-infix
>>
>>
>> Hi Christian!
>> If we set aside secondary indexes and assume you are going with "heavy
>> scans", you can try the following two things to make it much faster. If this is
>> appropriate to your situation, of course.
>>
>> 1.
>>
>> > Is there a more elegant way to collect rows within time range X?
>> > (Unfortunately, the date attribute is not equal to the timestamp that
>> is stored by hbase automatically.)
>>
>> Can you set timestamp of the Puts to the one you have in row key? Instead
>> of relying on the one that HBase puts automatically (current ts). If you
>> can, this will improve reading speed a lot by setting time range on
>> scanner. Depending on how you are writing your data of course, but I assume
>> that you mostly write data in "time-increasing" manner.
>>
>>
>> 2.
>>
>> If your userId has a fixed length, or you can change it so that it has a
>> fixed length, then you can actually use something like a "wildcard" in the row key.
>> There's a way in a Filter implementation to fast-forward to the record with a
>> specific row key and by doing this skip many records. This might be used as
>> follows:
>> * suppose your userId is 5 characters in length
>> * suppose you are scanning for records with time between 2012-08-01
>> and 2012-08-08
>> * when you are scanning records and you encounter e.g. the key
>> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
>> the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
>> because you know that all remaining records of user "aaaaa" don't fall into
>> the interval you need (as the time for its records will be >= 2012-08-09).
>>
>> As of now, I believe you will have to implement your custom filter to do
>> that.
>> Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
>> I believe I implemented a similar thing some time ago. If this idea works
>> for you I could look for the implementation and share it if it helps. Or
>> maybe even simply add it to the HBase codebase.
>>
>> Hope this helps,
>>
>>
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
>> - Solr
>>
>>
>>
>> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de>
>> wrote:
>>
>>
>> >
>> >Excuse my double posting.
>> >Here is the complete mail:
>> >
>> >
>> >
>> >OK,
>> >
>> >at first I will try the scans.
>> >
>> >If that's too slow I will have to upgrade hbase (currently
>> 0.90.4-cdh3u2) to be able to use coprocessors.
>> >
>> >
>> >Currently I'm stuck at the scans because it requires two steps
>> (therefore maybe some kind of filter chaining is required)
>> >
>> >
>> >The key:  userId-dateInMillis-sessionId
>> >
>> >
>> >At first I need to extract dateInMillis with regex or substring (using
>> special delimiters for date)
>> >
>> >Second, the extracted value must be parsed to Long and set to a
>> RowFilter Comparator like this:
>> >
>> >scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
>> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>> >
>> >How to chain that?
>> >Do I have to write a custom filter?
>> >(Would like to avoid that due to deployment)
>> >
>> >regards
>> >Chris
>> >
>> >
>> >----- Original Message -----
>> >From: Michael Segel <mi...@hotmail.com>
>> >To: user@hbase.apache.org
>> >CC:
>> >Sent: 13:52 Wednesday, 1 August 2012
>> >Subject: Re: How to query by rowKey-infix
>> >
>> >Actually w coprocessors you can create a secondary index in short order.
>> >Then your cost is going to be 2 fetches. Trying to do a partial table
>> scan will be more expensive.
>> >
>> >On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
>> >
>> >> When deciding between a table scan vs secondary index, you should try
>> to
>> >> estimate what percent of the underlying data blocks will be used in the
>> >> query.  By default, each block is 64KB.
>> >>
>> >> If each user's data is small and you are fitting multiple users per
>> block,
>> >> then you're going to need all the blocks, so a tablescan is better
>> because
>> >> it's simpler.  If each user has 1MB+ data then you will want to pick
>> out
>> >> the individual blocks relevant to each date.  The secondary index will
>> help
>> >> you go directly to those sparse blocks, but with a cost in complexity,
>> >> consistency, and extra denormalized data that knocks primary data out
>> of
>> >> your block cache.
>> >>
>> >> If latency is not a concern, I would start with the table scan.  If
>> that's
>> >> too slow you add the secondary index, and if you still need it faster
>> you
>> >> do the primary key lookups in parallel as Jerry mentions.
>> >>
>> >> Matt
>> >>
>> >> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com>
>> wrote:
>> >>
>> >>> Hi Chris:
>> >>>
>> >>> I'm thinking about building a secondary index for primary key lookup,
>> then
>> >>> query using the primary keys in parallel.
>> >>>
>> >>> I'm interested to see if there is other option too.
>> >>>
>> >>> Best Regards,
>> >>>
>> >>> Jerry
>> >>>
>> >>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
>> syrious3000@yahoo.de
>> >>>> wrote:
>> >>>
>> >>>> Hello there,
>> >>>>
>> >>>> I designed a row key for queries that need best performance (~100 ms)
>> >>>> which looks like this:
>> >>>>
>> >>>> userId-date-sessionId
>> >>>>
>> >>>> These queries(scans) are always based on a userId and sometimes
>> >>>> additionally on a date, too.
>> >>>> That's no problem with the key above.
>> >>>>
>> >>>> However, another kind of queries shall be based on a given time range
>> >>>> whereas the outermost left userId is not given or known.
>> >>>> In this case I need to get all rows covering the given time range
>> with
>> >>>> their date to create a daily reporting.
>> >>>>
>> >>>> As I can't set wildcards at the beginning of a left-based index for
>> the
>> >>>> scan,
>> >>>> I only see the possibility to scan the index of the whole table to
>> >>> collect
>> >>>> the
>> >>>> rowKeys that are inside the timerange I'm interested in.
>> >>>>
>> >>>> Is there a more elegant way to collect rows within time range X?
>> >>>> (Unfortunately, the date attribute is not equal to the timestamp
>> that is
>> >>>> stored by hbase automatically.)
>> >>>>
>> >>>> Could/should one maybe leverage some kind of row key caching to
>> >>> accelerate
>> >>>> the collection process?
>> >>>> Is that covered by the block cache?
>> >>>>
>> >>>> Thanks in advance for any advice.
>> >>>>
>> >>>> regards
>> >>>> Chris
>> >>>>
>> >>>
>> >
>>
>>
>> --
>>
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
>> - Solr
>>
>
>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
>

Re: How to query by rowKey-infix

Posted by Alex Baranau <al...@gmail.com>.
Good!

Submitted initial patch of fuzzy row key filter at
https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
filter class and include it in your code and use it in your setup as any
other custom filter (no need to patch HBase).

Please let me know if you try it out (or post your comments at HBASE-6509).
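
Stripped down to its core mechanism, such a fast-forwarding filter looks roughly like this on the 0.92-era filter API (a sketch only; the class name and the key layout userId(5) + "_" + "yyyy-MM-dd" + "_" + sessionId are made up, and the real filter in the patch is more general):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeWindowSkipFilter extends FilterBase {
    private static final int USER_LEN = 5;
    private static final int DATE_LEN = 10;

    private byte[] startDate; // inclusive, e.g. Bytes.toBytes("2012-08-01")
    private byte[] endDate;   // exclusive, e.g. Bytes.toBytes("2012-08-08")
    private byte[] nextRowHint;

    public TimeWindowSkipFilter() {} // required for Writable deserialization

    public TimeWindowSkipFilter(byte[] startDate, byte[] endDate) {
        this.startDate = startDate;
        this.endDate = endDate;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        byte[] row = kv.getRow();
        byte[] user = Arrays.copyOfRange(row, 0, USER_LEN);
        byte[] date = Arrays.copyOfRange(row, USER_LEN + 1, USER_LEN + 1 + DATE_LEN);

        if (Bytes.compareTo(date, startDate) < 0) {
            // Too early for this user: jump ahead to its first possible match.
            nextRowHint = Bytes.add(user, Bytes.toBytes("_"), startDate);
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        if (Bytes.compareTo(date, endDate) >= 0) {
            // Past the window: all later rows of this user are past it too,
            // so jump to the next userId (naive increment, overflow ignored).
            user[USER_LEN - 1]++;
            nextRowHint = Bytes.add(user, Bytes.toBytes("_"), startDate);
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        return ReturnCode.INCLUDE;
    }

    @Override
    public KeyValue getNextKeyHint(KeyValue currentKV) {
        return KeyValue.createFirstOnRow(nextRowHint);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        Bytes.writeByteArray(out, startDate);
        Bytes.writeByteArray(out, endDate);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        startDate = Bytes.readByteArray(in);
        endDate = Bytes.readByteArray(in);
    }
}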

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <sy...@yahoo.de> wrote:

> Hi Alex,
>
> thanks a lot for the hint about setting the timestamp of the put.
> I didn't know that this would be possible but that's solving the problem
> (first test was successful).
> So I'm really glad that I don't need to apply a filter to extract the time
> and so on for every row.
>
> Nevertheless I would like to see your custom filter implementation.
> Would be nice if you could provide it to help me get into it a bit.
>
> And yes that helped :)
>
> regards
> Chris
>
>
> ________________________________
> From: Alex Baranau <al...@gmail.com>
> To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
> Sent: 0:57 Friday, 3 August 2012
> Subject: Re: How to query by rowKey-infix
>
>
> Hi Christian!
> If we set aside secondary indexes and assume you are going with "heavy
> scans", you can try the following two things to make it much faster. If this is
> appropriate to your situation, of course.
>
> 1.
>
> > Is there a more elegant way to collect rows within time range X?
> > (Unfortunately, the date attribute is not equal to the timestamp that is
> stored by hbase automatically.)
>
> Can you set timestamp of the Puts to the one you have in row key? Instead
> of relying on the one that HBase puts automatically (current ts). If you
> can, this will improve reading speed a lot by setting time range on
> scanner. Depending on how you are writing your data of course, but I assume
> that you mostly write data in "time-increasing" manner.
>
>
> 2.
>
> If your userId has a fixed length, or you can change it so that it has a fixed
> length, then you can actually use something like a "wildcard" in the row key. There's
> a way in a Filter implementation to fast-forward to the record with a specific
> row key and by doing this skip many records. This might be used as follows:
> * suppose your userId is 5 characters in length
> * suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> * when you are scanning records and you encounter e.g. the key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
> the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" don't fall into
> the interval you need (as the time for its records will be >= 2012-08-09).
>
> As of now, I believe you will have to implement your custom filter to do
> that.
> Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> I believe I implemented a similar thing some time ago. If this idea works
> for you I could look for the implementation and share it if it helps. Or
> maybe even simply add it to the HBase codebase.
>
> Hope this helps,
>
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
>
>
> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de>
> wrote:
>
>
> >
> >Excuse my double posting.
> >Here is the complete mail:
> >
> >
> >
> >OK,
> >
> >at first I will try the scans.
> >
> >If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> to be able to use coprocessors.
> >
> >
> >Currently I'm stuck at the scans because it requires two steps (therefore
> maybe some kind of filter chaining is required)
> >
> >
> >The key:  userId-dateInMillis-sessionId
> >
> >
> >At first I need to extract dateInMillis with regex or substring (using
> special delimiters for date)
> >
> >Second, the extracted value must be parsed to Long and set to a RowFilter
> Comparator like this:
> >
> >scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >
> >How to chain that?
> >Do I have to write a custom filter?
> >(Would like to avoid that due to deployment)
> >
> >regards
> >Chris
> >
> >
> >----- Original Message -----
> >From: Michael Segel <mi...@hotmail.com>
> >To: user@hbase.apache.org
> >CC:
> >Sent: 13:52 Wednesday, 1 August 2012
> >Subject: Re: How to query by rowKey-infix
> >
> >Actually w coprocessors you can create a secondary index in short order.
> >Then your cost is going to be 2 fetches. Trying to do a partial table
> scan will be more expensive.
> >
> >On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> >> When deciding between a table scan vs secondary index, you should try to
> >> estimate what percent of the underlying data blocks will be used in the
> >> query.  By default, each block is 64KB.
> >>
> >> If each user's data is small and you are fitting multiple users per
> block,
> >> then you're going to need all the blocks, so a tablescan is better
> because
> >> it's simpler.  If each user has 1MB+ data then you will want to pick out
> >> the individual blocks relevant to each date.  The secondary index will
> help
> >> you go directly to those sparse blocks, but with a cost in complexity,
> >> consistency, and extra denormalized data that knocks primary data out of
> >> your block cache.
> >>
> >> If latency is not a concern, I would start with the table scan.  If
> that's
> >> too slow you add the secondary index, and if you still need it faster
> you
> >> do the primary key lookups in parallel as Jerry mentions.
> >>
> >> Matt
> >>
> >> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com>
> wrote:
> >>
> >>> Hi Chris:
> >>>
> >>> I'm thinking about building a secondary index for primary key lookup,
> then
> >>> query using the primary keys in parallel.
> >>>
> >>> I'm interested to see if there is other option too.
> >>>
> >>> Best Regards,
> >>>
> >>> Jerry
> >>>
> >>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> syrious3000@yahoo.de
> >>>> wrote:
> >>>
> >>>> Hello there,
> >>>>
> >>>> I designed a row key for queries that need best performance (~100 ms)
> >>>> which looks like this:
> >>>>
> >>>> userId-date-sessionId
> >>>>
> >>>> These queries(scans) are always based on a userId and sometimes
> >>>> additionally on a date, too.
> >>>> That's no problem with the key above.
> >>>>
> >>>> However, another kind of queries shall be based on a given time range
> >>>> whereas the outermost left userId is not given or known.
> >>>> In this case I need to get all rows covering the given time range with
> >>>> their date to create a daily reporting.
> >>>>
> >>>> As I can't set wildcards at the beginning of a left-based index for
> the
> >>>> scan,
> >>>> I only see the possibility to scan the index of the whole table to
> >>> collect
> >>>> the
> >>>> rowKeys that are inside the timerange I'm interested in.
> >>>>
> >>>> Is there a more elegant way to collect rows within time range X?
> >>>> (Unfortunately, the date attribute is not equal to the timestamp that
> is
> >>>> stored by hbase automatically.)
> >>>>
> >>>> Could/should one maybe leverage some kind of row key caching to
> >>> accelerate
> >>>> the collection process?
> >>>> Is that covered by the block cache?
> >>>>
> >>>> Thanks in advance for any advice.
> >>>>
> >>>> regards
> >>>> Chris
> >>>>
> >>>
> >
>
>
> --
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Hi,

shouldn't all people who are using HBase for time series data have exactly the same problem when trying to get a time-related subset of their data?

So is setting the put timestamp "manually" THE method of choice?

It works for me, but I would also be interested in alternative approaches that are used for efficient time-related scans (apart from coprocessors & full table scans).
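
For reference, setting it "manually" just means passing the event time into the Put instead of letting the region server use the current time. A minimal sketch (the class, family, and qualifier names are made up):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampedWrite {
    // Write the cell with the row key's dateInMillis as its timestamp,
    // so that later scans can narrow things down via setTimeRange().
    public static void write(HTable table, byte[] rowKey, long dateInMillis,
                             byte[] value) throws IOException {
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), dateInMillis, value);
        table.put(put);
    }
}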




----- Original Message -----
From: Christian Schäfer <sy...@yahoo.de>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
CC:
Sent: 11:23 Friday, 3 August 2012
Subject: Re: How to query by rowKey-infix

Hi Alex,

thanks a lot for the hint about setting the timestamp of the put.
I didn't know that this would be possible but that's solving the problem (first test was successful).
So I'm really glad that I don't need to apply a filter to extract the time and so on for every row.

Nevertheless I would like to see your custom filter implementation.
Would be nice if you could provide it to help me get into it a bit.

And yes that helped :)

regards
Chris


________________________________
From: Alex Baranau <al...@gmail.com>
To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
Sent: 0:57 Friday, 3 August 2012
Subject: Re: How to query by rowKey-infix


Hi Christian!
If we set aside secondary indexes and assume you are going with "heavy scans", you can try the following two things to make it much faster. If this is appropriate to your situation, of course.

1.

> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is stored by hbase automatically.)

Can you set timestamp of the Puts to the one you have in row key? Instead of relying on the one that HBase puts automatically (current ts). If you can, this will improve reading speed a lot by setting time range on scanner. Depending on how you are writing your data of course, but I assume that you mostly write data in "time-increasing" manner.


2.

If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way in a Filter implementation to fast-forward to the record with a specific row key and by doing this skip many records. This might be used as follows:
* suppose your userId is 5 characters in length
* suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
* when you are scanning records and you encounter e.g. the key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell the scanner from your filter to fast-forward to the key "aaaab_2012-08-01", because you know that all remaining records of user "aaaaa" don't fall into the interval you need (as the time for its records will be >= 2012-08-09).

As of now, I believe you will have to implement your custom filter to do that. Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
I believe I implemented a similar thing some time ago. If this idea works for you I could look for the implementation and share it if it helps. Or maybe even simply add it to the HBase codebase.

Hope this helps,


Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr



On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de> wrote:


>
>Excuse my double posting.
>Here is the complete mail:
>
>
>
>OK,
>
>at first I will try the scans.
>
>If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
>
>Currently I'm stuck at the scans because it requires two steps (therefore maybe some kind of filter chaining is required)
>
>
>The key:  userId-dateInMillis-sessionId
>
>
>At first I need to extract dateInMillis with regex or substring (using special delimiters for date)
>
>Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this:
>
>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
>How to chain that?
>Do I have to write a custom filter?
>(Would like to avoid that due to deployment)
>
>regards
>Chris
>
>
>----- Original Message -----
>From: Michael Segel <mi...@hotmail.com>
>To: user@hbase.apache.org
>CC:
>Sent: 13:52 Wednesday, 1 August 2012
>Subject: Re: How to query by rowKey-infix
>
>Actually w coprocessors you can create a secondary index in short order.
>Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
>
>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> When deciding between a table scan vs secondary index, you should try to
>> estimate what percent of the underlying data blocks will be used in the
>> query.  By default, each block is 64KB.
>>
>> If each user's data is small and you are fitting multiple users per block,
>> then you're going to need all the blocks, so a tablescan is better because
>> it's simpler.  If each user has 1MB+ data then you will want to pick out
>> the individual blocks relevant to each date.  The secondary index will help
>> you go directly to those sparse blocks, but with a cost in complexity,
>> consistency, and extra denormalized data that knocks primary data out of
>> your block cache.
>>
>> If latency is not a concern, I would start with the table scan.  If that's
>> too slow you add the secondary index, and if you still need it faster you
>> do the primary key lookups in parallel as Jerry mentions.
>>
>> Matt
>>
>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
>>
>>> Hi Chris:
>>>
>>> I'm thinking about building a secondary index for primary key lookup, then
>>> query using the primary keys in parallel.
>>>
>>> I'm interested to see if there is other option too.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>>> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I designed a row key for queries that need best performance (~100 ms)
>>>> which looks like this:
>>>>
>>>> userId-date-sessionId
>>>>
>>>> These queries(scans) are always based on a userId and sometimes
>>>> additionally on a date, too.
>>>> That's no problem with the key above.
>>>>
>>>> However, another kind of queries shall be based on a given time range
>>>> whereas the outermost left userId is not given or known.
>>>> In this case I need to get all rows covering the given time range with
>>>> their date to create a daily reporting.
>>>>
>>>> As I can't set wildcards at the beginning of a left-based index for the
>>>> scan,
>>>> I only see the possibility to scan the index of the whole table to
>>> collect
>>>> the
>>>> rowKeys that are inside the timerange I'm interested in.
>>>>
>>>> Is there a more elegant way to collect rows within time range X?
>>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>>> stored by hbase automatically.)
>>>>
>>>> Could/should one maybe leverage some kind of row key caching to
>>> accelerate
>>>> the collection process?
>>>> Is that covered by the block cache?
>>>>
>>>> Thanks in advance for any advice.
>>>>
>>>> regards
>>>> Chris
>>>>
>>>
>


-- 

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr 


Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Hi Alex,

thanks a lot for the hint about setting the timestamp of the put.
I didn't know that this would be possible but that's solving the problem (first test was successful).
So I'm really glad that I don't need to apply a filter to extract the time and so on for every row.

Nevertheless I would like to see your custom filter implementation.
Would be nice if you could provide it to help me get into it a bit.

And yes that helped :)

regards
Chris


________________________________
From: Alex Baranau <al...@gmail.com>
To: user@hbase.apache.org; Christian Schäfer <sy...@yahoo.de>
Sent: 0:57 Friday, 3 August 2012
Subject: Re: How to query by rowKey-infix


Hi Christian!
If we set aside secondary indexes and assume you are going with "heavy scans", you can try the following two things to make it much faster. If this is appropriate to your situation, of course.

1.

> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is stored by hbase automatically.)

Can you set timestamp of the Puts to the one you have in row key? Instead of relying on the one that HBase puts automatically (current ts). If you can, this will improve reading speed a lot by setting time range on scanner. Depending on how you are writing your data of course, but I assume that you mostly write data in "time-increasing" manner.


2.

If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way in a Filter implementation to fast-forward to the record with a specific row key and by doing this skip many records. This might be used as follows:
* suppose your userId is 5 characters in length
* suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
* when you scanning records and you face e.g. key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is user id, you can tell the scanner from your filter to fast-forward to key "aaaab_ 2012-08-01". Because you know that all remained records of user "aaaaa" don't fall into the interval you need (as the time for its records will be >= 2012-08-09).

As of now, I believe you will have to implement your custom filter to do that. Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
I believe I implemented a similar thing some time ago. If this idea works for you, I could look for the implementation and share it if it helps. Or maybe even simply add it to the HBase codebase.

Hope this helps,


Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr



On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de> wrote:


>
>Excuse my double posting.
>Here is the complete mail:
>
>
>
>OK,
>
>at first I will try the scans.
>
>If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
>
>Currently I'm stuck at the scans because it requires two steps (therefore maybe some kind of filter chaining is required)
>
>
>The key:  userId-dateInMillis-sessionId
>
>
>At first I need to extract dateInMllis with regex or substring (using special delimiters for date)
>
>Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this:
>
>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
>How to chain that?
>Do I have to write a custom filter?
>(Would like to avoid that due to deployment)
>
>regards
>Chris
>
>
>----- Ursprüngliche Message -----
>Von: Michael Segel <mi...@hotmail.com>
>An: user@hbase.apache.org
>CC:
>Gesendet: 13:52 Mittwoch, 1.August 2012
>Betreff: Re: How to query by rowKey-infix
>
>Actually w coprocessors you can create a secondary index in short order.
>Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
>
>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> When deciding between a table scan vs secondary index, you should try to
>> estimate what percent of the underlying data blocks will be used in the
>> query.  By default, each block is 64KB.
>>
>> If each user's data is small and you are fitting multiple users per block,
>> then you're going to need all the blocks, so a tablescan is better because
>> it's simpler.  If each user has 1MB+ data then you will want to pick out
>> the individual blocks relevant to each date.  The secondary index will help
>> you go directly to those sparse blocks, but with a cost in complexity,
>> consistency, and extra denormalized data that knocks primary data out of
>> your block cache.
>>
>> If latency is not a concern, I would start with the table scan.  If that's
>> too slow you add the secondary index, and if you still need it faster you
>> do the primary key lookups in parallel as Jerry mentions.
>>
>> Matt
>>
>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
>>
>>> Hi Chris:
>>>
>>> I'm thinking about building a secondary index for primary key lookup, then
>>> query using the primary keys in parallel.
>>>
>>> I'm interested to see if there is other option too.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>>> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I designed a row key for queries that need best performance (~100 ms)
>>>> which looks like this:
>>>>
>>>> userId-date-sessionId
>>>>
>>>> These queries(scans) are always based on a userId and sometimes
>>>> additionally on a date, too.
>>>> That's no problem with the key above.
>>>>
>>>> However, another kind of queries shall be based on a given time range
>>>> whereas the outermost left userId is not given or known.
>>>> In this case I need to get all rows covering the given time range with
>>>> their date to create a daily reporting.
>>>>
>>>> As I can't set wildcards at the beginning of a left-based index for the
>>>> scan,
>>>> I only see the possibility to scan the index of the whole table to
>>> collect
>>>> the
>>>> rowKeys that are inside the timerange I'm interested in.
>>>>
>>>> Is there a more elegant way to collect rows within time range X?
>>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>>> stored by hbase automatically.)
>>>>
>>>> Could/should one maybe leverage some kind of row key caching to
>>> accelerate
>>>> the collection process?
>>>> Is that covered by the block cache?
>>>>
>>>> Thanks in advance for any advice.
>>>>
>>>> regards
>>>> Chris
>>>>
>>>
>


-- 

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr 

Re: How to query by rowKey-infix

Posted by Alex Baranau <al...@gmail.com>.
Hi Christian!

If we put secondary indexes aside and assume you are going with "heavy
scans", you can try the two following things to make them much faster. If this
is appropriate to your situation, of course.

1.

> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is
stored by hbase automatically.)

Can you set the timestamp of the Puts to the one you have in the row key,
instead of relying on the one that HBase sets automatically (the current ts)?
If you can, this will improve reading speed a lot by letting you set a time
range on the scanner. Depending on how you are writing your data, of course,
but I assume that you mostly write data in a "time-increasing" manner.
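
To illustrate, here is a minimal sketch against the old (0.90-style) client
API; the table name, column family/qualifier and example values are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "sessions");

    // write path: reuse the event time from the row key as the cell timestamp
    String userId = "aaaaa";
    long dateInMillis = 1343779200000L; // 2012-08-01
    String sessionId = "3jh345j345kjh";
    Put put = new Put(Bytes.toBytes(userId + "-" + dateInMillis + "-" + sessionId));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), dateInMillis,
        Bytes.toBytes("some session data"));
    table.put(put);

    // read path: with cell timestamps equal to event time, HBase can skip
    // whole store files whose time range falls outside the query window
    Scan scan = new Scan();
    scan.setTimeRange(dateInMillis, dateInMillis + 7 * 24 * 3600 * 1000L);
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // build the daily report from r here
    }
    scanner.close();
    table.close();
  }
}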

2.

If your userId has fixed length, or you can change it so that it has fixed
length, then you can actually use something like a "wildcard" in the row key.
There's a way in a Filter implementation to fast-forward to the record with a
specific row key and thereby skip many records. This might be used as follows:
* suppose your userId is 5 characters in length
* suppose you are scanning for records with time between 2012-08-01
and 2012-08-08
* when you are scanning records and you encounter e.g. the key
"aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
because you know that all remaining records of user "aaaaa" don't fall into
the interval you need (as the time for its records will be >= 2012-08-09).

As of now, I believe you will have to implement your custom filter to do
that.
Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
I believe I implemented a similar thing some time ago. If this idea works for
you, I could look for the implementation and share it if it helps. Or maybe
even simply add it to the HBase codebase.
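
For the record, here is a rough, untested sketch of such a filter against the
0.92-era filter API. It assumes a fixed 5-character userId, a 1-byte delimiter
and fixed-width dates in the key; the Writable methods at the bottom are
needed so the filter can be shipped to the region servers, and the jar has to
be deployed there:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeInfixFilter extends FilterBase {
  private static final int USER_LEN = 5; // assumption: fixed-length userId
  private byte[] startDate;              // inclusive, e.g. "2012-08-01"
  private byte[] endDate;                // exclusive, e.g. "2012-08-09"
  private byte[] hint;

  public TimeRangeInfixFilter() {}       // required for deserialization

  public TimeRangeInfixFilter(byte[] startDate, byte[] endDate) {
    this.startDate = startDate;
    this.endDate = endDate;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    byte[] row = kv.getRow();
    byte[] userAndDelim = Arrays.copyOfRange(row, 0, USER_LEN + 1);
    byte[] date = Arrays.copyOfRange(row, USER_LEN + 1,
        USER_LEN + 1 + startDate.length);
    if (Bytes.compareTo(date, startDate) < 0) {
      // before the range: jump to this user's first key inside the range
      hint = Bytes.add(userAndDelim, startDate);
      return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    if (Bytes.compareTo(date, endDate) >= 0) {
      // past the range for this user: jump to the next userId
      userAndDelim[USER_LEN - 1]++; // naive increment, ignores overflow
      hint = Bytes.add(userAndDelim, startDate);
      return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    return ReturnCode.INCLUDE;
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    return KeyValue.createFirstOnRow(hint);
  }

  public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, startDate);
    Bytes.writeByteArray(out, endDate);
  }

  public void readFields(DataInput in) throws IOException {
    startDate = Bytes.readByteArray(in);
    endDate = Bytes.readByteArray(in);
  }
}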

Hope this helps,

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr


On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <sy...@yahoo.de> wrote:

>
>
> Excuse my double posting.
> Here is the complete mail:
>
>
> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> to be able to use coprocessors.
>
>
> Currently I'm stuck at the scans because it requires two steps (therefore
> maybe some kind of filter chaining is required)
>
>
> The key:  userId-dateInMillis-sessionId
>
> At first I need to extract dateInMllis with regex or substring (using
> special delimiters for date)
>
> Second, the extracted value must be parsed to Long and set to a RowFilter
> Comparator like this:
>
> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
> How to chain that?
> Do I have to write a custom filter?
> (Would like to avoid that due to deployment)
>
> regards
> Chris
>
> ----- Ursprüngliche Message -----
> Von: Michael Segel <mi...@hotmail.com>
> An: user@hbase.apache.org
> CC:
> Gesendet: 13:52 Mittwoch, 1.August 2012
> Betreff: Re: How to query by rowKey-infix
>
> Actually w coprocessors you can create a secondary index in short order.
> Then your cost is going to be 2 fetches. Trying to do a partial table scan
> will be more expensive.
>
> On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > When deciding between a table scan vs secondary index, you should try to
> > estimate what percent of the underlying data blocks will be used in the
> > query.  By default, each block is 64KB.
> >
> > If each user's data is small and you are fitting multiple users per
> block,
> > then you're going to need all the blocks, so a tablescan is better
> because
> > it's simpler.  If each user has 1MB+ data then you will want to pick out
> > the individual blocks relevant to each date.  The secondary index will
> help
> > you go directly to those sparse blocks, but with a cost in complexity,
> > consistency, and extra denormalized data that knocks primary data out of
> > your block cache.
> >
> > If latency is not a concern, I would start with the table scan.  If
> that's
> > too slow you add the secondary index, and if you still need it faster you
> > do the primary key lookups in parallel as Jerry mentions.
> >
> > Matt
> >
> > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com>
> wrote:
> >
> >> Hi Chris:
> >>
> >> I'm thinking about building a secondary index for primary key lookup,
> then
> >> query using the primary keys in parallel.
> >>
> >> I'm interested to see if there is other option too.
> >>
> >> Best Regards,
> >>
> >> Jerry
> >>
> >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> syrious3000@yahoo.de
> >>> wrote:
> >>
> >>> Hello there,
> >>>
> >>> I designed a row key for queries that need best performance (~100 ms)
> >>> which looks like this:
> >>>
> >>> userId-date-sessionId
> >>>
> >>> These queries(scans) are always based on a userId and sometimes
> >>> additionally on a date, too.
> >>> That's no problem with the key above.
> >>>
> >>> However, another kind of queries shall be based on a given time range
> >>> whereas the outermost left userId is not given or known.
> >>> In this case I need to get all rows covering the given time range with
> >>> their date to create a daily reporting.
> >>>
> >>> As I can't set wildcards at the beginning of a left-based index for the
> >>> scan,
> >>> I only see the possibility to scan the index of the whole table to
> >> collect
> >>> the
> >>> rowKeys that are inside the timerange I'm interested in.
> >>>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> is
> >>> stored by hbase automatically.)
> >>>
> >>> Could/should one maybe leverage some kind of row key caching to
> >> accelerate
> >>> the collection process?
> >>> Is that covered by the block cache?
> >>>
> >>> Thanks in advance for any advice.
> >>>
> >>> regards
> >>> Chris
> >>>
> >>
>



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

WG: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.

Excuse my double posting.
Here is the complete mail:


OK,

First, I will try the scans.

If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.


Currently I'm stuck at the scans because they require two steps (therefore maybe some kind of filter chaining is required)


The key:  userId-dateInMillis-sessionId

First I need to extract dateInMillis with a regex or substring (using special delimiters for the date)

Second, the extracted value must be parsed to a Long and passed to a RowFilter comparator like this:

scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));

How do I chain that?
Do I have to write a custom filter?
(I would like to avoid that, due to the deployment effort.)

regards
Chris

----- Original Message -----
From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org
CC: 
Sent: 13:52 Wednesday, 1 August 2012
Subject: Re: How to query by rowKey-infix

Actually, with coprocessors you can create a secondary index in short order. 
Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive. 

On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:

> When deciding between a table scan vs secondary index, you should try to
> estimate what percent of the underlying data blocks will be used in the
> query.  By default, each block is 64KB.
> 
> If each user's data is small and you are fitting multiple users per block,
> then you're going to need all the blocks, so a tablescan is better because
> it's simpler.  If each user has 1MB+ data then you will want to pick out
> the individual blocks relevant to each date.  The secondary index will help
> you go directly to those sparse blocks, but with a cost in complexity,
> consistency, and extra denormalized data that knocks primary data out of
> your block cache.
> 
> If latency is not a concern, I would start with the table scan.  If that's
> too slow you add the secondary index, and if you still need it faster you
> do the primary key lookups in parallel as Jerry mentions.
> 
> Matt
> 
> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
> 
>> Hi Chris:
>> 
>> I'm thinking about building a secondary index for primary key lookup, then
>> query using the primary keys in parallel.
>> 
>> I'm interested to see if there is other option too.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>> wrote:
>> 
>>> Hello there,
>>> 
>>> I designed a row key for queries that need best performance (~100 ms)
>>> which looks like this:
>>> 
>>> userId-date-sessionId
>>> 
>>> These queries(scans) are always based on a userId and sometimes
>>> additionally on a date, too.
>>> That's no problem with the key above.
>>> 
>>> However, another kind of queries shall be based on a given time range
>>> whereas the outermost left userId is not given or known.
>>> In this case I need to get all rows covering the given time range with
>>> their date to create a daily reporting.
>>> 
>>> As I can't set wildcards at the beginning of a left-based index for the
>>> scan,
>>> I only see the possibility to scan the index of the whole table to
>> collect
>>> the
>>> rowKeys that are inside the timerange I'm interested in.
>>> 
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>> stored by hbase automatically.)
>>> 
>>> Could/should one maybe leverage some kind of row key caching to
>> accelerate
>>> the collection process?
>>> Is that covered by the block cache?
>>> 
>>> Thanks in advance for any advice.
>>> 
>>> regards
>>> Chris
>>> 
>>

Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
OK,

First, I will try the scans.

If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.

Currently I'm stuck at the scans because they require two steps (therefore some kind of filter chaining)

The key:  userId-dateInMillis-sessionId

First I need to extract dateInMillis with a regex or substring (using special delimiters for the date)

Second, the extracted value must be parsed to a Long and passed to a RowFilter comparator like this:





----- Original Message -----
From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org
CC: 
Sent: 13:52 Wednesday, 1 August 2012
Subject: Re: How to query by rowKey-infix

Actually, with coprocessors you can create a secondary index in short order. 
Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive. 

On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:

> When deciding between a table scan vs secondary index, you should try to
> estimate what percent of the underlying data blocks will be used in the
> query.  By default, each block is 64KB.
> 
> If each user's data is small and you are fitting multiple users per block,
> then you're going to need all the blocks, so a tablescan is better because
> it's simpler.  If each user has 1MB+ data then you will want to pick out
> the individual blocks relevant to each date.  The secondary index will help
> you go directly to those sparse blocks, but with a cost in complexity,
> consistency, and extra denormalized data that knocks primary data out of
> your block cache.
> 
> If latency is not a concern, I would start with the table scan.  If that's
> too slow you add the secondary index, and if you still need it faster you
> do the primary key lookups in parallel as Jerry mentions.
> 
> Matt
> 
> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
> 
>> Hi Chris:
>> 
>> I'm thinking about building a secondary index for primary key lookup, then
>> query using the primary keys in parallel.
>> 
>> I'm interested to see if there is other option too.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>> wrote:
>> 
>>> Hello there,
>>> 
>>> I designed a row key for queries that need best performance (~100 ms)
>>> which looks like this:
>>> 
>>> userId-date-sessionId
>>> 
>>> These queries(scans) are always based on a userId and sometimes
>>> additionally on a date, too.
>>> That's no problem with the key above.
>>> 
>>> However, another kind of queries shall be based on a given time range
>>> whereas the outermost left userId is not given or known.
>>> In this case I need to get all rows covering the given time range with
>>> their date to create a daily reporting.
>>> 
>>> As I can't set wildcards at the beginning of a left-based index for the
>>> scan,
>>> I only see the possibility to scan the index of the whole table to
>> collect
>>> the
>>> rowKeys that are inside the timerange I'm interested in.
>>> 
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>> stored by hbase automatically.)
>>> 
>>> Could/should one maybe leverage some kind of row key caching to
>> accelerate
>>> the collection process?
>>> Is that covered by the block cache?
>>> 
>>> Thanks in advance for any advice.
>>> 
>>> regards
>>> Chris
>>> 
>> 

Re: How to query by rowKey-infix

Posted by Michael Segel <mi...@hotmail.com>.
Actually, with coprocessors you can create a secondary index in short order. 
Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive. 
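
A bare sketch of the write path with a RegionObserver (0.92+ coprocessor API;
the index table, family and qualifier names here are made up, the key parsing
assumes string keys with no '-' inside the userId, and error handling is
omitted):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// mirrors every Put into an index table keyed date-first, so that
// time-range queries become cheap prefix scans on "session_index"
public class DateIndexObserver extends BaseRegionObserver {
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    // primary key layout assumed: userId-dateInMillis-sessionId
    String[] parts = Bytes.toString(put.getRow()).split("-", 3);
    byte[] indexKey = Bytes.toBytes(parts[1] + "-" + parts[0] + "-" + parts[2]);
    HTableInterface index =
        ctx.getEnvironment().getTable(Bytes.toBytes("session_index"));
    try {
      Put indexPut = new Put(indexKey);
      // store the primary row key as the cell value for the second fetch
      indexPut.add(Bytes.toBytes("i"), Bytes.toBytes("k"), put.getRow());
      index.put(indexPut);
    } finally {
      index.close();
    }
  }
}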

On Jul 31, 2012, at 12:41 PM, Matt Corgan <mc...@hotpads.com> wrote:

> When deciding between a table scan vs secondary index, you should try to
> estimate what percent of the underlying data blocks will be used in the
> query.  By default, each block is 64KB.
> 
> If each user's data is small and you are fitting multiple users per block,
> then you're going to need all the blocks, so a tablescan is better because
> it's simpler.  If each user has 1MB+ data then you will want to pick out
> the individual blocks relevant to each date.  The secondary index will help
> you go directly to those sparse blocks, but with a cost in complexity,
> consistency, and extra denormalized data that knocks primary data out of
> your block cache.
> 
> If latency is not a concern, I would start with the table scan.  If that's
> too slow you add the secondary index, and if you still need it faster you
> do the primary key lookups in parallel as Jerry mentions.
> 
> Matt
> 
> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:
> 
>> Hi Chris:
>> 
>> I'm thinking about building a secondary index for primary key lookup, then
>> query using the primary keys in parallel.
>> 
>> I'm interested to see if there is other option too.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>> wrote:
>> 
>>> Hello there,
>>> 
>>> I designed a row key for queries that need best performance (~100 ms)
>>> which looks like this:
>>> 
>>> userId-date-sessionId
>>> 
>>> These queries(scans) are always based on a userId and sometimes
>>> additionally on a date, too.
>>> That's no problem with the key above.
>>> 
>>> However, another kind of queries shall be based on a given time range
>>> whereas the outermost left userId is not given or known.
>>> In this case I need to get all rows covering the given time range with
>>> their date to create a daily reporting.
>>> 
>>> As I can't set wildcards at the beginning of a left-based index for the
>>> scan,
>>> I only see the possibility to scan the index of the whole table to
>> collect
>>> the
>>> rowKeys that are inside the timerange I'm interested in.
>>> 
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>> stored by hbase automatically.)
>>> 
>>> Could/should one maybe leverage some kind of row key caching to
>> accelerate
>>> the collection process?
>>> Is that covered by the block cache?
>>> 
>>> Thanks in advance for any advice.
>>> 
>>> regards
>>> Chris
>>> 
>> 


Re: How to query by rowKey-infix

Posted by Christian Schäfer <sy...@yahoo.de>.
Thanks Matt & Jerry for your replies.

The data for each row is small (a few hundred bytes).

So I will try the parallel table scan first, as you suggested...
Before organizing that myself, wouldn't it be a better idea to create a MapReduce job for that?
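
Something like the following is what I have in mind (just a sketch; the table
and column names are invented, and the raw millis would still need to be
bucketed per day for the real report):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyReportJob {
  static class ReportMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // row key layout: userId-dateInMillis-sessionId
      String[] parts = Bytes.toString(row.get()).split("-", 3);
      // emit one count per event, keyed by its time (bucket to days as needed)
      ctx.write(new Text(parts[1]), new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-report");
    job.setJarByClass(DailyReportJob.class);
    Scan scan = new Scan();
    scan.setCaching(500);       // fetch bigger batches for the full scan
    scan.setCacheBlocks(false); // don't churn the block cache from MR
    TableMapReduceUtil.initTableMapperJob("sessions", scan,
        ReportMapper.class, Text.class, LongWritable.class, job);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}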

I'm not so keen on implementing secondary indices, especially due to the mentioned consistency concerns.
Unfortunately, projects like ithbase and ihbase no longer support current HBase, and secondary indexes via coprocessors don't seem to be there yet.
If I'm wrong, feel free to correct me :)

regards,
Chris



----- Original Message -----
From: Matt Corgan <mc...@hotpads.com>
To: user@hbase.apache.org
CC: Christian Schäfer <sy...@yahoo.de>
Sent: 19:41 Tuesday, 31 July 2012
Subject: Re: How to query by rowKey-infix

When deciding between a table scan vs secondary index, you should try to
estimate what percent of the underlying data blocks will be used in the
query.  By default, each block is 64KB.

If each user's data is small and you are fitting multiple users per block,
then you're going to need all the blocks, so a tablescan is better because
it's simpler.  If each user has 1MB+ data then you will want to pick out
the individual blocks relevant to each date.  The secondary index will help
you go directly to those sparse blocks, but with a cost in complexity,
consistency, and extra denormalized data that knocks primary data out of
your block cache.

If latency is not a concern, I would start with the table scan.  If that's
too slow you add the secondary index, and if you still need it faster you
do the primary key lookups in parallel as Jerry mentions.

Matt

On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Chris:
>
> I'm thinking about building a secondary index for primary key lookup, then
> query using the primary keys in parallel.
>
> I'm interested to see if there is other option too.
>
> Best Regards,
>
> Jerry
>
> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
> >wrote:
>
> > Hello there,
> >
> > I designed a row key for queries that need best performance (~100 ms)
> > which looks like this:
> >
> > userId-date-sessionId
> >
> > These queries(scans) are always based on a userId and sometimes
> > additionally on a date, too.
> > That's no problem with the key above.
> >
> > However, another kind of queries shall be based on a given time range
> > whereas the outermost left userId is not given or known.
> > In this case I need to get all rows covering the given time range with
> > their date to create a daily reporting.
> >
> > As I can't set wildcards at the beginning of a left-based index for the
> > scan,
> > I only see the possibility to scan the index of the whole table to
> collect
> > the
> > rowKeys that are inside the timerange I'm interested in.
> >
> > Is there a more elegant way to collect rows within time range X?
> > (Unfortunately, the date attribute is not equal to the timestamp that is
> > stored by hbase automatically.)
> >
> > Could/should one maybe leverage some kind of row key caching to
> accelerate
> > the collection process?
> > Is that covered by the block cache?
> >
> > Thanks in advance for any advice.
> >
> > regards
> > Chris
> >
>


Re: How to query by rowKey-infix

Posted by Matt Corgan <mc...@hotpads.com>.
When deciding between a table scan vs secondary index, you should try to
estimate what percent of the underlying data blocks will be used in the
query.  By default, each block is 64KB.

If each user's data is small and you are fitting multiple users per block,
then you're going to need all the blocks, so a tablescan is better because
it's simpler.  If each user has 1MB+ data then you will want to pick out
the individual blocks relevant to each date.  The secondary index will help
you go directly to those sparse blocks, but with a cost in complexity,
consistency, and extra denormalized data that knocks primary data out of
your block cache.

If latency is not a concern, I would start with the table scan.  If that's
too slow you add the secondary index, and if you still need it faster you
do the primary key lookups in parallel as Jerry mentions.
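
For illustration, the parallel lookups could look roughly like this (a sketch;
the table name and thread count are arbitrary, and each task opens its own
HTable because HTable instances are not thread-safe; an HTablePool would avoid
the per-get setup cost):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ParallelGets {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    // primary row keys previously collected from the secondary index
    List<byte[]> rowKeysFromIndex = new ArrayList<byte[]>();
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<Result>> futures = new ArrayList<Future<Result>>();
    for (final byte[] rowKey : rowKeysFromIndex) {
      futures.add(pool.submit(new Callable<Result>() {
        public Result call() throws Exception {
          HTable table = new HTable(conf, "sessions");
          try {
            return table.get(new Get(rowKey));
          } finally {
            table.close();
          }
        }
      }));
    }
    for (Future<Result> f : futures) {
      Result r = f.get(); // process each row for the report here
    }
    pool.shutdown();
  }
}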

Matt

On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Chris:
>
> I'm thinking about building a secondary index for primary key lookup, then
> query using the primary keys in parallel.
>
> I'm interested to see if there is other option too.
>
> Best Regards,
>
> Jerry
>
> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
> >wrote:
>
> > Hello there,
> >
> > I designed a row key for queries that need best performance (~100 ms)
> > which looks like this:
> >
> > userId-date-sessionId
> >
> > These queries(scans) are always based on a userId and sometimes
> > additionally on a date, too.
> > That's no problem with the key above.
> >
> > However, another kind of queries shall be based on a given time range
> > whereas the outermost left userId is not given or known.
> > In this case I need to get all rows covering the given time range with
> > their date to create a daily reporting.
> >
> > As I can't set wildcards at the beginning of a left-based index for the
> > scan,
> > I only see the possibility to scan the index of the whole table to
> collect
> > the
> > rowKeys that are inside the timerange I'm interested in.
> >
> > Is there a more elegant way to collect rows within time range X?
> > (Unfortunately, the date attribute is not equal to the timestamp that is
> > stored by hbase automatically.)
> >
> > Could/should one maybe leverage some kind of row key caching to
> accelerate
> > the collection process?
> > Is that covered by the block cache?
> >
> > Thanks in advance for any advice.
> >
> > regards
> > Chris
> >
>

Re: How to query by rowKey-infix

Posted by Jerry Lam <ch...@gmail.com>.
Hi Chris:

I'm thinking about building a secondary index for primary key lookup, then
query using the primary keys in parallel.

I'm interested to see if there are other options too.

Best Regards,

Jerry

On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <sy...@yahoo.de> wrote:

> Hello there,
>
> I designed a row key for queries that need best performance (~100 ms)
> which looks like this:
>
> userId-date-sessionId
>
> These queries(scans) are always based on a userId and sometimes
> additionally on a date, too.
> That's no problem with the key above.
>
> However, another kind of queries shall be based on a given time range
> whereas the outermost left userId is not given or known.
> In this case I need to get all rows covering the given time range with
> their date to create a daily reporting.
>
> As I can't set wildcards at the beginning of a left-based index for the
> scan,
> I only see the possibility to scan the index of the whole table to collect
> the
> rowKeys that are inside the timerange I'm interested in.
>
> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is
> stored by hbase automatically.)
>
> Could/should one maybe leverage some kind of row key caching to accelerate
> the collection process?
> Is that covered by the block cache?
>
> Thanks in advance for any advice.
>
> regards
> Chris
>