You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Josh <jo...@gmail.com> on 2017/04/12 17:33:15 UTC

Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Hi,

I am just getting started with HBase, and have a question about the
efficiency of timestamp based scans.

My table's row key has structure `uuid#reverse_timestamp` where
reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when the
row was written). For a given uuid I want to be able to retrieve the most
recent 10 rows in the table where timestamp is greater than x. It's
possible that a given uuid may have many thousands of rows (with different
timestamps).

I found there are two ways to run my query:
1. use HBase's built in timestamps and scan a time range:
> scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
TIMERANGE => [x, current_time], LIMIT => 10}

2. use only my row keys to do the scan, with STARTROW and STOPROW:
scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}

Both of these seem to work - but is one more efficient that the other?

Thanks for any advice,
Josh

Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Posted by Josh <jo...@gmail.com>.
Hi Josh,

Ok thanks for the insights. I will stick to using STARTROW and STOPROW
along with a reverse timestamp in the row key for these queries then.

Thanks!

On Thu, Apr 13, 2017 at 4:36 AM, Josh Elser <jo...@gmail.com> wrote:

> If it's helpful to state it in generic terms: specifying a range of
> HBase timestamps is *only* a post-filter (server-side) and *never* a
> primary search criteria.
>
> In other words, searching by the HBase timestamp is a full-table scan
> (exhaustive search). While the timestamp can be nice for certain
> use-cases, I think it best to use it sparingly and only for its
> original purpose (preserving old versions of a cell), and not as a
> "query parameter". I've seen way too many use-cases where people think
> they're being tricky by using timestamps to implement a poor-man's
> search. This works for small amounts of data, but quickly falls to its
> knees with non-trivial data-sets.
>
> On Wed, Apr 12, 2017 at 3:05 PM, Ted Yu <yu...@gmail.com> wrote:
> > If you change the rowkey, it would be tricky to satisfy what you stated
> in
> > your first email:
> >
> > bq. retrieve the most recent 10 rows
> >
> > Since the same uuid may have many rows which are no longer sorted in the
> > order amenable to your query.
> >
> > Cheers
> >
> > On Wed, Apr 12, 2017 at 10:43 AM, Josh <jo...@gmail.com> wrote:
> >
> >> Hi Ted,
> >>
> >> Thanks for the fast reply!
> >> Ok I see - just out of interest, if I changed my row key to be
> >> uuid#timestamp  (instead of uuid#reverse_timestamp) - would the
> timestamp
> >> approach still be equally efficient? I just want to understand whether
> or
> >> not the timestamp approach is relying on the ordering of my row keys.
> >>
> >> Josh
> >>
> >> On Wed, Apr 12, 2017 at 6:39 PM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> > Since STARTROW is specified (with uuid) in both of your examples, I
> think
> >> > their efficiency should be tantamount.
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Apr 12, 2017 at 10:33 AM, Josh <jo...@gmail.com> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I am just getting started with HBase, and have a question about the
> >> > > efficiency of timestamp based scans.
> >> > >
> >> > > My table's row key has structure `uuid#reverse_timestamp` where
> >> > > reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when
> >> the
> >> > > row was written). For a given uuid I want to be able to retrieve the
> >> most
> >> > > recent 10 rows in the table where timestamp is greater than x. It's
> >> > > possible that a given uuid may have many thousands of rows (with
> >> > different
> >> > > timestamps).
> >> > >
> >> > > I found there are two ways to run my query:
> >> > > 1. use HBase's built in timestamps and scan a time range:
> >> > > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-
> f498ddc8adcc',
> >> > > TIMERANGE => [x, current_time], LIMIT => 10}
> >> > >
> >> > > 2. use only my row keys to do the scan, with STARTROW and STOPROW:
> >> > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-
> f498ddc8adcc',
> >> > > STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
> >> > >
> >> > > Both of these seem to work - but is one more efficient that the
> other?
> >> > >
> >> > > Thanks for any advice,
> >> > > Josh
> >> > >
> >> >
> >>
>

Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Posted by Josh Elser <jo...@gmail.com>.
If it's helpful to state it in generic terms: specifying a range of
HBase timestamps is *only* a post-filter (server-side) and *never* a
primary search criteria.

In other words, searching by the HBase timestamp is a full-table scan
(exhaustive search). While the timestamp can be nice for certain
use-cases, I think it best to use it sparingly and only for its
original purpose (preserving old versions of a cell), and not as a
"query parameter". I've seen way too many use-cases where people think
they're being tricky by using timestamps to implement a poor-man's
search. This works for small amounts of data, but quickly falls to its
knees with non-trivial data-sets.

On Wed, Apr 12, 2017 at 3:05 PM, Ted Yu <yu...@gmail.com> wrote:
> If you change the rowkey, it would be tricky to satisfy what you stated in
> your first email:
>
> bq. retrieve the most recent 10 rows
>
> Since the same uuid may have many rows which are no longer sorted in the
> order amenable to your query.
>
> Cheers
>
> On Wed, Apr 12, 2017 at 10:43 AM, Josh <jo...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> Thanks for the fast reply!
>> Ok I see - just out of interest, if I changed my row key to be
>> uuid#timestamp  (instead of uuid#reverse_timestamp) - would the timestamp
>> approach still be equally efficient? I just want to understand whether or
>> not the timestamp approach is relying on the ordering of my row keys.
>>
>> Josh
>>
>> On Wed, Apr 12, 2017 at 6:39 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > Since STARTROW is specified (with uuid) in both of your examples, I think
>> > their efficiency should be tantamount.
>> >
>> > Cheers
>> >
>> > On Wed, Apr 12, 2017 at 10:33 AM, Josh <jo...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I am just getting started with HBase, and have a question about the
>> > > efficiency of timestamp based scans.
>> > >
>> > > My table's row key has structure `uuid#reverse_timestamp` where
>> > > reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when
>> the
>> > > row was written). For a given uuid I want to be able to retrieve the
>> most
>> > > recent 10 rows in the table where timestamp is greater than x. It's
>> > > possible that a given uuid may have many thousands of rows (with
>> > different
>> > > timestamps).
>> > >
>> > > I found there are two ways to run my query:
>> > > 1. use HBase's built in timestamps and scan a time range:
>> > > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
>> > > TIMERANGE => [x, current_time], LIMIT => 10}
>> > >
>> > > 2. use only my row keys to do the scan, with STARTROW and STOPROW:
>> > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
>> > > STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
>> > >
>> > > Both of these seem to work - but is one more efficient that the other?
>> > >
>> > > Thanks for any advice,
>> > > Josh
>> > >
>> >
>>

Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Posted by Ted Yu <yu...@gmail.com>.
If you change the rowkey, it would be tricky to satisfy what you stated in
your first email:

bq. retrieve the most recent 10 rows

Since the same uuid may have many rows which are no longer sorted in the
order amenable to your query.

Cheers

On Wed, Apr 12, 2017 at 10:43 AM, Josh <jo...@gmail.com> wrote:

> Hi Ted,
>
> Thanks for the fast reply!
> Ok I see - just out of interest, if I changed my row key to be
> uuid#timestamp  (instead of uuid#reverse_timestamp) - would the timestamp
> approach still be equally efficient? I just want to understand whether or
> not the timestamp approach is relying on the ordering of my row keys.
>
> Josh
>
> On Wed, Apr 12, 2017 at 6:39 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Since STARTROW is specified (with uuid) in both of your examples, I think
> > their efficiency should be tantamount.
> >
> > Cheers
> >
> > On Wed, Apr 12, 2017 at 10:33 AM, Josh <jo...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am just getting started with HBase, and have a question about the
> > > efficiency of timestamp based scans.
> > >
> > > My table's row key has structure `uuid#reverse_timestamp` where
> > > reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when
> the
> > > row was written). For a given uuid I want to be able to retrieve the
> most
> > > recent 10 rows in the table where timestamp is greater than x. It's
> > > possible that a given uuid may have many thousands of rows (with
> > different
> > > timestamps).
> > >
> > > I found there are two ways to run my query:
> > > 1. use HBase's built in timestamps and scan a time range:
> > > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> > > TIMERANGE => [x, current_time], LIMIT => 10}
> > >
> > > 2. use only my row keys to do the scan, with STARTROW and STOPROW:
> > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> > > STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
> > >
> > > Both of these seem to work - but is one more efficient that the other?
> > >
> > > Thanks for any advice,
> > > Josh
> > >
> >
>

Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Posted by Josh <jo...@gmail.com>.
Hi Ted,

Thanks for the fast reply!
Ok I see - just out of interest, if I changed my row key to be
uuid#timestamp  (instead of uuid#reverse_timestamp) - would the timestamp
approach still be equally efficient? I just want to understand whether or
not the timestamp approach is relying on the ordering of my row keys.

Josh

On Wed, Apr 12, 2017 at 6:39 PM, Ted Yu <yu...@gmail.com> wrote:

> Since STARTROW is specified (with uuid) in both of your examples, I think
> their efficiency should be tantamount.
>
> Cheers
>
> On Wed, Apr 12, 2017 at 10:33 AM, Josh <jo...@gmail.com> wrote:
>
> > Hi,
> >
> > I am just getting started with HBase, and have a question about the
> > efficiency of timestamp based scans.
> >
> > My table's row key has structure `uuid#reverse_timestamp` where
> > reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when the
> > row was written). For a given uuid I want to be able to retrieve the most
> > recent 10 rows in the table where timestamp is greater than x. It's
> > possible that a given uuid may have many thousands of rows (with
> different
> > timestamps).
> >
> > I found there are two ways to run my query:
> > 1. use HBase's built in timestamps and scan a time range:
> > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> > TIMERANGE => [x, current_time], LIMIT => 10}
> >
> > 2. use only my row keys to do the scan, with STARTROW and STOPROW:
> > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> > STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
> >
> > Both of these seem to work - but is one more efficient that the other?
> >
> > Thanks for any advice,
> > Josh
> >
>

Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?

Posted by Ted Yu <yu...@gmail.com>.
Since STARTROW is specified (with uuid) in both of your examples, I think
their efficiency should be tantamount.

Cheers

On Wed, Apr 12, 2017 at 10:33 AM, Josh <jo...@gmail.com> wrote:

> Hi,
>
> I am just getting started with HBase, and have a question about the
> efficiency of timestamp based scans.
>
> My table's row key has structure `uuid#reverse_timestamp` where
> reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when the
> row was written). For a given uuid I want to be able to retrieve the most
> recent 10 rows in the table where timestamp is greater than x. It's
> possible that a given uuid may have many thousands of rows (with different
> timestamps).
>
> I found there are two ways to run my query:
> 1. use HBase's built in timestamps and scan a time range:
> > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> TIMERANGE => [x, current_time], LIMIT => 10}
>
> 2. use only my row keys to do the scan, with STARTROW and STOPROW:
> scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-f498ddc8adcc',
> STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
>
> Both of these seem to work - but is one more efficient that the other?
>
> Thanks for any advice,
> Josh
>