Posted to user@hbase.apache.org by Tom Brown <to...@gmail.com> on 2012/09/12 23:52:28 UTC

Performance of scan setTimeRange VS manually doing it

When I query HBase, I always include a time range. This has not been a
problem when querying recent data, but it seems to be an issue when I
query older data (a few hours old). All of my row keys include the
timestamp as part of the key (this value is the same as the HBase
timestamp for the row).  I recently tried an experiment where I
manually re-seek to the possible row (based on the timestamp as part
of the row key) instead of using "setTimeRange" on my scan object and
was amazed to see that there was no degradation for older data.

Can someone postulate a theory as to why this might be happening? I'm
happy to provide extra data if it will help you theorize...

Is there a downside to stopping using "setTimeRange"?

--Tom
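
A minimal sketch of the manual approach described above, using only the JDK and no HBase client. It assumes a hypothetical key layout in which the row key starts with the timestamp as an 8-byte big-endian value (the post only says the timestamp is part of the key), and that timestamps are non-negative. Under those assumptions, a time range maps directly onto a start key and a stop key, so the scan's start and stop rows can do the work of the time range. The class and method names are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TimestampKeySketch {
    // Hypothetical key layout (an assumption, not taken from the post):
    // 8-byte big-endian timestamp followed by an arbitrary id suffix.
    static byte[] rowKey(long timestampMillis, String id) {
        byte[] suffix = id.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + suffix.length)
                .putLong(timestampMillis)
                .put(suffix)
                .array();
    }

    // The smallest key that can carry the given timestamp: the bare prefix.
    static byte[] boundaryKey(long timestampMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
    }

    // True when the key sorts inside [boundaryKey(minTs), boundaryKey(maxTs)),
    // comparing bytes as unsigned values, which is how row keys are ordered.
    // Only valid for non-negative timestamps.
    static boolean inKeyRange(byte[] key, long minTs, long maxTsExclusive) {
        return Arrays.compareUnsigned(key, boundaryKey(minTs)) >= 0
            && Arrays.compareUnsigned(key, boundaryKey(maxTsExclusive)) < 0;
    }

    public static void main(String[] args) {
        byte[] key = rowKey(1_500L, "sensor-7");
        System.out.println(inKeyRange(key, 1_000L, 2_000L));
        System.out.println(inKeyRange(key, 2_000L, 3_000L));
    }
}
```

The two boundary keys are what would be handed to the scan as its start row (inclusive) and stop row (exclusive). If the timestamp is not the leading part of the key, this mapping does not hold as written.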

RE: Performance of scan setTimeRange VS manually doing it

Posted by Anoop Sam John <an...@huawei.com>.
@Tom
I think your guess is correct. When an HFile cannot be skipped because its min and max timestamps overlap the given time range, that file is scanned fully and the rows outside the range are filtered out. Those rows are still read from HDFS.
When you do the reseeks, many such reads can be avoided. Remember that HFiles are split into blocks, and we read from HDFS one block after another, so these reseeks may skip many blocks.

-Anoop-
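
A rough sketch of why a reseek can skip blocks, using only the JDK. It models a file as a list of blocks and keeps only the first key of each block, which is the idea behind a block index; the keys are plain long values here for brevity. The class, the method names, and the sample numbers are assumptions made for illustration, not HBase code.

```java
import java.util.Arrays;

class BlockIndexSketch {
    // firstKeys[i] is the first key stored in block i, in ascending order.
    // Returns the index of the only block that can contain targetKey.
    static int blockFor(long[] firstKeys, long targetKey) {
        int idx = Arrays.binarySearch(firstKeys, targetKey);
        if (idx >= 0) {
            return idx; // targetKey is exactly the first key of a block
        }
        int insertionPoint = -idx - 1; // first block that starts after targetKey
        return Math.max(0, insertionPoint - 1);
    }

    // Number of whole blocks a reseek jumps over when the scanner is
    // currently positioned in currentBlock and seeks forward to targetKey.
    static int blocksSkipped(long[] firstKeys, int currentBlock, long targetKey) {
        return Math.max(0, blockFor(firstKeys, targetKey) - currentBlock - 1);
    }

    public static void main(String[] args) {
        long[] firstKeys = {0L, 100L, 200L, 300L};
        System.out.println("target block: " + blockFor(firstKeys, 250L));
        System.out.println("blocks skipped: " + blocksSkipped(firstKeys, 0, 350L));
    }
}
```

A scan that only filters has to read every block between its current position and the target; the seek consults the index and goes straight to the block that can hold the key.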
________________________________________
From: Tom Brown [tombrown52@gmail.com]
Sent: Thursday, September 13, 2012 4:12 AM
To: user@hbase.apache.org
Subject: Re: Performance of scan setTimeRange VS manually doing it

It seems like the internal logic for handling a time range is
two-part. First, as you said, each file contains the minimum and
maximum timestamps contained within. This provides a very rough filter
for the data, but if your data is right, the effect can be huge.
Second, a time range acts as a simple filter during a scan: while
looking for the next row to return, it checks whether the timestamp for
the row is within the time range, returns that row if it is, and
continues to the next row if it isn't.

What it *doesn't* appear to do, however, is reseek to the row with the
minimum timestamp. Since my row key also contains a copy of the
timestamp, a reseek is able to bypass a lot of rows that the generic
logic would test individually. Perhaps HBase itself could be made to
work this way, but I'm unsure enough of its internal workings that I
can't say for sure.

(The above is my best guess; let me know if something about that
explanation doesn't smell right.)

--Tom

On Wed, Sep 12, 2012 at 4:08 PM, n keywal <nk...@gmail.com> wrote:
> For each file, there is a time range. When you scan/search, the file is
> skipped if there is no overlap between the file's time range and the time
> range of the query. Since other factors are also in play (row distribution,
> compaction effects, cache, bloom filters, ...), it's difficult to know in
> advance exactly what's going to happen. But specifying a time range
> certainly does no harm, if it matches your functional needs...
>
> That said, if you already have the row key, the time range is less
> interesting, as you will skip a lot of files already.

Re: Performance of scan setTimeRange VS manually doing it

Posted by Tom Brown <to...@gmail.com>.
It seems like the internal logic for handling a time range is
two-part. First, as you said, each file contains the minimum and
maximum timestamps contained within. This provides a very rough filter
for the data, but if your data is right, the effect can be huge.
Second, a time range acts as a simple filter during a scan: while
looking for the next row to return, it checks whether the timestamp for
the row is within the time range, returns that row if it is, and
continues to the next row if it isn't.

What it *doesn't* appear to do, however, is reseek to the row with the
minimum timestamp. Since my row key also contains a copy of the
timestamp, a reseek is able to bypass a lot of rows that the generic
logic would test individually. Perhaps HBase itself could be made to
work this way, but I'm unsure enough of its internal workings that I
can't say for sure.

(The above is my best guess; let me know if something about that
explanation doesn't smell right.)

--Tom
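
To make the guess above concrete, here is a toy model in plain Java (no HBase classes) that counts how many rows each strategy has to examine. A TreeMap keyed by timestamp stands in for rows sorted by a timestamp-leading row key; that layout, the class name, and the row counts are assumptions for illustration. The filtering path cannot stop early because the generic logic does not know that the row key order follows the timestamp.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class FilterVersusSeekSketch {
    // Builds rows with timestamps 0..count-1, sorted by timestamp.
    static NavigableMap<Long, String> sampleRows(int count) {
        NavigableMap<Long, String> rows = new TreeMap<>();
        for (long ts = 0; ts < count; ts++) {
            rows.put(ts, "row-" + ts);
        }
        return rows;
    }

    // Generic path: visit every row and test its timestamp.
    // Returns {rowsExamined, rowsReturned}.
    static int[] scanWithFilter(NavigableMap<Long, String> rows, long minTs, long maxTsExclusive) {
        int examined = 0;
        int returned = 0;
        for (Map.Entry<Long, String> e : rows.entrySet()) {
            examined++;
            if (e.getKey() >= minTs && e.getKey() < maxTsExclusive) {
                returned++;
            }
        }
        return new int[] {examined, returned};
    }

    // Manual path: jump to the first candidate key and stop at the last one.
    // Returns {rowsExamined, rowsReturned}.
    static int[] scanWithSeek(NavigableMap<Long, String> rows, long minTs, long maxTsExclusive) {
        int examined = rows.subMap(minTs, true, maxTsExclusive, false).size();
        return new int[] {examined, examined};
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> rows = sampleRows(1000);
        int[] filtered = scanWithFilter(rows, 900L, 910L);
        int[] seeked = scanWithSeek(rows, 900L, 910L);
        System.out.println("filter examined " + filtered[0] + ", returned " + filtered[1]);
        System.out.println("seek examined " + seeked[0] + ", returned " + seeked[1]);
    }
}
```

Both paths return the same rows; only the number of rows touched differs. This model ignores file-level skipping, caching, and block boundaries.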

Re: Performance of scan setTimeRange VS manually doing it

Posted by n keywal <nk...@gmail.com>.
For each file, there is a time range. When you scan/search, the file is
skipped if there is no overlap between the file's time range and the time
range of the query. Since other factors are also in play (row distribution,
compaction effects, cache, bloom filters, ...), it's difficult to know in
advance exactly what's going to happen. But specifying a time range
certainly does no harm, if it matches your functional needs...

That said, if you already have the row key, the time range is less
interesting, as you will skip a lot of files already.
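
The file-level check described above can be sketched in plain Java as follows. The FileMeta class, the file names, and the numbers are hypothetical; the sketch assumes each file records an inclusive minimum and maximum timestamp and that the query range is half-open, [scanMin, scanMax). It also illustrates one of the "compaction effects": a file that merges old and recent data covers a wide range, so it overlaps most queries and cannot be skipped.

```java
import java.util.ArrayList;
import java.util.List;

class FileTimeRangeSketch {
    // Hypothetical stand-in for per-file metadata (inclusive min and max timestamps).
    static final class FileMeta {
        final String name;
        final long minTs;
        final long maxTs;

        FileMeta(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // A file may hold matching data only if its range overlaps [scanMin, scanMaxExclusive).
    static boolean overlaps(FileMeta file, long scanMin, long scanMaxExclusive) {
        return file.maxTs >= scanMin && file.minTs < scanMaxExclusive;
    }

    // Names of the files that cannot be skipped for the given query range.
    static List<String> filesToRead(List<FileMeta> files, long scanMin, long scanMaxExclusive) {
        List<String> result = new ArrayList<>();
        for (FileMeta file : files) {
            if (overlaps(file, scanMin, scanMaxExclusive)) {
                result.add(file.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileMeta> files = new ArrayList<>();
        files.add(new FileMeta("recent", 900L, 1000L));
        files.add(new FileMeta("old", 0L, 100L));
        files.add(new FileMeta("compacted", 0L, 800L));
        System.out.println(filesToRead(files, 950L, 960L));
        System.out.println(filesToRead(files, 50L, 60L));
    }
}
```

A file that survives this check is not necessarily full of matching rows; the check only rules out files that cannot contain any.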

Re: Performance of scan setTimeRange VS manually doing it

Posted by Xiang Hua <be...@gmail.com>.
Hi,
   Do you have a script in Python for rack awareness configuration?

  Thanks!

beatls

