Posted to user@hbase.apache.org by Steinmaurer Thomas <Th...@scch.at> on 2011/12/14 09:38:14 UTC

Questions on timestamps, insights on how timerange/timestamp filter are processed?

Hello,

can anybody share some insights on how timerange/timestamp filters are
processed?

Basically we intend to use timerange/timestamp filters to process rather
new data from an insertion timestamp POV.

- How does the process of skipping records and/or regions work, if one
uses timerange filters?
- I also wonder, do timestamps change when e.g. running a major
compaction?
- If data grows over the years, is there any chance that regions with
"older" rows stay "stable" in such a way that they can be skipped very
quickly when querying data with a timerange filter of e.g. the last
three years?

Thanks,
Thomas

RE: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Steinmaurer Thomas <Th...@scch.at>.

In our tests, filtering rows by timestamp was much faster than using
a filter that results in a full table scan. But I question how reliable
it is to use the internal timestamp to detect new data, and whether this
still scales as the amount of data grows over the years.

Regards,
Thomas

-----Original Message-----
From: Carson Hoffacker [mailto:choffacker@gmail.com] 
Sent: Thursday, 15 December 2011 05:36
To: user@hbase.apache.org; Stuart Smith
Subject: Re: Questions on timestamps, insights on how
timerange/timestamp filter are processed?

I believe it's the same amount of work.

Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Carson Hoffacker <ch...@gmail.com>.

I believe it's the same amount of work.

On Wed, Dec 14, 2011 at 3:37 PM, Stuart Smith <st...@yahoo.com> wrote:

> Ah. Thanks for clarifying my wrong answer.. !
>
> The only time I had to deal with timestamps I had to go through the thrift
> API ...
> Never noticed the setTimeRange in the Scan() java API :)
>
> So now I'm curious.. If I use this and it can't skip HFiles.. is there any
> performance gain from doing this vs doing it client side?
> Or is it basically the same amount of work - a full scan checking &
> skipping timestamps.. ?
>
>
> Take care,
>   -stu
>
>
>
> ________________________________
>  From: Carson Hoffacker <ch...@gmail.com>
> To: user@hbase.apache.org; Stuart Smith <st...@yahoo.com>
> Sent: Wednesday, December 14, 2011 10:29 AM
> Subject: Re: Questions on timestamps, insights on how timerange/timestamp
> filter are processed?
>
> The timerange scan is able to leverage metadata in each of the HFiles. Each
> HFile should store information about the timerange associated with the data
> within the HFile. If the timerange associated with the HFile does not
> overlap the timerange you are interested in, that HFile will be
> skipped completely. This can significantly increase scan performance.
>
> However, when these files get compacted and the data is merged into a
> smaller number of files, the time range associated with each file
> increases. I don't think it works this way out of the box, but I believe
> you can be smart about how you manage compactions over time to get the
> behavior that you want. You could have compactions compact all the data
> from January 2011 into a single file, and then compact all the data from
> February 2011 into a different file.
>
> -Carson
>
> On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <st...@yahoo.com> wrote:
>
> > Hello Thomas,
> >
> >    Someone here could probably provide more help, but to start you off,
> > the only way I've filtered timestamps is to do a scan, and just filter out
> > rows one by one. This definitely sounds like something coprocessors could
> > help with, but I don't really understand those yet, so someone else will
> > have to step up.. or you can really dig into the documentation about them
> > (AFAIK, it's a little bit of custom code that runs on the regionservers
> > that can pre-process your gets.. but don't quote me on that!).
> >
> > But I can say that a major compaction should not affect them - I've never
> > seen it happen, and if it does, I believe that's a bug.
> >
> > Take care,
> >   -stu
> >
> >
> >
> > ________________________________
> >  From: Steinmaurer Thomas <Th...@scch.at>
> > To: user@hbase.apache.org
> > Sent: Wednesday, December 14, 2011 12:38 AM
> > Subject: Questions on timestamps, insights on how timerange/timestamp
> > filter are processed?
> >
> > Hello,
> >
> > can anybody share some insights on how timerange/timestamp filters are
> > processed?
> >
> > Basically we intend to use timerange/timestamp filters to process rather
> > new data from an insertion timestamp POV.
> >
> > - How does the process of skipping records and/or regions work, if one
> > uses timerange filters?
> > - I also wonder, do timestamps change when e.g. running a major
> > compaction?
> > - If data grows over the years, is there any chance that regions with
> > "older" rows stay "stable" in such a way that they can be skipped very
> > quickly when querying data with a timerange filter of e.g. the last
> > three years?
> >
> > Thanks,
> > Thomas
> >
>

Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Stuart Smith <st...@yahoo.com>.

Ah. Thanks for clarifying my wrong answer.. !

The only time I had to deal with timestamps I had to go through the thrift API ...
Never noticed the setTimeRange in the Scan() java API :)

So now I'm curious.. If I use this and it can't skip HFiles.. is there any performance gain from doing this vs doing it client side?
Or is it basically the same amount of work - a full scan checking & skipping timestamps.. ?
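
One way to frame this question is with a toy model. It is only a sketch, not HBase code, and every name in it (`FilterPlacementSketch`, `Cell`, `Stats`, `filterAtSource`, `filterAtCaller`) is made up for illustration. It assumes no file can be skipped, so every stored cell is examined either way. What differs is how many cells are handed back to the caller, and the model says nothing about actual timings.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not HBase code. All names are hypothetical.
class FilterPlacementSketch {

    // One stored value with its write timestamp.
    static final class Cell {
        final String row;
        final long ts;

        Cell(String row, long ts) {
            this.row = row;
            this.ts = ts;
        }
    }

    // Counters for one scan: cells checked, cells handed to the caller, cells that matched.
    static final class Stats {
        final int examined;
        final int transferred;
        final int matched;

        Stats(int examined, int transferred, int matched) {
            this.examined = examined;
            this.transferred = transferred;
            this.matched = matched;
        }
    }

    // Half-open range check, [minStamp, maxStamp).
    static boolean inRange(Cell c, long minStamp, long maxStamp) {
        return c.ts >= minStamp && c.ts < maxStamp;
    }

    // Filter applied where the data is stored: all cells are checked, only matches are returned.
    static Stats filterAtSource(List<Cell> stored, long minStamp, long maxStamp) {
        int matched = 0;
        for (Cell c : stored) {
            if (inRange(c, minStamp, maxStamp)) {
                matched++;
            }
        }
        return new Stats(stored.size(), matched, matched);
    }

    // Filter applied by the caller: all cells are returned first, then checked.
    static Stats filterAtCaller(List<Cell> stored, long minStamp, long maxStamp) {
        List<Cell> shipped = new ArrayList<>(stored); // stands in for the transfer to the client
        int matched = 0;
        for (Cell c : shipped) {
            if (inRange(c, minStamp, maxStamp)) {
                matched++;
            }
        }
        return new Stats(shipped.size(), shipped.size(), matched);
    }

    public static void main(String[] args) {
        List<Cell> stored = new ArrayList<>();
        stored.add(new Cell("r1", 10));
        stored.add(new Cell("r2", 20));
        stored.add(new Cell("r3", 30));
        stored.add(new Cell("r4", 40));

        Stats atSource = filterAtSource(stored, 25, 50);
        Stats atCaller = filterAtCaller(stored, 25, 50);
        System.out.println("source: examined=" + atSource.examined + " transferred=" + atSource.transferred);
        System.out.println("caller: examined=" + atCaller.examined + " transferred=" + atCaller.transferred);
    }
}
```

In this model the number of cells examined is identical in both cases; only the number transferred differs.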

Take care,
  -stu

________________________________
 From: Carson Hoffacker <ch...@gmail.com>
To: user@hbase.apache.org; Stuart Smith <st...@yahoo.com> 
Sent: Wednesday, December 14, 2011 10:29 AM
Subject: Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

The timerange scan is able to leverage metadata in each of the HFiles. Each
HFile should store information about the timerange associated with the data
within the HFile. If the timerange associated with the HFile does not
overlap the timerange you are interested in, that HFile will be
skipped completely. This can significantly increase scan performance.

However, when these files get compacted and the data is merged into a
smaller number of files, the time range associated with each file
increases. I don't think it works this way out of the box, but I believe
you can be smart about how you manage compactions over time to get the
behavior that you want. You could have compactions compact all the data
from January 2011 into a single file, and then compact all the data from
February 2011 into a different file.

-Carson

On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <st...@yahoo.com> wrote:

> Hello Thomas,
>
>    Someone here could probably provide more help, but to start you off,
> the only way I've filtered timestamps is to do a scan, and just filter out
> rows one by one. This definitely sounds like something coprocessors could
> help with, but I don't really understand those yet, so someone else will
> have to step up.. or you can really dig into the documentation about them
> (AFAIK, it's a little bit of custom code that runs on the regionservers
> that can pre-process your gets.. but don't quote me on that!).
>
> But I can say that a major compaction should not affect them - I've never
> seen it happen, and if it does, I believe that's a bug.
>
> Take care,
>   -stu
>
>
>
> ________________________________
>  From: Steinmaurer Thomas <Th...@scch.at>
> To: user@hbase.apache.org
> Sent: Wednesday, December 14, 2011 12:38 AM
> Subject: Questions on timestamps, insights on how timerange/timestamp
> filter are processed?
>
> Hello,
>
> can anybody share some insights on how timerange/timestamp filters are
> processed?
>
> Basically we intend to use timerange/timestamp filters to process rather
> new data from an insertion timestamp POV.
>
> - How does the process of skipping records and/or regions work, if one
> uses timerange filters?
> - I also wonder, do timestamps change when e.g. running a major
> compaction?
> - If data grows over the years, is there any chance that regions with
> "older" rows stay "stable" in such a way that they can be skipped very
> quickly when querying data with a timerange filter of e.g. the last
> three years?
>
> Thanks,
> Thomas
>

Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Sam Seigal <se...@yahoo.com>.

That is an interesting comment. How would you enforce this in practice?
Can you give more details?

On Wed, Dec 14, 2011 at 10:29 AM, Carson Hoffacker <ch...@gmail.com> wrote:
> The timerange scan is able to leverage metadata in each of the HFiles. Each
> HFile should store information about the timerange associated with the data
> within the HFile. If the timerange associated with the HFile does not
> overlap the timerange you are interested in, that HFile will be
> skipped completely. This can significantly increase scan performance.
>
> However, when these files get compacted and the data is merged into a
> smaller number of files, the time range associated with each file
> increases. I don't think it works this way out of the box, but I believe
> you can be smart about how you manage compactions over time to get the
> behavior that you want. You could have compactions compact all the data
> from January 2011 into a single file, and then compact all the data from
> February 2011 into a different file.
>
> -Carson
>
> On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <st...@yahoo.com> wrote:
>
>> Hello Thomas,
>>
>>    Someone here could probably provide more help, but to start you off,
>> the only way I've filtered timestamps is to do a scan, and just filter out
>> rows one by one. This definitely sounds like something coprocessors could
>> help with, but I don't really understand those yet, so someone else will
>> have to step up.. or you can really dig into the documentation about them
>> (AFAIK, it's a little bit of custom code that runs on the regionservers
>> that can pre-process your gets.. but don't quote me on that!).
>>
>> But I can say that a major compaction should not affect them - I've never
>> seen it happen, and if it does, I believe that's a bug.
>>
>> Take care,
>>   -stu
>>
>>
>>
>> ________________________________
>>  From: Steinmaurer Thomas <Th...@scch.at>
>> To: user@hbase.apache.org
>> Sent: Wednesday, December 14, 2011 12:38 AM
>> Subject: Questions on timestamps, insights on how timerange/timestamp
>> filter are processed?
>>
>> Hello,
>>
>> can anybody share some insights on how timerange/timestamp filters are
>> processed?
>>
>> Basically we intend to use timerange/timestamp filters to process rather
>> new data from an insertion timestamp POV.
>>
>> - How does the process of skipping records and/or regions work, if one
>> uses timerange filters?
>> - I also wonder, do timestamps change when e.g. running a major
>> compaction?
>> - If data grows over the years, is there any chance that regions with
>> "older" rows stay "stable" in such a way that they can be skipped very
>> quickly when querying data with a timerange filter of e.g. the last
>> three years?
>>
>> Thanks,
>> Thomas
>>

Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Carson Hoffacker <ch...@gmail.com>.

The timerange scan is able to leverage metadata in each of the HFiles. Each
HFile should store information about the timerange associated with the data
within the HFile. If the timerange associated with the HFile does not
overlap the timerange you are interested in, that HFile will be
skipped completely. This can significantly increase scan performance.
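
The skip decision described above can be sketched with a toy model. This is not HBase's actual code, and the names (`TimeRangeSkipSketch`, `FileTimeRange`, `overlaps`, `filesToRead`) are hypothetical. Each file carries the smallest and largest timestamp it contains, and a scan over the half-open range [minStamp, maxStamp) opens only the files whose range overlaps it. The half-open convention is chosen to match `Scan.setTimeRange(minStamp, maxStamp)`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not HBase code. All names are hypothetical.
class TimeRangeSkipSketch {

    // Smallest and largest cell timestamp stored in one file (both inclusive).
    static final class FileTimeRange {
        final String name;
        final long minTs;
        final long maxTs;

        FileTimeRange(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        // True if the file may hold cells inside the half-open scan range [minStamp, maxStamp).
        boolean overlaps(long minStamp, long maxStamp) {
            return minTs < maxStamp && maxTs >= minStamp;
        }
    }

    // Returns the names of the files a time-range scan would still have to open.
    static List<String> filesToRead(List<FileTimeRange> files, long minStamp, long maxStamp) {
        List<String> result = new ArrayList<>();
        for (FileTimeRange f : files) {
            if (f.overlaps(minStamp, maxStamp)) {
                result.add(f.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = new ArrayList<>();
        files.add(new FileTimeRange("old", 100, 199));
        files.add(new FileTimeRange("mid", 200, 299));
        files.add(new FileTimeRange("new", 300, 399));

        // Scan [250, 400): the file "old" is never opened.
        System.out.println(filesToRead(files, 250, 400));
    }
}
```

The check uses only per-file metadata, so a skipped file costs nothing beyond the comparison itself.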

However, when these files get compacted and the data is merged into a
smaller number of files, the time range associated with each file
increases. I don't think it works this way out of the box, but I believe
you can be smart about how you manage compactions over time to get the
behavior that you want. You could have compactions compact all the data
from January 2011 into a single file, and then compact all the data from
February 2011 into a different file.
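
To make the widening concrete, here is a small sketch, again a toy model with hypothetical names (`CompactionRangeSketch`, `Range`, `merge`, `mergeByMonth`). Merging files produces one file whose time range spans all inputs, while grouping the inputs by calendar month first keeps each output range narrow. It assumes each input file lies within a single UTC month, and it does not show whether or how HBase can be configured to compact this way.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: not HBase code. All names are hypothetical.
class CompactionRangeSketch {

    // Time range of one file in epoch milliseconds, both ends inclusive.
    static final class Range {
        final long minTs;
        final long maxTs;

        Range(long minTs, long maxTs) {
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Helper: ISO-8601 instant string to epoch milliseconds.
    static long ts(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    // Merging files yields one file whose range spans all inputs (inputs must be non-empty).
    static Range merge(List<Range> inputs) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (Range r : inputs) {
            min = Math.min(min, r.minTs);
            max = Math.max(max, r.maxTs);
        }
        return new Range(min, max);
    }

    // Groups files by the UTC month of their minimum timestamp and merges each group on its own.
    // Assumption: each input file lies within a single month.
    static Map<YearMonth, Range> mergeByMonth(List<Range> inputs) {
        Map<YearMonth, List<Range>> groups = new TreeMap<>();
        for (Range r : inputs) {
            YearMonth month = YearMonth.from(Instant.ofEpochMilli(r.minTs).atZone(ZoneOffset.UTC));
            groups.computeIfAbsent(month, k -> new ArrayList<>()).add(r);
        }
        Map<YearMonth, Range> merged = new TreeMap<>();
        for (Map.Entry<YearMonth, List<Range>> e : groups.entrySet()) {
            merged.put(e.getKey(), merge(e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> flushed = Arrays.asList(
                new Range(ts("2011-01-05T00:00:00Z"), ts("2011-01-06T00:00:00Z")),
                new Range(ts("2011-01-20T00:00:00Z"), ts("2011-01-21T00:00:00Z")),
                new Range(ts("2011-02-10T00:00:00Z"), ts("2011-02-11T00:00:00Z")));

        Range all = merge(flushed);
        System.out.println("single file: " + Instant.ofEpochMilli(all.minTs)
                + " .. " + Instant.ofEpochMilli(all.maxTs));

        for (Map.Entry<YearMonth, Range> e : mergeByMonth(flushed).entrySet()) {
            System.out.println(e.getKey() + ": " + Instant.ofEpochMilli(e.getValue().minTs)
                    + " .. " + Instant.ofEpochMilli(e.getValue().maxTs));
        }
    }
}
```

With a single merged file, a scan for February data can no longer skip anything; with one file per month, the January file can still be skipped.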

-Carson

On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <st...@yahoo.com> wrote:

> Hello Thomas,
>
>    Someone here could probably provide more help, but to start you off,
> the only way I've filtered timestamps is to do a scan, and just filter out
> rows one by one. This definitely sounds like something coprocessors could
> help with, but I don't really understand those yet, so someone else will
> have to step up.. or you can really dig into the documentation about them
> (AFAIK, it's a little bit of custom code that runs on the regionservers
> that can pre-process your gets.. but don't quote me on that!).
>
> But I can say that a major compaction should not affect them - I've never
> seen it happen, and if it does, I believe that's a bug.
>
> Take care,
>   -stu
>
>
>
> ________________________________
>  From: Steinmaurer Thomas <Th...@scch.at>
> To: user@hbase.apache.org
> Sent: Wednesday, December 14, 2011 12:38 AM
> Subject: Questions on timestamps, insights on how timerange/timestamp
> filter are processed?
>
> Hello,
>
> can anybody share some insights on how timerange/timestamp filters are
> processed?
>
> Basically we intend to use timerange/timestamp filters to process rather
> new data from an insertion timestamp POV.
>
> - How does the process of skipping records and/or regions work, if one
> uses timerange filters?
> - I also wonder, do timestamps change when e.g. running a major
> compaction?
> - If data grows over the years, is there any chance that regions with
> "older" rows stay "stable" in such a way that they can be skipped very
> quickly when querying data with a timerange filter of e.g. the last
> three years?
>
> Thanks,
> Thomas
>

Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

Posted by Stuart Smith <st...@yahoo.com>.

Hello Thomas,

   Someone here could probably provide more help, but to start you off, the only way I've filtered timestamps is to do a scan, and just filter out rows one by one. This definitely sounds like something coprocessors could help with, but I don't really understand those yet, so someone else will have to step up.. or you can really dig into the documentation about them (AFAIK, it's a little bit of custom code that runs on the regionservers that can pre-process your gets.. but don't quote me on that!).

But I can say that a major compaction should not affect them - I've never seen it happen, and if it does, I believe that's a bug.

Take care,
  -stu



________________________________
 From: Steinmaurer Thomas <Th...@scch.at>
To: user@hbase.apache.org 
Sent: Wednesday, December 14, 2011 12:38 AM
Subject: Questions on timestamps, insights on how timerange/timestamp filter are processed?
 
Hello,

can anybody share some insights on how timerange/timestamp filters are
processed?

Basically we intend to use timerange/timestamp filters to process rather
new data from an insertion timestamp POV.

- How does the process of skipping records and/or regions work, if one
uses timerange filters?
- I also wonder, do timestamps change when e.g. running a major
compaction?
- If data grows over the years, is there any chance that regions with
"older" rows stay "stable" in such a way that they can be skipped very
quickly when querying data with a timerange filter of e.g. the last
three years?

Thanks,
Thomas