Posted to user@hbase.apache.org by Leif Wickland <le...@gmail.com> on 2011/09/09 20:32:37 UTC

Performance characteristics of scans using timestamp as the filter

(Apologies if this has been answered before.  I couldn't find anything in
the archives quite along these lines.)

I have a process which writes to HBase as new data arrives.  I'd like to run
a map-reduce periodically, say daily, that takes the new items as input.  A
naive approach would use a scan which grabs all of the rows that have a
timestamp in a specified interval as the input to a MapReduce.  I tested a
scenario like that with 10s of GB of data and it seemed to perform OK.
Should I expect that approach to continue to perform reasonably well when
I have TBs of data?

From what I understand of the HBase architecture, I don't see a reason that
the scan approach would continue to perform well as the data grows.  It
seems like I may have to keep a log of modified keys and use that as the
map-reduce input, instead.
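
In code, the naive approach I have in mind looks roughly like this (a
sketch against the stock HBase MapReduce integration; the table name and
mapper are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyScanJob {

  // Receives only rows that have at least one cell in the scan's time range.
  static class NewItemMapper
      extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("daily", "newRows").increment(1);
      // ... process the new/updated row here ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-new-items");
    job.setJarByClass(DailyScanJob.class);

    long end = System.currentTimeMillis();
    long start = end - 24L * 60 * 60 * 1000;  // the previous 24 hours

    Scan scan = new Scan();        // no start/stop row: the whole key space
    scan.setTimeRange(start, end); // select cells by internal timestamp
    scan.setCaching(500);          // bigger RPC batches for MR scans

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        NewItemMapper.class, ImmutableBytesWritable.class,
        NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}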

Thanks,

Leif Wickland

Re: Performance characteristics of scans using timestamp as the filter

Posted by Jean-Daniel Cryans <jd...@apache.org>.
(super late answer, I'm cleaning up my old unread emails)

This sort of sounds like what Mozilla did for the crash reports.

The issue with your solution is that when you're looking to get only a
small portion of your whole dataset, you still have to go over the rest
of the data to reach it. So if you just need the daily data, you're
taking a pretty big hit.

Keeping a log of modified keys sounds ok, but I'm not sure how you
plan to feed the data to MR (unless you just need the key and nothing
else).
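
If you only need the keys, the log could be as simple as a second table
whose rowkey is a day prefix plus the original key, written alongside
each put. A minimal sketch (table names and key layout are made up):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ChangelogWriter {
  private static final byte[] CF = Bytes.toBytes("f");
  private static final byte[] Q  = Bytes.toBytes("v");

  // Writes the row to the main table and records its key in a changelog
  // table under a day prefix, so one day's modified keys form a single
  // contiguous (and therefore cheap) scan range.
  public static void write(Configuration conf, byte[] row, byte[] value)
      throws Exception {
    HTable data = new HTable(conf, "mytable");
    HTable log  = new HTable(conf, "mytable_changelog");
    try {
      Put p = new Put(row);
      p.add(CF, Q, value);
      data.put(p);

      String day = new SimpleDateFormat("yyyyMMdd").format(new Date());
      Put entry = new Put(Bytes.add(Bytes.toBytes(day), row));
      entry.add(CF, Q, new byte[0]);  // empty value; the key is the payload
      log.put(entry);
    } finally {
      data.close();
      log.close();
    }
  }
}

The daily job then scans just that day's prefix on the changelog table
and re-fetches the real rows from the main table as needed.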

J-D


Re: Performance characteristics of scans using timestamp as the filter

Posted by Doug Meil <do...@explorysmedical.com>.
Scans work on startRow/stopRow...

http://hbase.apache.org/book.html#scan

... you can also select by timestamp *within the startRow/stopRow
selection*, but this isn't intended to quickly select rows by timestamp
irrespective of their keys.
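
For example (a sketch; the row keys and table name are illustrative):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyRangeThenTimeRange {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    long end = System.currentTimeMillis();
    long start = end - 24L * 60 * 60 * 1000;

    // The startRow/stopRow pair does the heavy lifting; setTimeRange only
    // narrows the result *within* that key range.
    Scan scan = new Scan(Bytes.toBytes("user123-"), Bytes.toBytes("user124-"));
    scan.setTimeRange(start, end);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}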







RE: Performance characteristics of scans using timestamp as the filter

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
So, is it safe to assume that Scan queries with a TimeRange will perform well and will read only the necessary portions of the table instead of doing a full table scan?

I have run into a situation wherein I would like to find all rows that got created/updated during a time range.
I was hoping that I could do a time range scan.

Regards,
Srikanth




RE: Performance characteristics of scans using timestamp as the filter

Posted by Stuti Awasthi <st...@hcl.com>.
Yes, it's true.
Your cluster clocks should be in sync for reliable functioning.


RE: Performance characteristics of scans using timestamp as the filter

Posted by Steinmaurer Thomas <Th...@scch.at>.
Isn't synchronized time across all nodes a general requirement for
running the cluster reliably?

Regards,
Thomas


RE: Performance characteristics of scans using timestamp as the filter

Posted by Stuti Awasthi <st...@hcl.com>.
Steinmaurer,

I have done a little POC with time range scans and they worked fine for me. Another thing to note is that the time should be the same on all machines of your HBase cluster.


RE: Performance characteristics of scans using timestamp as the filter

Posted by Steinmaurer Thomas <Th...@scch.at>.
Hello,

others have stated that one shouldn't try to use timestamps, although I
haven't figured out why. If the reason is reliability, meaning that rows
are omitted even though they should be included in a timerange-based
scan, then that would be a good argument. ;-)

One thing to note is that the timestamp AFAIK changes when you update a
row even if the cell values didn't change.
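
If that implicit behavior is a problem, the client API at least lets you
pin the cell timestamp yourself when writing. A minimal sketch (names
are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExplicitTimestampPut {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    long eventTime = 1318233600000L;  // application-level event time (ms)

    Put put = new Put(Bytes.toBytes("row1"));
    // The four-argument add() pins the cell timestamp; without it the
    // region server stamps the cell with the current time on every write.
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), eventTime,
        Bytes.toBytes("value"));
    table.put(put);
    table.close();
  }
}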

Regards,
Thomas


RE: Performance characteristics of scans using timestamp as the filter

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Saurabh,

AFAIK you can also scan on the basis of a timestamp range. This can give you the data updated in that timestamp range. You do not need to keep the timestamp in your row key.


Re: Performance characteristics of scans using timestamp as the filter

Posted by Sam Seigal <se...@yahoo.com>.
Is it possible to do incremental processing efficiently without putting
the timestamp in the leading part of the row key, i.e. process data that
came within the last hour / 2 hours, etc.? I can't seem to find a good
answer to this question myself.


RE: Performance characteristics of scans using timestamp as the filter

Posted by Steinmaurer Thomas <Th...@scch.at>.
Leif,

we are pretty much in the same boat with a custom timestamp at the end of a three-part rowkey, so basically we end up reading all data when processing daily batches. Besides the performance aspects, have you found that using internal timestamps for scans etc. works reliably?

Or did you come up with another solution to your problem?

Thanks,
Thomas
