Posted to user@hbase.apache.org by David Swift <da...@charter.net> on 2010/03/29 20:28:18 UTC

Delete Range Of Rows In HBase or How To Age Out Old Data

Hi,

We're evaluating HBase and we have a case where we would want to drop on the
order of 3 billion of the oldest records out of about 500 billion at once.
We would take measures to ensure that there would be no new inserts into that
old age range during the deletion, and we would know the low and the high row
IDs in this scenario.

Looking through the HBaseAdmin, Table, and Delete classes, we don't see a way
to efficiently drop a large range of rows.  The batch delete call (a
Table.delete that takes a list of Delete objects) would require too large a
list for our needs.
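
For concreteness, the client-side approach we had in mind is roughly the
following: scan the doomed key range, collect the row keys, and issue deletes
in batches.  This is only an untested sketch; the class and method names are
from one version of the Java client and may differ in others, and the table
name, key range, and batch size are placeholders.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RangeDelete {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      byte[] low = Bytes.toBytes("row-0000000000");   // oldest row id to remove (placeholder)
      byte[] high = Bytes.toBytes("row-3000000000");  // first row id to keep (placeholder)
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("events"))) {
        // Scan only the keys in [low, high); we never need the cell values.
        Scan scan = new Scan().withStartRow(low).withStopRow(high)
            .setFilter(new KeyOnlyFilter());
        List<Delete> batch = new ArrayList<Delete>();
        try (ResultScanner scanner = table.getScanner(scan)) {
          for (Result r : scanner) {
            batch.add(new Delete(r.getRow()));  // whole-row delete
            if (batch.size() >= 10000) {        // flush deletes in chunks
              table.delete(batch);
              batch = new ArrayList<Delete>();
            }
          }
        }
        if (!batch.isEmpty()) {
          table.delete(batch);
        }
      }
    }
  }

Even batched this way, issuing a few billion individual deletes looks
expensive, which is what prompts the question below.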

Should we instead just use a different table for each set of 3 billion
records and then drop a table when its data becomes too old to be useful? 
Is there a better strategy for aging out old data in HBase?  
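
If the per-table route is the way to go, I assume the drop itself is just a
disable followed by a delete through the admin API, as in the rough sketch
below (same caveats as above: the monthly table name is made up, and the
exact admin classes and method names depend on the client version).

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class DropOldBucket {
    public static void main(String[] args) throws IOException {
      // Hypothetical per-period table holding the bucket of rows that has aged out.
      TableName old = TableName.valueOf("events_2010_02");
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        if (admin.tableExists(old)) {
          admin.disableTable(old);  // a table must be disabled before it can be deleted
          admin.deleteTable(old);   // drops the table and all of its data at once
        }
      }
    }
  }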

Thanks,
David Swift


Re: Delete Range Of Rows In HBase or How To Age Out Old Data

Posted by David Swift <da...@charter.net>.
Andrew,

As long as it cleans up the entire row when all the columns are garbage
collected, that'll be great!  I'll experiment with that approach right away.

Thanks!


Andrew Purtell-2 wrote:
> 
> Hi David,
> 
> What about setting a time to live (TTL) on the column families? You can
> add or change the 'TTL' attribute on a column family in the shell, or
> specify a time to live when creating a table. See the javadoc for
> HColumnDescriptor. A time to live is a value in seconds associated with
> the column family. Once the current time exceeds a value's timestamp +
> TTL, the value will no longer be returned in results for gets and scans,
> and will be garbage collected upon the next major compaction.
> 
> In a past project I used TTLs to age out content retrieved as part of web
> crawling after 30 days, and also to age out various metadata over shorter
> time frames depending on the type of information. In fact I contributed
> the TTL feature to enable this use case. 
> 
> Hope that helps,
> 
>    - Andy
> 
>> From: David Swift
>> Subject: Delete Range Of Rows In HBase or How To Age Out Old Data
>> 
>> We're evaluating HBase and we have a case where we would
>> want to drop on the order of about 3 billion of the oldest records
>> out of about 500 billion at once.  We would take measures to ensure
>> that there would be no new inserts into that old age range during
>> the deletion. We would know the low and the high row IDs in this
>> scenario.



Re: Delete Range Of Rows In HBase or How To Age Out Old Data

Posted by Andrew Purtell <ap...@apache.org>.
Please see inline.

> From: David Swift
> 
> Andrew,
> 
> The TimeToLive works exactly as you described.  It's
> perfect for our needs.
> 
> However, I aged out several hundred thousand rows, waited
> about 10 minutes, and then ran a compact from the HBase
> shell.  During the whole period, I ran a periodic du command
> on the Hadoop data directory.  A few minutes after
> the 2nd or 3rd compact request, my disk usage
> actually went up by 40 blocks and remained that way for an
> hour.  Perhaps this is reasonable and by design [...]

Did you run 'compact' or 'major_compact'?

A minor compaction quickly rewrites some of the files in a store into fewer files. This is done to improve read-side performance by limiting the number of flush files that can exist in a store at any given time. (When processing reads, each store file currently must be consulted.)

A major compaction rewrites _all_ files for a store into one, garbage collecting expired cells and any versions beyond max_versions.

The basic rule of thumb is that major compaction does the garbage collection. The interval at which major compactions run automatically is configurable (hbase.hregion.majorcompaction); the default is every 24 hours.
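
If you do not want to wait for the scheduled run, you can request one
yourself, either with the shell's major_compact command or through the admin
API.  A minimal sketch of the latter, assuming a Java client with these admin
class names (they vary by version) and a placeholder table name; note the
request is asynchronous, so space is not reclaimed the moment the call
returns.

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class ForceMajorCompaction {
    public static void main(String[] args) throws IOException {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        // Queue a major compaction of every region of the (placeholder) table.
        // The region servers perform the actual rewrite in the background.
        admin.majorCompact(TableName.valueOf("events"));
      }
    }
  }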

Compaction can also trigger a split. When that happens, the new daughter regions are created with references back to the parent, and the data is then copied over from the parent in the background. How long that takes depends on system load, and it means the space required to store a region can temporarily double.

Also, keep in mind that HDFS lazily cleans up after deleted block replicas.

> I'm curious if there's a page somewhere describing the
> relationship between minor and major compactions and their
> impact on actual local file system disk usage by Hadoop.

To my knowledge, no. 

   - Andy





Re: Delete Range Of Rows In HBase or How To Age Out Old Data

Posted by David Swift <da...@charter.net>.
Andrew,

The TimeToLive works exactly as you described.  It's perfect for our needs.

However, I aged out several hundred thousand rows, waited about 10 minutes,
and then ran a compact from the HBase shell.  During the whole period, I ran
a periodic du command on the Hadoop data directory.  A few minutes after
the 2nd or 3rd compact request, my disk usage actually went up by 40
blocks and remained that way for an hour.  Perhaps this is reasonable and by
design, but I'm curious if there's a page somewhere describing the
relationship between minor and major compactions and their impact on actual
local file system disk usage by Hadoop.

Thanks,
David


Andrew Purtell-2 wrote:
> 
> Hi David,
> 
> What about setting a time to live (TTL) on the column families? You can
> add or change the 'TTL' attribute on a column family in the shell, or
> specify a time to live when creating a table. See the javadoc for
> HColumnDescriptor. A time to live is a value in seconds associated with
> the column family. Once the current time exceeds a value's timestamp +
> TTL, the value will no longer be returned in results for gets and scans,
> and will be garbage collected upon the next major compaction.
> 
> In a past project I used TTLs to age out content retrieved as part of web
> crawling after 30 days, and also to age out various metadata over shorter
> time frames depending on the type of information. In fact I contributed
> the TTL feature to enable this use case. 
> 
> Hope that helps,
> 
>    - Andy
> 
>> From: David Swift
>> Subject: Delete Range Of Rows In HBase or How To Age Out Old Data
>> 
>> We're evaluating HBase and we have a case where we would
>> want to drop on the order of about 3 billion of the oldest records
>> out of about 500 billion at once.  We would take measures to ensure
>> that there would be no new inserts into that old age range during
>> the deletion. We would know the low and the high row IDs in this
>> scenario.



Re: Delete Range Of Rows In HBase or How To Age Out Old Data

Posted by Andrew Purtell <ap...@apache.org>.
Hi David,

What about setting a time to live (TTL) on the column families? You can add or change the 'TTL' attribute on a column family in the shell, or specify a time to live when creating a table. See the javadoc for HColumnDescriptor. A time to live is a value in seconds associated with the column family. Once the current time exceeds a value's timestamp + TTL, the value will no longer be returned in results for gets and scans, and will be garbage collected upon the next major compaction.
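
For example, creating a table whose single column family expires data after
30 days might look roughly like this in Java.  The table and family names
are placeholders, and the descriptor and admin class names here may differ
from one client version to another.

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class CreateTableWithTtl {
    public static void main(String[] args) throws IOException {
      // Column family whose cells expire 30 days after their timestamp.
      HColumnDescriptor family = new HColumnDescriptor("content");
      family.setTimeToLive(30 * 24 * 60 * 60);  // TTL is given in seconds
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("webcrawl"));
      desc.addFamily(family);
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        // Expired cells stop being returned by gets/scans and are purged
        // at the next major compaction.
        admin.createTable(desc);
      }
    }
  }

In the shell the same thing is just the TTL attribute on the family, again
in seconds (2592000 for 30 days).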

In a past project I used TTLs to age out content retrieved as part of web crawling after 30 days, and also to age out various metadata over shorter time frames depending on the type of information. In fact I contributed the TTL feature to enable this use case. 

Hope that helps,

   - Andy

> From: David Swift
> Subject: Delete Range Of Rows In HBase or How To Age Out Old Data
> 
> We're evaluating HBase and we have a case where we would
> want to drop on the order of about 3 billion of the oldest records
> out of about 500 billion at once.  We would take measures to ensure
> that there would be no new inserts into that old age range during
> the deletion. We would know the low and the high row IDs in this
> scenario.