Posted to user@hbase.apache.org by James Estes <ja...@gmail.com> on 2014/09/22 19:39:45 UTC

Configuring tombstone purge independent of deleted cell purge

Could tombstone purges be made independent of purging deleted cells and of
the KEEP_DELETED_CELLS setting? In my use case, I do not want to keep deleted
cells, but I do need to keep the tombstones around. Without the tombstones,
I can't do incremental backups (ours are custom: we run time-range raw scans
ourselves).

As a rough example, if I have the following timeline for the same row key
(where t# is time):
t0 - full backup (using a time range up to t0)
t1 - PUT v1
t2 - incremental backup #1 (time range t0 up to t2)
t3 - DELETE
t4 - flush and major compaction happens
t5 - incremental backup #2 (time range t2 up to t5)
t6 - full system crash
t7 - data restored from full backup + incrementals #1 and #2

When the restore completes, the row will have been un-deleted. This is
because the incremental backup in #2 will not have the tombstone, since it
gets compacted out.
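
A minimal sketch of the timeline above, written as a toy in-memory model rather than HBase code. The names `Store`, `raw_scan`, `major_compact`, `restore`, and `run_timeline` are hypothetical, and the `keep_tombstones` flag only stands in for whatever retention behaviour is in effect. The model assumes that a major compaction drops a deleted cell, and also drops its tombstone unless the tombstone is retained.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cell:
    row: str
    ts: int
    kind: str          # "PUT" or "DELETE"
    value: str = ""


@dataclass
class Store:
    """Toy stand-in for one column of a table (not the real HBase API)."""
    cells: list = field(default_factory=list)

    def put(self, row, ts, value):
        self.cells.append(Cell(row, ts, "PUT", value))

    def delete(self, row, ts):
        # A delete writes a tombstone; it removes nothing by itself.
        self.cells.append(Cell(row, ts, "DELETE"))

    def raw_scan(self, min_ts, max_ts):
        """Return every cell, tombstones included, with min_ts <= ts < max_ts."""
        return [c for c in self.cells if min_ts <= c.ts < max_ts]

    def major_compact(self, keep_tombstones):
        """Drop PUTs masked by a tombstone; drop tombstones unless retained."""
        deleted_at = {}
        for c in self.cells:
            if c.kind == "DELETE":
                deleted_at[c.row] = max(deleted_at.get(c.row, -1), c.ts)
        survivors = []
        for c in self.cells:
            if c.kind == "PUT" and c.ts <= deleted_at.get(c.row, -1):
                continue                      # deleted cell is purged
            if c.kind == "DELETE" and not keep_tombstones:
                continue                      # tombstone is purged as well
            survivors.append(c)
        self.cells = survivors


def restore(backups):
    """Replay backed-up cells in timestamp order and return the visible rows."""
    visible = {}
    for c in sorted((c for b in backups for c in b), key=lambda c: c.ts):
        if c.kind == "PUT":
            visible[c.row] = c.value
        else:
            visible.pop(c.row, None)
    return visible


def run_timeline(keep_tombstones):
    store = Store()
    full = store.raw_scan(0, 1)               # t0: full backup (empty)
    store.put("row1", 1, "v1")                # t1: PUT v1
    inc1 = store.raw_scan(0, 2)               # t2: incremental backup #1
    store.delete("row1", 3)                   # t3: DELETE
    store.major_compact(keep_tombstones)      # t4: major compaction
    inc2 = store.raw_scan(2, 5)               # t5: incremental backup #2
    return restore([full, inc1, inc2])        # t7: restore after the crash


print(run_timeline(keep_tombstones=False))    # {'row1': 'v1'}  (un-deleted)
print(run_timeline(keep_tombstones=True))     # {}              (stays deleted)
```

In the first run, incremental #2 contains nothing for the row, so replaying the backups brings v1 back. In the second run, the retained tombstone is captured by incremental #2 and the row stays deleted after the restore.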

So in our case, I do NOT want to keep deleted cells (because I do not want
the cells to show up in time range scans users may do), but I DO want to
keep the tombstones for a configurable amount of time (much larger than our
planned incremental backup schedule) so they are captured during backup.
This would allow for the custom incremental backups to be independent of
major compactions. Without it, the backup process would have to track
compactions and take a FULL backup after every major compaction; otherwise,
any tombstone written after the last incremental backup is lost as soon as
a major compaction runs.

It seems like there could be another setting for when to purge tombstones.
Currently, there is hbase.hstore.time.to.purge.deletes for when to purge
deleted cells, but ONLY if KEEP_DELETED_CELLS is configured (which makes
sense). I'd like to propose a hbase.hstore.time.to.purge.tombstones that
could default to the same value as hbase.hstore.time.to.purge.deletes, but
would take effect regardless of the KEEP_DELETED_CELLS setting. It should
be constrained so that hbase.hstore.time.to.purge.deletes <
hbase.hstore.time.to.purge.tombstones (because we don't want tombstones
disappearing before the deleted cells).

Does this seem reasonable? Is there another approach I might take?

Thanks,

Re: Configuring tombstone purge independent of deleted cell purge

Posted by James Estes <ja...@gmail.com>.
Hah. Indeed it does. Thanks for the help.

James

On Sep 23, 2014, at 10:54 AM, Dan Di Spaltro <da...@gmail.com> wrote:

> Simple question: did you copy and paste that snippet? It has two name
> stanzas.
> 


Re: Configuring tombstone purge independent of deleted cell purge

Posted by Dan Di Spaltro <da...@gmail.com>.
Simple question: did you copy and paste that snippet? It has two name
stanzas.
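
The "two name stanzas" observation can be checked mechanically. The sketch below is a hypothetical lint helper (`find_malformed_properties` is not part of HBase or Hadoop) that parses a Hadoop-style configuration file with Python's standard `xml.etree.ElementTree` and reports any `<property>` that does not have exactly one `<name>` and one `<value>`. The first snippet is the property as posted in this thread; the second is what was presumably intended, assuming the second `<name>` was meant to be `<value>`.

```python
import xml.etree.ElementTree as ET

# The property exactly as it was posted in the thread.
AS_POSTED = """
<configuration>
  <property>
    <name>hbase.hstore.time.to.purge.deletes</name>
    <name>600000</name>
  </property>
</configuration>
"""

# Assumption: the second <name> was intended to be <value>.
PRESUMABLY_INTENDED = AS_POSTED.replace(
    "<name>600000</name>", "<value>600000</value>")


def find_malformed_properties(xml_text):
    """Return (first name or None, problem) pairs for bad <property> elements."""
    problems = []
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        names = [n.text for n in prop.findall("name")]
        values = prop.findall("value")
        label = names[0] if names else None
        if len(names) != 1:
            problems.append((label, f"expected 1 <name>, found {len(names)}"))
        if len(values) != 1:
            problems.append((label, f"expected 1 <value>, found {len(values)}"))
    return problems


for label, problem in find_malformed_properties(AS_POSTED):
    print(f"{label}: {problem}")
print(find_malformed_properties(PRESUMABLY_INTENDED))   # []
```

This only checks the shape of the file. It does not show how HBase itself handled the malformed property, although a property with no `<value>` would be consistent with the global setting appearing to have no effect.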

On Tue, Sep 23, 2014 at 9:42 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi James,
>
> Is it possible that you are impacted by
> https://issues.apache.org/jira/browse/HBASE-10118 ? Any chance to test
> with a release where HBASE-10118 is available?
>
> JM
>



-- 
Dan Di Spaltro

Re: Configuring tombstone purge independent of deleted cell purge

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi James,

Is it possible that you are impacted by
https://issues.apache.org/jira/browse/HBASE-10118 ? Any chance to test with
a release where HBASE-10118 is available?

JM

2014-09-23 12:10 GMT-04:00 James Estes <ja...@gmail.com>:

> It does sound like what I'd want (that's why I was trying to use it :) ),
> but it isn't working as described. Maybe it is a bug?
>

Re: Configuring tombstone purge independent of deleted cell purge

Posted by lars hofhansl <la...@apache.org>.
And my other problem is that I do not read all emails before I reply.
Looks like you resolved it now... All good :)



________________________________
 From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Tuesday, September 23, 2014 10:10 PM
Subject: Re: Configuring tombstone purge independent of deleted cell purge
 

Didn't read the email to the end... I seem to be doing that all the time. Sorry about that.

Hmm... That should work identically, it's read by the same mechanism. If the global setting does not work, that would be a bug.

-- Lars




Re: Configuring tombstone purge independent of deleted cell purge

Posted by lars hofhansl <la...@apache.org>.
Didn't read the email to the end... I seem to be doing that all the time. Sorry about that.

Hmm... That should work identically, it's read by the same mechanism. If the global setting does not work, that would be a bug.

-- Lars



________________________________
 From: James Estes <ja...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Tuesday, September 23, 2014 9:10 AM
Subject: Re: Configuring tombstone purge independent of deleted cell purge
 

It does sound like what I'd want (that's why I was trying to use it :) ),
but it isn't working as described. Maybe it is a bug?


Re: Configuring tombstone purge independent of deleted cell purge

Posted by James Estes <ja...@gmail.com>.
It does sound like what I'd want (that's why I was trying to use it :) ),
but it isn't working as described. Maybe it is a bug?

The behavior I'm seeing is that the delete markers are removed on major
compaction, regardless of having a hbase.hstore.time.to.purge.deletes set
in hbase-site.xml:
https://gist.github.com/housejester/2b8fbba0d05c6abbe784

I think I've found the issue now. You mentioned the setting could be
applied per CF...so I tested that way, and it works as expected. My
hbase-site.xml had:

<property>
  <name>hbase.hstore.time.to.purge.deletes</name>
  <name>600000</name>
</property>

But that doesn't seem to be applied (even with restarts, etc). Setting
hbase.hstore.time.to.purge.deletes directly on the column family does work
though:
https://gist.github.com/housejester/a81274bf74a8666fba32

Not sure why it isn't picking up from my hbase-site.xml, but I'll just
configure it on the CFs. This is on hbase-0.98.6.1-hadoop2 and
hbase-0.96.0-hadoop2 running in local mode.

Thanks Lars,
James

On Mon, Sep 22, 2014 at 11:04 PM, lars hofhansl <la...@apache.org> wrote:

> You can use the hbase.hstore.time.to.purge.deletes config option.
> You can set it globally or per Column Family.
>

Re: Configuring tombstone purge independent of deleted cell purge

Posted by lars hofhansl <la...@apache.org>.
You can use the hbase.hstore.time.to.purge.deletes config option.
You can set it globally or per Column Family.

This is the description in hbase-default.xml:
  <property>
    <name>hbase.hstore.time.to.purge.deletes</name>
    <value>0</value>
    <description>The amount of time to delay purging of delete markers with future timestamps. If 
      unset, or set to 0, all delete markers, including those with future timestamps, are purged 
      during the next major compaction. Otherwise, a delete marker is kept until the major compaction 
      which occurs after the marker's timestamp plus the value of this setting, in milliseconds.
    </description>
  </property>

That seems to be exactly what you want.
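
As a rough illustration of the rule in that description, here is a toy model
in plain Java. It is not HBase source code: the class and method names are
made up for this sketch, the one-hour spacing of t3/t4 and the 7-day window
are assumed values, and treating a compaction exactly at "timestamp plus
setting" as still keeping the marker is an assumption too.

```java
import java.time.Duration;

/** Toy model (hypothetical names, not HBase code) of the documented purge rule. */
final class TombstoneRetentionSketch {

    /**
     * Returns true if a major compaction running at compactionTimeMs would
     * keep a delete marker written at markerTimestampMs.
     */
    static boolean survivesMajorCompaction(long markerTimestampMs,
                                           long compactionTimeMs,
                                           long timeToPurgeDeletesMs) {
        if (timeToPurgeDeletesMs <= 0) {
            // Unset or 0: all delete markers go at the next major compaction.
            return false;
        }
        // Kept until a major compaction that runs after timestamp + setting.
        // Boundary handling (<=) is an assumption of this sketch.
        return compactionTimeMs <= markerTimestampMs + timeToPurgeDeletesMs;
    }

    public static void main(String[] args) {
        long hour = Duration.ofHours(1).toMillis();
        long week = Duration.ofDays(7).toMillis(); // assumed retention window

        long t3Delete = 3 * hour;     // DELETE from the timeline (assumed spacing)
        long t4Compaction = 4 * hour; // flush and major compaction

        // Default of 0: the marker is gone before incremental backup #2.
        System.out.println(survivesMajorCompaction(t3Delete, t4Compaction, 0));
        // One-week window: the marker is still there for backup #2 to capture.
        System.out.println(survivesMajorCompaction(t3Delete, t4Compaction, week));
        // A compaction after the window has elapsed purges it.
        System.out.println(survivesMajorCompaction(t3Delete, t3Delete + week + 1, week));
    }
}
```

With the values above this prints false, true, false. The value placed in
hbase.hstore.time.to.purge.deletes would be the window in milliseconds
(604800000 for the 7 days assumed here), chosen larger than the incremental
backup interval.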

-- Lars

