Posted to user@hbase.apache.org by Chao Shi <st...@live.com> on 2013/07/15 12:36:28 UTC

Delete all data before a given timestamp

Hi HBase users,

We have created an index table (say T2) for another table (say T1). The
clients that write to T1 also write an index record to T2 with the same
timestamp. Inconsistencies may accumulate over time, so we periodically run
an MR job that fully scans T1, builds an index, and bulk-loads the result
into T2.

The MR job may run for a while, and all new data written to T2 during that
period must be kept and not overwritten. So the MR job creates its puts
with the timestamp at which the job starts.

Once the index build succeeds, we want all data in T2 before a given
timestamp to become invisible to reads and to be deleted eventually (e.g.
during major compaction). For safety, we prefer setting this explicitly
rather than using the TTL feature, because we want old data to be deleted
only once the new data has been written. Does HBase currently support this
kind of operation?

Thanks,
Chao

Re: Delete all data before a given timestamp

Posted by Chao Shi <st...@live.com>.
Thanks Ted. I think it is exactly what I need :)


On Tue, Jul 16, 2013 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:

> Would this method (of Delete) serve your need ?
>
>   public Delete deleteFamily(byte [] family, long timestamp) {
> From its Javadoc:
>
>    * Delete all columns of the specified family with a timestamp less than
>
>    * or equal to the specified timestamp.
>

Re: Delete all data before a given timestamp

Posted by Chao Shi <st...@live.com>.
Yes, that also makes sense. I prefer Ted's approach as it is much
easier.


On Tue, Jul 16, 2013 at 8:59 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Another option might be to setup the proper TTL on the table? You alter the
> table to set the TTL to reflect your timestamp, the you run a compaction?
> The issue is that you have to disable the table while you alter it.
>
> JM
>

Re: Delete all data before a given timestamp

Posted by Chao Shi <st...@live.com>.
Yes, this is what we do now. We maintain a lower bound on the timestamp for
scans. Once an index build is done, we raise it to a higher value.
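
The lower-bound approach described here can be sketched roughly as follows. This is a plain Java model rather than HBase client code: the class `ScanLowerBound` and its methods are hypothetical names chosen for illustration. In a real client the bound would presumably be handed to the scan as the start of its time range (assumed here to be `Scan.setTimeRange(minStamp, maxStamp)`).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for the read-side lower bound; not part of HBase. */
final class ScanLowerBound {
    // Cells with a timestamp below this value are treated as invisible.
    private final AtomicLong minTimestamp = new AtomicLong(0L);

    /** Current lower bound to use when configuring a scan's time range. */
    long current() {
        return minTimestamp.get();
    }

    /**
     * Raise the bound after an index build has succeeded.
     * The bound never moves backwards, even if called with an older value.
     */
    long advanceTo(long jobStartTimestamp) {
        return minTimestamp.accumulateAndGet(jobStartTimestamp, Math::max);
    }

    /** True if a cell with this timestamp should be returned to readers. */
    boolean isVisible(long cellTimestamp) {
        return cellTimestamp >= minTimestamp.get();
    }

    public static void main(String[] args) {
        ScanLowerBound bound = new ScanLowerBound();
        long jobStart = 1_000L; // assumed timestamp used by the MR job's puts
        bound.advanceTo(jobStart);
        System.out.println("stale cell visible: " + bound.isVisible(999L));
        System.out.println("rebuilt cell visible: " + bound.isVisible(jobStart));
    }
}
```

Note that this only hides old cells from readers that honour the bound; it does not by itself remove them from the table.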


On Wed, Jul 17, 2013 at 2:50 AM, Jimmy Xiang <jx...@cloudera.com> wrote:

> When you set up the MR, does it help to set a proper timestamp filter or
> time range in the scan object?

Re: Delete all data before a given timestamp

Posted by Jimmy Xiang <jx...@cloudera.com>.
When you set up the MR, does it help to set a proper timestamp filter or
time range in the scan object?


On Tue, Jul 16, 2013 at 5:59 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Another option might be to setup the proper TTL on the table? You alter the
> table to set the TTL to reflect your timestamp, the you run a compaction?
> The issue is that you have to disable the table while you alter it.
>
> JM
>

Re: Delete all data before a given timestamp

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Another option might be to set up the proper TTL on the table? You alter the
table to set the TTL to reflect your timestamp, then you run a compaction?
The issue is that you have to disable the table while you alter it.

JM
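
As a rough illustration of "set the TTL to reflect your timestamp", the arithmetic might look like the sketch below. It assumes the TTL is expressed in whole seconds relative to the current time, and that a cell expires roughly once its age exceeds the TTL. The class and method names are hypothetical and no HBase API is called.

```java
/** Hypothetical helper for turning an absolute cutoff into a relative TTL. */
final class TtlFromCutoff {

    /**
     * TTL in seconds such that cells older than cutoffMillis are expired now.
     * Rounds up, so slightly more data is kept rather than slightly less.
     */
    static long ttlSeconds(long nowMillis, long cutoffMillis) {
        if (cutoffMillis > nowMillis) {
            throw new IllegalArgumentException("cutoff lies in the future");
        }
        long ageMillis = nowMillis - cutoffMillis;
        return (ageMillis + 999L) / 1000L;
    }

    /** The absolute cutoff that a given TTL implies at a given moment. */
    static long effectiveCutoffMillis(long nowMillis, long ttlSeconds) {
        return nowMillis - ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long cutoff = now - 90_500L; // assumed job start, about 90 s ago
        long ttl = ttlSeconds(now, cutoff);
        System.out.println("ttl seconds: " + ttl);
        // An hour later the same TTL implies a much later cutoff.
        System.out.println("cutoff drift (ms): "
                + (effectiveCutoffMillis(now + 3_600_000L, ttl) - cutoff));
    }
}
```

Because the TTL is relative, the implied cutoff keeps moving forward as time passes, which is the safety concern raised in the original question.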

2013/7/16 Ted Yu <yu...@gmail.com>

> Would this method (of Delete) serve your need ?
>
>   public Delete deleteFamily(byte [] family, long timestamp) {
> From its Javadoc:
>
>    * Delete all columns of the specified family with a timestamp less than
>
>    * or equal to the specified timestamp.
>

Re: Delete all data before a given timestamp

Posted by Ted Yu <yu...@gmail.com>.
Would this method (of Delete) serve your need?

  public Delete deleteFamily(byte [] family, long timestamp) {

From its Javadoc:

   * Delete all columns of the specified family with a timestamp less than
   * or equal to the specified timestamp.
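
To illustrate the "less than or equal to" semantics in that Javadoc, here is a minimal in-memory sketch. It is a toy model of a single row, not the HBase client API: `FamilyDeleteModel`, `Cell`, and `visibleAfterFamilyDelete` are hypothetical names, and the choice of a marker one tick below the job start is an assumption based on the original question, where the MR job's puts carry the job start timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of one row; this is not the HBase client API. */
final class FamilyDeleteModel {

    /** One stored cell version (hypothetical type for this sketch). */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String qualifier, long timestamp, String value) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Returns the cells a read would still see after a family delete marker
     * with the given timestamp: cells at or below the marker are masked.
     */
    static List<Cell> visibleAfterFamilyDelete(List<Cell> rowCells, long markerTimestamp) {
        List<Cell> visible = new ArrayList<>();
        for (Cell cell : rowCells) {
            if (cell.timestamp > markerTimestamp) {
                visible.add(cell);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long jobStart = 200L; // assumed timestamp used by the MR job's puts
        List<Cell> row = new ArrayList<>();
        row.add(new Cell("idx", 100L, "stale"));
        row.add(new Cell("idx", jobStart, "rebuilt"));
        row.add(new Cell("idx", 250L, "live write"));

        // A marker one tick below the job start keeps rebuilt and live cells.
        for (Cell cell : visibleAfterFamilyDelete(row, jobStart - 1)) {
            System.out.println(cell.timestamp + " " + cell.value);
        }
    }
}
```

Keep in mind that a Delete is addressed to a single row, so applying this across T2 still requires knowing the row keys involved.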

On Mon, Jul 15, 2013 at 8:07 PM, Chao Shi <st...@live.com> wrote:

> Hi Jean-Marc,
>
> Thanks for the reply.
>
> I see delete can specify a timestamp, but I don't think that is what I
> need.
> To clarify, in my scenario, I don't want to issue deletes for every key
> (because I don't know what exactly to delete unless do another full scan).
>
> I'd like to see if this is possible: set a min_timestamp to
> ColumnDescriptor. Once done, KVs before this timestamp become invisible to
> read. During major compaction, these KVs are deleted. It is the absolute
> version of TTL.
>
>
>
>
>

Re: Delete all data before a given timestamp

Posted by lars hofhansl <la...@apache.org>.
You might be interested in HBASE-8784 (https://issues.apache.org/jira/browse/HBASE-8784).



----- Original Message -----
From: Chao Shi <st...@live.com>
To: user@hbase.apache.org
Cc: 
Sent: Monday, July 15, 2013 8:07 PM
Subject: Re: Delete all data before a given timestamp

Hi Jean-Marc,

Thanks for the reply.

I see delete can specify a timestamp, but I don't think that is what I need. 
To clarify, in my scenario, I don't want to issue deletes for every key 
(because I don't know what exactly to delete unless do another full scan).

I'd like to see if this is possible: set a min_timestamp to 
ColumnDescriptor. Once done, KVs before this timestamp become invisible to 
read. During major compaction, these KVs are deleted. It is the absolute 
version of TTL.

Re: Delete all data before a given timestamp

Posted by Chao Shi <st...@live.com>.
Jean-Marc Spaggiari <je...@...> writes:

> 
> When you send a delete command to the server, you can specify a timestamp.
> So as the result of your MR job,"just" emit this delete with the specific
> timestamp to remove any previous version?
> 
> JM

Hi Jean-Marc,

Thanks for the reply.

I see that a delete can specify a timestamp, but I don't think that is what
I need. To clarify, in my scenario I don't want to issue deletes for every
key (because I don't know exactly what to delete without doing another full
scan).

I'd like to know whether this is possible: set a min_timestamp on the
ColumnDescriptor. Once it is set, KVs before this timestamp become invisible
to reads, and they are deleted during major compaction. It is an absolute
version of TTL.





Re: Delete all data before a given timestamp

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
When you send a delete command to the server, you can specify a timestamp.
So, as the result of your MR job, could you "just" emit this delete with the
specific timestamp to remove any previous versions?

JM

2013/7/15 Chao Shi <st...@live.com>

> Hi HBase users,
>
> We have created a index table (say T2) of another table (say t1). The
> clients who write to T1 also write a index record to T2 with the same
> timestamp. There may be accumulated inconsistency as time goes by. So we
> run a MR job periodically, which fully scans T1, builds a index, and
> bulk-loads the result to T2.
>
> Because the MR job may be running for a while, during the period of which,
> all new data into T2 must be kept and not be overridden. So the MR creates
> puts using the timestamp the job starts.
>
> Then we want all data in T2 before a given timestamp to invisible for read
> after the index builds successfully and get deleted eventually (e.g. during
> major compaction). We prefer setting it explicitly than using the TTL
> feature for safety, as we want only old data are deleted only when the new
> data is written. Does HBase support this kind of operation for now?
>
> Thanks,
> Chao
>