Posted to dev@hbase.apache.org by lars hofhansl <lh...@yahoo.com> on 2012/01/16 04:21:40 UTC

Delete client API.

There are some confusing parts about the Delete client API:
1. calling deleteFamily removes all prior column or columns markers without checking the TS.
2. delete{Column|Columns|Family} do not use the timestamp passed to Delete at construction time, but instead default to LATEST_TIMESTAMP.

  Delete d = new Delete(R,T);
  d.deleteFamily(CF);

Does not do what you expect (won't use T for the family delete, but rather the current time).

Neither does
  d.deleteColumns(CF, C1, T2);
  d.deleteFamily(CF, T1); // T1 < T2


(the columns marker will be removed)


#1 prevents Delete from adding a family marker F for time T1 and a column/columns marker for columns of F at T2 even if T2 > T1.
#2 is just unexpected and different from what Put is doing.
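
To make the two points concrete, here is a minimal sketch in plain Java. It does not use the HBase client: ToyDelete, Marker and the "expected" flag are hypothetical stand-ins. With the flag off it mimics the behavior described above; with it on it mimics what one might expect by analogy with Put. It is a toy model under those assumptions, not the HBASE-5205 patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model only; none of these classes are the HBase client API.
class ToyDeleteDemo {
    // Sentinel for "no explicit timestamp"; per the mail it ends up as the current time.
    static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

    static final class Marker {
        final String family;
        final String qualifier; // null means a family marker
        final long ts;

        Marker(String family, String qualifier, long ts) {
            this.family = family;
            this.qualifier = qualifier;
            this.ts = ts;
        }
    }

    static final class ToyDelete {
        final long ts;          // timestamp passed at construction
        final boolean expected; // hypothetical switch: false = behavior described above
        final List<Marker> markers = new ArrayList<Marker>();

        ToyDelete(long ts, boolean expected) {
            this.ts = ts;
            this.expected = expected;
        }

        ToyDelete deleteColumns(String family, String qualifier, long markerTs) {
            markers.add(new Marker(family, qualifier, markerTs));
            return this;
        }

        ToyDelete deleteFamily(String family) {
            // Point 2: the constructor timestamp is ignored unless 'expected' is set.
            return deleteFamily(family, expected ? ts : LATEST_TIMESTAMP);
        }

        ToyDelete deleteFamily(String family, long markerTs) {
            Iterator<Marker> it = markers.iterator();
            while (it.hasNext()) {
                Marker m = it.next();
                // Point 1: the described behavior drops every earlier marker of the
                // family; the 'expected' variant drops only markers at or below markerTs.
                if (m.family.equals(family) && (!expected || m.ts <= markerTs)) {
                    it.remove();
                }
            }
            markers.add(new Marker(family, null, markerTs));
            return this;
        }
    }

    public static void main(String[] args) {
        long t = 100L, t1 = 10L, t2 = 20L;
        for (boolean expected : new boolean[] {false, true}) {
            ToyDelete a = new ToyDelete(t, expected).deleteFamily("CF");
            ToyDelete b = new ToyDelete(t, expected)
                    .deleteColumns("CF", "C1", t2)
                    .deleteFamily("CF", t1); // t1 < t2
            System.out.println("expected=" + expected
                    + " familyTs=" + a.markers.get(0).ts
                    + " markersKept=" + b.markers.size());
        }
    }
}
```

In the toy, the second case keeps both the columns marker at T2 and the family marker at T1 only when the flag is on.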

In HBASE-5205 I propose a simple patch to fix this.

Since this is a (slight) API change, please provide feedback.

Thanks.

-- Lars

Re: Delete client API.

Posted by Mikael Sitruk <mi...@gmail.com>.

@Lars - Let's say I do a future delete; is there a way to "roll back" this
delete without a major compaction?
@Karthik - memstore-ts will just be the time at which the KV arrived in the
memstore and not the real ordering of the operations.
Also, perhaps the batch should not rely on the KV timestamp for reordering
but on the total ordering of the batch.

I also think that the behavior should be more deterministic.
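
For reference, the non-determinism under discussion (Srivas's two sequences, quoted below) can be reproduced with a small toy model in plain Java. This is only a sketch: ToyRow, Cell and majorCompact are hypothetical names, a single row with one column is assumed, and the rule "a tombstone hides every put with the same or an older TS" is taken from the quoted explanation rather than from HBase source.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one row with a single column; not HBase code.
class TombstoneDemo {
    static final class Cell {
        final long ts;
        final String value; // null marks a row tombstone

        Cell(long ts, String value) {
            this.ts = ts;
            this.value = value;
        }
    }

    static final class ToyRow {
        final List<Cell> cells = new ArrayList<Cell>();

        void put(long ts, String value) { cells.add(new Cell(ts, value)); }

        void delete(long ts) { cells.add(new Cell(ts, null)); }

        // Timestamp of the newest tombstone, or Long.MIN_VALUE if there is none.
        private long newestTombstone() {
            long max = Long.MIN_VALUE;
            for (Cell c : cells) {
                if (c.value == null && c.ts > max) max = c.ts;
            }
            return max;
        }

        // Newest put that is not hidden by a tombstone, or null.
        String read() {
            long tomb = newestTombstone();
            Cell best = null;
            for (Cell c : cells) {
                if (c.value == null || c.ts <= tomb) continue; // tombstone or hidden
                if (best == null || c.ts >= best.ts) best = c; // later write wins ties
            }
            return best == null ? null : best.value;
        }

        // Drops hidden puts and then the tombstones themselves.
        void majorCompact() {
            long tomb = newestTombstone();
            List<Cell> kept = new ArrayList<Cell>();
            for (Cell c : cells) {
                if (c.value != null && c.ts > tomb) kept.add(c);
            }
            cells.clear();
            cells.addAll(kept);
        }
    }

    public static void main(String[] args) {
        ToyRow noCompaction = new ToyRow();
        noCompaction.put(6L, "val1");
        noCompaction.delete(6L);
        noCompaction.put(6L, "val2");
        System.out.println("without compaction: " + noCompaction.read());

        ToyRow withCompaction = new ToyRow();
        withCompaction.put(6L, "val1");
        withCompaction.delete(6L);
        withCompaction.majorCompact();
        withCompaction.put(6L, "val2");
        System.out.println("with compaction:    " + withCompaction.read());
    }
}
```

The same client calls give a different read depending only on whether the compaction ran in between.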

Mikael.S

On Wed, Jan 18, 2012 at 11:51 AM, M. C. Srivas <mc...@gmail.com> wrote:

> On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > The memstoreTS is used for visibility during an intra-row transaction.
> > Are you proposing to do this only if the deletes/puts did not use the
> > current time?
> >
> > The ability to define timestamps for all operations is crucial to HBase.
> > o It ensures that HTable.batch works correctly (which reorders Deletes
> > w.r.t. to Puts at the Region Server).
> > o It ensures that replication works correctly.
> > o many other scenarios
> >
> > If you do not use application defined timestamp the current time is used
> > and everything works as expected.
> > If you use application defined timestamps you are asking for a delete to
> > be either in the future or the past, and you have to understand what that
> > means.
> > Maybe we should document the behavior better.
> >
>
> I guess I am saying that I *do* understand the current "delete with TS"
> behavior, and I find the current implementation  unstable and
> non-deterministic.  Documenting it more thoroughly does not make it less
> quirky or more stable.  I propose fixing it along the lines suggested in
> option B.  Karthik seems to agree.
>
>
>
>
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Karthik Ranganathan <kr...@fb.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> > lhofhansl@yahoo.com>
> > Cc:
> > Sent: Tuesday, January 17, 2012 3:27 PM
> > Subject: Re: Delete client API.
> >
> >
> > @Srivas - totally agree that B is the correct thing to do.
> >
> > One way we have talked about implementing this is using the memstore ts.
> > Every insert of a KV into the memstore is given a memstore-ts. These are
> > persisted only till they are needed (to ensure read atomicity for
> > scanners) and then that value is zeroed out on a subsequent compaction
> > (saves space). If we retained the memstore-ts even beyond these
> > compactions, we could get a deterministic order for the puts and deletes
> > (first insert ts < del ts < second insert ts).
> >
> > Thanks
> > Karthik
> >
> >
> > On 1/17/12 2:14 PM, "M. C. Srivas" <mc...@gmail.com> wrote:
> >
> > >On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com>
> > >wrote:
> > >
> > >> Yeah, it's confusing if one expects it to work like in a relational
> > >> database.
> > >> You can even do worse. If you by accident place a delete in the future
> > >> all current inserts will be hidden until the next major compaction. :)
> > >> I got confused about this myself just recently (see my mail on the
> > >> dev-list).
> > >>
> > >>
> > >> In the end this is a pretty powerful feature and core to how HBase
> > >> works (not saying that is not confusing though).
> > >>
> > >>
> > >> If one keeps the following two points in mind it makes more sense:
> > >> 1. Delete just sets a tomb stone marker at a specific TS (marking
> > >> everything older as deleted).
> > >> 2. Everything is versioned, if no version is specified the current
> time
> > >> (at the regionserver) is used.
> > >>
> > >> In your example1 below t3 > 6, hence the insert is hidden.
> > >> In example2 both delete and insert TS are 6, hence the insert is
> hidden.
> > >>
> > >
> > >Lets consider my example2 for a little longer. Sequence of events
> > >
> > >   1.  ins  val1  with TS=6 set by client
> > >   2.  del  entire row at TS=6 set by client
> > >   3.  ins  val2  with TS=6  set by client
> > >   4.  read row
> > >
> > >The row returns nothing even though the insert at step 3 happened after
> > >the
> > >delete at step 2. (step 2 masks even future inserts)
> > >
> > >Now, the same sequence with a compaction thrown in the middle:
> > >
> > >   1.  ins  val1  with TS=6 set by client
> > >   2.  del  entire row at TS=6 set by client
> > >   3.  ---- table is compacted -----
> > >   4.  ins  val2  with TS=6  set by client
> > >   5.  read row
> > >
> > >The row returns val2.  (the delete at step2 got lost due to compaction).
> > >
> > >So we have different results depending upon whether an internal
> > >re-organization (like a compaction) happened or not. If we want both
> > >sequences to behave exactly the same, then we need to first choose what
> is
> > >the proper (and deterministic) behavior.
> > >
> > >A.  if we think that the first sequence is the correct one, then the
> > >delete
> > >at step 2 needs to be preserved forever.
> > >
> > >or,
> > >
> > >B. if we think that the second sequence is the correct behavior (ie, a
> > >read
> > >always produces the same results independent of compaction), then the
> > >record needs a second "internal TS" field to allow the RS to distinguish
> > >the real sequence of events, and not rely upon the TS field which is
> > >settable by the client.
> > >
> > >My opinion:
> > >
> > >We should do B.  It is normal for someone to write code that says  "if
> old
> > >exists, delete it;  add new". A subsequent read should always reliably
> > >return "new".
> > >
> > >The current way of relying on a client-settable TS field to determine
> > >causal order results in quirky behavior, and quirky is not good.
> > >
> > >
> > >
> > >> Look at these two examples:
> > >>
> > >> 1. insert Val1  at real time t1
> > >> 2. <del>  at real time t2 > t1
> > >> 3. insert  Val2 at real time  t3 > t2
> > >>
> > >> 1. insert Val1  with TS=1 at real time t1
> > >> 2. <del>  with TS = 2 at real time t2 > t1
> > >>
> > >> 3. insert  Val2 with TS = 3 at real time  t3 > t2
> > >>
> > >>
> > >> In both cases Val2 is visible.
> > >>
> > >> If your code sets your own timestamps, you better know what you're
> > >> doing :)
> > >>
> > >> Note that my examples below are confusing even if you know how
> deletion
> > >>in
> > >> HBase works.
> > >> You have to look at Delete.java to figure out what is happening.
> > >> OK, since there were no objections in two days, I will commit my
> > >> proposed change in HBASE-5205.
> > >>
> > >>
> > >> -- Lars
> > >>
> > >> ________________________________
> > >> From: M. C. Srivas <mc...@gmail.com>
> > >> To: dev@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > >> Sent: Tuesday, January 17, 2012 8:13 AM
> > >> Subject: Re: Delete client API.
> > >>
> > >>
> > >> Delete seems to be confusing in general. Here are some examples that
> > >>make
> > >> me scratch my head (key is same in all the examples):
> > >>
> > >> Example1:
> > >> ----------------
> > >> 1. insert Val3  with TS=3  at real time t1
> > >> 2. insert Val5  with TS=5  at real time t2 > t1
> > >> 3. <del>    at real time t3 > t2
> > >> 4. insert  Val6  with TS=6  at real time  t4 > t3
> > >>
> > >> What does a read return?  (I would expect  Val6, since it was done
> > >>last).
> > >> But depending upon whether compaction happened or not between steps 3
> > >>and
> > >> 4, I get either Val6 or  nothing.
> > >>
> > >> Example 2:
> > >> -----------------
> > >> 1. insert Val3  with TS=3  at real time t1
> > >> 2. insert Val5  with TS=5  at real time t2 > t1
> > >> 3. <del>  TS=6  at real time t3 > t2
> > >> 4. insert  Val6  with TS=6  at real time  t4 > t3
> > >>
> > >> Note the difference in step 3 is this time a TS was specified by the
> > >> client.
> > >>
> > >> What does a read return?  Again, I expect Val6 to be returned. But
> > >> depending upon what's going on, I seem to get either Val5 or Val6.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com>
> > >> wrote:
> > >>
> > >> There are some confusing parts about the Delete client API:
> > >> >1. calling deleteFamily removes all prior column or columns markers
> > >> without checking the TS.
> > >> >2. delete{Column|Columns|Family} do not use the timestamp passed to
> > >> Delete at construction time, but instead default to LATEST_TIMESTAMP.
> > >> >
> > >> >  Delete d = new Delete(R,T);
> > >> >  d.deleteFamily(CF);
> > >> >
> > >> >Does not do what you expect (won't use T for the family delete, but
> > >> rather the current time).
> > >> >
> > >> >Neither does
> > >> >  d.deleteColumns(CF, C1, T2);
> > >> >  d.deleteFamily(CF, T1); // T1 < T2
> > >> >
> > >> >
> > >> >(the columns marker will be removed)
> > >> >
> > >> >
> > >> >#1 prevents Delete from adding a family marker F for time T1 and a
> > >> column/columns marker for columns of F at T2 even if T2 > T1.
> > >> >#2 is just unexpected and different from what Put is doing.
> > >> >
> > >> >In HBASE-5205 I propose a simple patch to fix this.
> > >> >
> > >> >Since this is a (slight) API change, please provide feed back.
> > >> >
> > >> >Thanks.
> > >> >
> > >> >-- Lars
> > >> >
> > >> >
> > >>
> >
>



-- 
Mikael.S

Re: Delete client API.

Posted by Mikael Sitruk <mi...@gmail.com>.

Karthik, let me explain my scenario. Let's say you have a client (not
HBase) that writes records at a very high rate into a buffer and uses the
row TS to indicate the time at which the operation was done (i.e. the row
ordering). Worker threads then take records from the buffer in bulk and
send them to HBase. Records with the same key may now arrive in a different
order than the one in which they were put into the buffer, and the RS
ordering will be incorrect.
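
A small plain-Java sketch of that scenario, with hypothetical names (Record, latestByClientTs, latestByArrival) and no HBase involved. It only contrasts which value "wins" for a key when the winner is chosen by the client TS versus by arrival order at the server.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration; the names here are made up for this example.
class ArrivalOrderDemo {
    static final class Record {
        final String key;
        final long clientTs; // set by the producer when the record enters the buffer
        final String value;

        Record(String key, long clientTs, String value) {
            this.key = key;
            this.clientTs = clientTs;
            this.value = value;
        }
    }

    // Winner if the store orders cells by the client-supplied timestamp.
    static String latestByClientTs(List<Record> arrivals) {
        Record best = null;
        for (Record r : arrivals) {
            if (best == null || r.clientTs > best.clientTs) best = r;
        }
        return best == null ? null : best.value;
    }

    // Winner if the store orders cells by arrival: the last one to arrive wins.
    static String latestByArrival(List<Record> arrivals) {
        return arrivals.isEmpty() ? null : arrivals.get(arrivals.size() - 1).value;
    }

    public static void main(String[] args) {
        Record first = new Record("row1", 1L, "old");
        Record second = new Record("row1", 2L, "new");

        // Two workers drain the buffer; the batch holding 'second' arrives first.
        List<Record> arrivals = Arrays.asList(second, first);

        System.out.println("by client TS: " + latestByClientTs(arrivals));
        System.out.println("by arrival:   " + latestByArrival(arrivals));
    }
}
```

Ordering by client TS still yields "new", while ordering by arrival yields "old", which is the concern with a server-assigned ordering in this setup.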
Mikael.s
On Jan 18, 2012 10:05 PM, "Karthik Ranganathan" <kr...@fb.com> wrote:

>
> @Mikael:
>
> << memstore-ts will just be the time at which kv arrived into
> memstore and not the real ordering of the operation.>>
>
> By virtue of the fact that the regionserver is the only entity writing
> updates for a given key (each key belongs to exactly one RS) this will
> also be a global ordering of the operations for that key. Of course we
> need to make the memstore-ts a logical number instead of the time on that
> RS if we want to account for clock skew between RS's when a region fails
> over.
>
> @Ian: you are correct about the reasoning that "if there were no client
> specified timestamps then this issue would never exist". But for the
> better or for the worse, Hbase overloads timestamps and versions. So there
> is no inherent way to achieve versioning other than by using timestamps.
>
> Also, I don't understand what overhead it would add to each call? There is
> a memstore-ts maintained anyways...
>
> Thanks
> Karthik
>
>
>
> On 1/18/12 6:38 AM, "Ian Varley" <iv...@salesforce.com> wrote:
>
> >M.C., why would option B be superior to simply letting the native
> >timestamps in HBase do what they were meant to do, and then storing your
> >app-level logical timestamps in the cell itself along with the data? The
> >(admittedly more correct) behavior you want is already the normal
> >behavior when you're not setting application-defined timestamps.
> >
> >In other words: HBase already has a timestamp that behaves as you
> >describe, and only when you intentionally use it for another purpose does
> >the behavior become non-intuitive. And, other things will become
> >non-intuitive too, like replication.
> >
> >In the FB messaging case, if I'm not mistaken, the official timestamp
> >value is in use for something that isn't a timestamp at all (message ids,
> >or something along those lines). So in that case, it would make sense
> >that you'd want to also have another timestamp. I'm tempted to assert
> >that that's an unusual use of the timestamp field, but then again, if the
> >biggest use case of a product does something, it's hardly "unusual". :)
> >
> >At the very least, since it would add overhead to every cell, this should
> >be an opt-in behavior (the ability to say, "I'm setting my own
> >timestamps, so HBase should also keep its own real timestamp"). But then
> >again, what's the argument for doing that rather than storing the
> >timestamps in your cell value? Is it the added abilities the API gives
> >you around time ranges?
> >
> >Ian
> >
> >On Jan 18, 2012, at 1:51 AM, M. C. Srivas wrote:
> >
> >On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl
> ><lh...@yahoo.com>> wrote:
> >
> >The memstoreTS is used for visibility during an intra-row transaction.
> >Are you proposing to do this only if the deletes/puts did not use the
> >current time?
> >
> >The ability to define timestamps for all operations is crucial to HBase.
> >o It ensures that HTable.batch works correctly (which reorders Deletes
> >w.r.t. to Puts at the Region Server).
> >o It ensures that replication works correctly.
> >o many other scenarios
> >
> >If you do not use application defined timestamp the current time is used
> >and everything works as expected.
> >If you use application defined timestamps you are asking for a delete to
> >be either in the future or the past, and you have to understand what that
> >means.
> >Maybe we should document the behavior better.
> >
> >
> >I guess I am saying that I *do* understand the current "delete with TS"
> >behavior, and I find the current implementation  unstable and
> >non-deterministic.  Documenting it more thoroughly does not make it less
> >quirky or more stable.  I propose fixing it along the lines suggested in
> >option B.  Karthik seems to agree.
> >
> >
> >
> >
> >
> >-- Lars
> >
> >
> >----- Original Message -----
> >From: Karthik Ranganathan
> ><kr...@fb.com>>
> >To: "dev@hbase.apache.org<ma...@hbase.apache.org>"
> ><de...@hbase.apache.org>>; lars hofhansl <
> >lhofhansl@yahoo.com<ma...@yahoo.com>>
> >Cc:
> >Sent: Tuesday, January 17, 2012 3:27 PM
> >Subject: Re: Delete client API.
> >
> >
> >@Srivas - totally agree that B is the correct thing to do.
> >
> >One way we have talked about implementing this is using the memstore ts.
> >Every insert of a KV into the memstore is given a memstore-ts. These are
> >persisted only till they are needed (to ensure read atomicity for
> >scanners) and then that value is zeroed out on a subsequent compaction
> >(saves space). If we retained the memstore-ts even beyond these
> >compactions, we could get a deterministic order for the puts and deletes
> >(first insert ts < del ts < second insert ts).
> >
> >Thanks
> >Karthik
> >
> >
> >On 1/17/12 2:14 PM, "M. C. Srivas"
> ><mc...@gmail.com>> wrote:
> >
> >On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl
> ><lh...@yahoo.com>>
> >wrote:
> >
> >Yeah, it's confusing if one expects it to work like in a relational
> >database.
> >You can even do worse. If you by accident place a delete in the future
> >all
> >current inserts will be hidden until the next major compaction. :)
> >I got confused about this myself just recently (see my mail on the
> >dev-list).
> >
> >
> >In the end this is a pretty powerful feature and core to how HBase works
> >(not saying that is not confusing though).
> >
> >
> >If one keeps the following two points in mind it makes more sense:
> >1. Delete just sets a tomb stone marker at a specific TS (marking
> >everything older as deleted).
> >2. Everything is versioned, if no version is specified the current time
> >(at the regionserver) is used.
> >
> >In your example1 below t3 > 6, hence the insert is hidden.
> >In example2 both delete and insert TS are 6, hence the insert is hidden.
> >
> >
> >Lets consider my example2 for a little longer. Sequence of events
> >
> > 1.  ins  val1  with TS=6 set by client
> > 2.  del  entire row at TS=6 set by client
> > 3.  ins  val2  with TS=6  set by client
> > 4.  read row
> >
> >The row returns nothing even though the insert at step 3 happened after
> >the
> >delete at step 2. (step 2 masks even future inserts)
> >
> >Now, the same sequence with a compaction thrown in the middle:
> >
> > 1.  ins  val1  with TS=6 set by client
> > 2.  del  entire row at TS=6 set by client
> > 3.  ---- table is compacted -----
> > 4.  ins  val2  with TS=6  set by client
> > 5.  read row
> >
> >The row returns val2.  (the delete at step2 got lost due to compaction).
> >
> >So we have different results depending upon whether an internal
> >re-organization (like a compaction) happened or not. If we want both
> >sequences to behave exactly the same, then we need to first choose what is
> >the proper (and deterministic) behavior.
> >
> >A.  if we think that the first sequence is the correct one, then the
> >delete
> >at step 2 needs to be preserved forever.
> >
> >or,
> >
> >B. if we think that the second sequence is the correct behavior (ie, a
> >read
> >always produces the same results independent of compaction), then the
> >record needs a second "internal TS" field to allow the RS to distinguish
> >the real sequence of events, and not rely upon the TS field which is
> >settable by the client.
> >
> >My opinion:
> >
> >We should do B.  It is normal for someone to write code that says  "if old
> >exists, delete it;  add new". A subsequent read should always reliably
> >return "new".
> >
> >The current way of relying on a client-settable TS field to determine
> >causal order results in quirky behavior, and quirky is not good.
> >
> >
> >
> >Look at these two examples:
> >
> >1. insert Val1  at real time t1
> >2. <del>  at real time t2 > t1
> >3. insert  Val2 at real time  t3 > t2
> >
> >1. insert Val1  with TS=1 at real time t1
> >2. <del>  with TS = 2 at real time t2 > t1
> >
> >3. insert  Val2 with TS = 3 at real time  t3 > t2
> >
> >
> >In both cases Val2 is visible.
> >
> >If your code sets your own timestamps, you better know what you're
> >doing :)
> >
> >Note that my examples below are confusing even if you know how deletion
> >in
> >HBase works.
> >You have to look at Delete.java to figure out what is happening.
> >OK, since there were no objections in two days, I will commit my
> >proposed change in HBASE-5205.
> >
> >
> >-- Lars
> >
> >________________________________
> >From: M. C. Srivas <mc...@gmail.com>>
> >To: dev@hbase.apache.org<ma...@hbase.apache.org>; lars hofhansl
> ><lh...@yahoo.com>>
> >Sent: Tuesday, January 17, 2012 8:13 AM
> >Subject: Re: Delete client API.
> >
> >
> >Delete seems to be confusing in general. Here are some examples that
> >make
> >me scratch my head (key is same in all the examples):
> >
> >Example1:
> >----------------
> >1. insert Val3  with TS=3  at real time t1
> >2. insert Val5  with TS=5  at real time t2 > t1
> >3. <del>    at real time t3 > t2
> >4. insert  Val6  with TS=6  at real time  t4 > t3
> >
> >What does a read return?  (I would expect  Val6, since it was done
> >last).
> >But depending upon whether compaction happened or not between steps 3
> >and
> >4, I get either Val6 or  nothing.
> >
> >Example 2:
> >-----------------
> >1. insert Val3  with TS=3  at real time t1
> >2. insert Val5  with TS=5  at real time t2 > t1
> >3. <del>  TS=6  at real time t3 > t2
> >4. insert  Val6  with TS=6  at real time  t4 > t3
> >
> >Note the difference in step 3 is this time a TS was specified by the
> >client.
> >
> >What does a read return?  Again, I expect Val6 to be returned. But
> >depending upon what's going on, I seem to get either Val5 or Val6.
> >
> >
> >
> >
> >
> >On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl
> ><lh...@yahoo.com>>
> >wrote:
> >
> >There are some confusing parts about the Delete client API:
> >1. calling deleteFamily removes all prior column or columns markers
> >without checking the TS.
> >2. delete{Column|Columns|Family} do not use the timestamp passed to
> >Delete at construction time, but instead default to LATEST_TIMESTAMP.
> >
> >Delete d = new Delete(R,T);
> >d.deleteFamily(CF);
> >
> >Does not do what you expect (won't use T for the family delete, but
> >rather the current time).
> >
> >Neither does
> >d.deleteColumns(CF, C1, T2);
> >d.deleteFamily(CF, T1); // T1 < T2
> >
> >
> >(the columns marker will be removed)
> >
> >
> >#1 prevents Delete from adding a family marker F for time T1 and a
> >column/columns marker for columns of F at T2 even if T2 > T1.
> >#2 is just unexpected and different from what Put is doing.
> >
> >In HBASE-5205 I propose a simple patch to fix this.
> >
> >Since this is a (slight) API change, please provide feed back.
> >
> >Thanks.
> >
> >-- Lars
> >
> >
> >
> >
> >
>
>

Re: Delete client API.

Posted by "M. C. Srivas" <mc...@gmail.com>.

On Wed, Jan 18, 2012 at 6:38 AM, Ian Varley <iv...@salesforce.com> wrote:

> M.C., why would option B be superior to simply letting the native
> timestamps in HBase do what they were meant to do, and then storing your
> app-level logical timestamps in the cell itself along with the data? The
> (admittedly more correct) behavior you want is already the normal behavior
> when you're not setting application-defined timestamps.
>

The application-defined timestamp is exactly that .... a feature provided
by HBase for the application to use freely as it deems fit (as a
time-stamp, or as a version number, or as anything else). I am merely
pointing out that when an app does use it, it behaves in a
non-deterministic way.


>
> In other words: HBase already has a timestamp that behaves as you
> describe, and only when you intentionally use it for another purpose does
> the behavior become non-intuitive. And, other things will become
> non-intuitive too, like replication.
>

Why is that?  Can you explain?



>
> In the FB messaging case, if I'm not mistaken, the official timestamp
> value is in use for something that isn't a timestamp at all (message ids,
> or something along those lines). So in that case, it would make sense that
> you'd want to also have another timestamp. I'm tempted to assert that
> that's an unusual use of the timestamp field, but then again, if the
> biggest use case of a product does something, it's hardly "unusual". :)
>

I cannot speak for FB.


>
> At the very least, since it would add overhead to every cell, this should
> be an opt-in behavior (the ability to say, "I'm setting my own timestamps,
> so HBase should also keep its own real timestamp"). But then again, what's
> the argument for doing that rather than storing the timestamps in your cell
> value? Is it the added abilities the API gives you around time ranges?
>

If one were to use HBase with multiple clients updating the same row, then
a possible "locking" mechanism is to use the timestamps as versions for
each cell. The app can determine on its own which version to preserve and
delete the rest of the versions. But to do so, the behavior of delete w/ TS
must be predictable.
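
As a rough illustration of that pattern (plain Java, hypothetical names, no HBase calls): each client writes under its own version number, the app picks a winner, and every other version is removed with an exact-version delete. The sketch assumes such a delete removes exactly the targeted version and nothing else, which is the predictability being asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of one cell's versions keyed by the client-chosen version number.
class VersionPickDemo {
    // Keeps only 'winner' by issuing one exact-version delete per other version.
    static void keepOnly(TreeMap<Long, String> versions, long winner) {
        List<Long> losers = new ArrayList<Long>();
        for (Long v : versions.keySet()) {
            if (v.longValue() != winner) losers.add(v);
        }
        for (Long v : losers) {
            versions.remove(v); // stands in for a delete at exactly this version
        }
    }

    public static void main(String[] args) {
        TreeMap<Long, String> versions = new TreeMap<Long, String>();
        versions.put(1L, "from client A");
        versions.put(2L, "from client B");
        versions.put(3L, "from client C");

        keepOnly(versions, 2L);
        System.out.println(versions);
    }
}
```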



>
> Ian
>
> On Jan 18, 2012, at 1:51 AM, M. C. Srivas wrote:
>
> On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <lhofhansl@yahoo.com
> <ma...@yahoo.com>> wrote:
>
> The memstoreTS is used for visibility during an intra-row transaction.
> Are you proposing to do this only if the deletes/puts did not use the
> current time?
>
> The ability to define timestamps for all operations is crucial to HBase.
> o It ensures that HTable.batch works correctly (which reorders Deletes
> w.r.t. to Puts at the Region Server).
> o It ensures that replication works correctly.
> o many other scenarios
>
> If you do not use application defined timestamp the current time is used
> and everything works as expected.
> If you use application defined timestamps you are asking for a delete to
> be either in the future or the past, and you have to understand what that
> means.
> Maybe we should document the behavior better.
>
>
> I guess I am saying that I *do* understand the current "delete with TS"
> behavior, and I find the current implementation  unstable and
> non-deterministic.  Documenting it more thoroughly does not make it less
> quirky or more stable.  I propose fixing it along the lines suggested in
> option B.  Karthik seems to agree.
>
>
>
>
>
> -- Lars
>
>
> ----- Original Message -----
> From: Karthik Ranganathan <kranganathan@fb.com<mailto:kranganathan@fb.com
> >>
> To: "dev@hbase.apache.org<ma...@hbase.apache.org>" <
> dev@hbase.apache.org<ma...@hbase.apache.org>>; lars hofhansl <
> lhofhansl@yahoo.com<ma...@yahoo.com>>
> Cc:
> Sent: Tuesday, January 17, 2012 3:27 PM
> Subject: Re: Delete client API.
>
>
> @Srivas - totally agree that B is the correct thing to do.
>
> One way we have talked about implementing this is using the memstore ts.
> Every insert of a KV into the memstore is given a memstore-ts. These are
> persisted only till they are needed (to ensure read atomicity for
> scanners) and then that value is zeroed out on a subsequent compaction
> (saves space). If we retained the memstore-ts even beyond these
> compactions, we could get a deterministic order for the puts and deletes
> (first insert ts < del ts < second insert ts).
>
> Thanks
> Karthik
>
>
> On 1/17/12 2:14 PM, "M. C. Srivas" <mcsrivas@gmail.com<mailto:
> mcsrivas@gmail.com>> wrote:
>
> On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lhofhansl@yahoo.com
> <ma...@yahoo.com>>
> wrote:
>
> Yeah, it's confusing if one expects it to work like in a relational
> database.
> You can even do worse. If you by accident place a delete in the future
> all
> current inserts will be hidden until the next major compaction. :)
> I got confused about this myself just recently (see my mail on the
> dev-list).
>
>
> In the end this is a pretty powerful feature and core to how HBase works
> (not saying that is not confusing though).
>
>
> If one keeps the following two points in mind it makes more sense:
> 1. Delete just sets a tomb stone marker at a specific TS (marking
> everything older as deleted).
> 2. Everything is versioned, if no version is specified the current time
> (at the regionserver) is used.
>
> In your example1 below t3 > 6, hence the insert is hidden.
> In example2 both delete and insert TS are 6, hence the insert is hidden.
>
>
> Lets consider my example2 for a little longer. Sequence of events
>
>  1.  ins  val1  with TS=6 set by client
>  2.  del  entire row at TS=6 set by client
>  3.  ins  val2  with TS=6  set by client
>  4.  read row
>
> The row returns nothing even though the insert at step 3 happened after
> the
> delete at step 2. (step 2 masks even future inserts)
>
> Now, the same sequence with a compaction thrown in the middle:
>
>  1.  ins  val1  with TS=6 set by client
>  2.  del  entire row at TS=6 set by client
>  3.  ---- table is compacted -----
>  4.  ins  val2  with TS=6  set by client
>  5.  read row
>
> The row returns val2.  (the delete at step2 got lost due to compaction).
>
> So we have different results depending upon whether an internal
> re-organization (like a compaction) happened or not. If we want both
> sequences to behave exactly the same, then we need to first choose what is
> the proper (and deterministic) behavior.
>
> A.  if we think that the first sequence is the correct one, then the
> delete
> at step 2 needs to be preserved forever.
>
> or,
>
> B. if we think that the second sequence is the correct behavior (ie, a
> read
> always produces the same results independent of compaction), then the
> record needs a second "internal TS" field to allow the RS to distinguish
> the real sequence of events, and not rely upon the TS field which is
> settable by the client.
>
> My opinion:
>
> We should do B.  It is normal for someone to write code that says  "if old
> exists, delete it;  add new". A subsequent read should always reliably
> return "new".
>
> The current way of relying on a client-settable TS field to determine
> causal order results in quirky behavior, and quirky is not good.
>
>
>
> Look at these two examples:
>
> 1. insert Val1  at real time t1
> 2. <del>  at real time t2 > t1
> 3. insert  Val2 at real time  t3 > t2
>
> 1. insert Val1  with TS=1 at real time t1
> 2. <del>  with TS = 2 at real time t2 > t1
>
> 3. insert  Val2 with TS = 3 at real time  t3 > t2
>
>
> In both cases Val2 is visible.
>
> If your code sets your own timestamps, you better know what you're
> doing :)
>
> Note that my examples below are confusing even if you know how deletion
> in
> HBase works.
> You have to look at Delete.java to figure out what is happening.
> OK, since there were no objections in two days, I will commit my
> proposed change in HBASE-5205.
>
>
> -- Lars
>
> ________________________________
> From: M. C. Srivas <mc...@gmail.com>>
> To: dev@hbase.apache.org<ma...@hbase.apache.org>; lars hofhansl <
> lhofhansl@yahoo.com<ma...@yahoo.com>>
> Sent: Tuesday, January 17, 2012 8:13 AM
> Subject: Re: Delete client API.
>
>
> Delete seems to be confusing in general. Here are some examples that
> make
> me scratch my head (key is same in all the examples):
>
> Example1:
> ----------------
> 1. insert Val3  with TS=3  at real time t1
> 2. insert Val5  with TS=5  at real time t2 > t1
> 3. <del>    at real time t3 > t2
> 4. insert  Val6  with TS=6  at real time  t4 > t3
>
> What does a read return?  (I would expect  Val6, since it was done
> last).
> But depending upon whether compaction happened or not between steps 3
> and
> 4, I get either Val6 or  nothing.
>
> Example 2:
> -----------------
> 1. insert Val3  with TS=3  at real time t1
> 2. insert Val5  with TS=5  at real time t2 > t1
> 3. <del>  TS=6  at real time t3 > t2
> 4. insert  Val6  with TS=6  at real time  t4 > t3
>
> Note the difference in step 3 is this time a TS was specified by the
> client.
>
> What does a read return?  Again, I expect Val6 to be returned. But
> depending upon what's going on, I seem to get either Val5 or Val6.
>
>
>
>
>
> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lhofhansl@yahoo.com
> <ma...@yahoo.com>>
> wrote:
>
> There are some confusing parts about the Delete client API:
> 1. calling deleteFamily removes all prior column or columns markers
> without checking the TS.
> 2. delete{Column|Columns|Family} do not use the timestamp passed to
> Delete at construction time, but instead default to LATEST_TIMESTAMP.
>
> Delete d = new Delete(R,T);
> d.deleteFamily(CF);
>
> Does not do what you expect (won't use T for the family delete, but
> rather the current time).
>
> Neither does
> d.deleteColumns(CF, C1, T2);
> d.deleteFamily(CF, T1); // T1 < T2
>
>
> (the columns marker will be removed)
>
>
> #1 prevents Delete from adding a family marker F for time T1 and a
> column/columns marker for columns of F at T2 even if T2 > T1.
> #2 is just unexpected and different from what Put is doing.
>
> In HBASE-5205 I propose a simple patch to fix this.
>
> Since this is a (slight) API change, please provide feed back.
>
> Thanks.
>
> -- Lars
>
>
>
>
>
>

Re: Delete client API.

Posted by Karthik Ranganathan <kr...@fb.com>.

@Mikael:

<< memstore-ts will just be the time at which kv arrived into
memstore and not the real ordering of the operation.>>

By virtue of the fact that the regionserver is the only entity writing
updates for a given key (each key belongs to exactly one RS) this will
also be a global ordering of the operations for that key. Of course we
need to make the memstore-ts a logical number instead of the time on that
RS if we want to account for clock skew between RS's when a region fails
over.
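
To make the failover point concrete, here is a minimal sketch of a logical sequence that does not depend on any server's wall clock. The class name LogicalSequence and the idea of seeding it from the highest persisted value are assumptions for illustration only; this is not how the memstore-ts is implemented in HBase.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical logical counter; illustrative only, not HBase code. */
class LogicalSequence {
    private final AtomicLong last;

    /** Seed with the highest sequence number already persisted for the region. */
    LogicalSequence(long highestPersisted) {
        this.last = new AtomicLong(highestPersisted);
    }

    /** Returns a number strictly greater than every number handed out before. */
    long next() {
        return last.incrementAndGet();
    }

    public static void main(String[] args) {
        LogicalSequence oldServer = new LogicalSequence(0);
        long put = oldServer.next();
        long del = oldServer.next();
        // After a failover the new server resumes from the last persisted value,
        // so ordering is preserved even if its wall clock is behind.
        LogicalSequence newServer = new LogicalSequence(del);
        long secondPut = newServer.next();
        System.out.println(put < del && del < secondPut);
    }
}
```

Because the counter is seeded from persisted state rather than from the clock, skew between region servers cannot reorder operations on a key.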

@Ian: you are correct about the reasoning that "if there were no client
specified timestamps then this issue would never exist". But for better
or for worse, HBase overloads timestamps and versions. So there
is no inherent way to achieve versioning other than by using timestamps.

Also, I don't understand what overhead it would add to each call? There is
a memstore-ts maintained anyways...

Thanks
Karthik
 


On 1/18/12 6:38 AM, "Ian Varley" <iv...@salesforce.com> wrote:

>M.C., why would option B be superior to simply letting the native
>timestamps in HBase do what they were meant to do, and then storing your
>app-level logical timestamps in the cell itself along with the data? The
>(admittedly more correct) behavior you want is already the normal
>behavior when you're not setting application-defined timestamps.
>
>In other words: HBase already has a timestamp that behaves as you
>describe, and only when you intentionally use it for another purpose does
>the behavior become non-intuitive. And, other things will become
>non-intuitive too, like replication.
>
>In the FB messaging case, if I'm not mistaken, the official timestamp
>value is in use for something that isn't a timestamp at all (message ids,
>or something along those lines). So in that case, it would make sense
>that you'd want to also have another timestamp. I'm tempted to assert
>that that's an unusual use of the timestamp field, but then again, if the
>biggest use case of a product does something, it's hardly "unusual". :)
>
>At the very least, since it would add overhead to every cell, this should
>be an opt-in behavior (the ability to say, "I'm setting my own
>timestamps, so HBase should also keep its own real timestamp"). But then
>again, what's the argument for doing that rather than storing the
>timestamps in your cell value? Is it the added abilities the API gives
>you around time ranges?
>
>Ian
>
>On Jan 18, 2012, at 1:51 AM, M. C. Srivas wrote:
>
>On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl
><lh...@yahoo.com>> wrote:
>
>The memstoreTS is used for visibility during an intra-row transaction.
>Are you proposing to do this only if the deletes/puts did not use the
>current time?
>
>The ability to define timestamps for all operations is crucial to HBase.
>o It ensures that HTable.batch works correctly (which reorders Deletes
>w.r.t. to Puts at the Region Server).
>o It ensures that replication works correctly.
>o many other scenarios
>
>If you do not use application defined timestamp the current time is used
>and everything works as expected.
>If you use application defined timestamps you are asking for a delete to
>be either in the future or the past, and you have to understand what that
>means.
>Maybe we should document the behavior better.
>
>
>I guess I am saying that I *do* understand the current "delete with TS"
>behavior, and I find the current implementation  unstable and
>non-deterministic.  Documenting it more thoroughly does not make it less
>quirky or more stable.  I propose fixing it along the lines suggested in
>option B.  Karthik seems to agree.
>
>
>
>
>
>-- Lars
>
>
>----- Original Message -----
>From: Karthik Ranganathan
><kr...@fb.com>>
>To: "dev@hbase.apache.org<ma...@hbase.apache.org>"
><de...@hbase.apache.org>>; lars hofhansl <
>lhofhansl@yahoo.com<ma...@yahoo.com>>
>Cc:
>Sent: Tuesday, January 17, 2012 3:27 PM
>Subject: Re: Delete client API.
>
>
>@Srivas - totally agree that B is the correct thing to do.
>
>One way we have talked about implementing this is using the memstore ts.
>Every insert of a KV into the memstore is given a memstore-ts. These are
>persisted only till they are needed (to ensure read atomicity for
>scanners) and then that value is zeroed out on a subsequent compaction
>(saves space). If we retained the memstore-ts even beyond these
>compactions, we could get a deterministic order for the puts and deletes
>(first insert ts < del ts < second insert ts).
>
>Thanks
>Karthik
>
>
>On 1/17/12 2:14 PM, "M. C. Srivas"
><mc...@gmail.com>> wrote:
>
>On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl
><lh...@yahoo.com>>
>wrote:
>
>Yeah, it's confusing if one expects it to work like in a relational
>database.
>You can even do worse. If you by accident place a delete in the future
>all
>current inserts will be hidden until the next major compaction. :)
>I got confused about this myself just recently (see my mail on the
>dev-list).
>
>
>In the end this is a pretty powerful feature and core to how HBase works
>(not saying that is not confusing though).
>
>
>If one keeps the following two points in mind it makes more sense:
>1. Delete just sets a tomb stone marker at a specific TS (marking
>everything older as deleted).
>2. Everything is versioned, if no version is specified the current time
>(at the regionserver) is used.
>
>In your example1 below t3 > 6, hence the insert is hidden.
>In example2 both delete and insert TS are 6, hence the insert is hidden.
>
>
>Let's consider my example2 for a little longer. Sequence of events
>
> 1.  ins  val1  with TS=6 set by client
> 2.  del  entire row at TS=6 set by client
> 3.  ins  val2  with TS=6  set by client
> 4.  read row
>
>The row returns nothing even though the insert at step 3 happened after
>the
>delete at step 2. (step 2 masks even future inserts)
>
>Now, the same sequence with a compaction thrown in the middle:
>
> 1.  ins  val1  with TS=6 set by client
> 2.  del  entire row at TS=6 set by client
> 3.  ---- table is compacted -----
> 4.  ins  val2  with TS=6  set by client
> 5.  read row
>
>The row returns val2.  (the delete at step2 got lost due to compaction).
>
>So we have different results depending upon whether an internal
>re-organization (like a compaction) happened or not. If we want both
>sequences to behave exactly the same, then we need to first choose what is
>the proper (and deterministic) behavior.
>
>A.  if we think that the first sequence is the correct one, then the
>delete
>at step 2 needs to be preserved forever.
>
>or,
>
>B. if we think that the second sequence is the correct behavior (ie, a
>read
>always produces the same results independent of compaction), then the
>record needs a second "internal TS" field to allow the RS to distinguish
>the real sequence of events, and not rely upon the TS field which is
>settable by the client.
>
>My opinion:
>
>We should do B.  It is normal for someone to write code that says  "if old
>exists, delete it;  add new". A subsequent read should always reliably
>return "new".
>
>The current way of relying on a client-settable TS field to determine
>causal order results in quirky behavior, and quirky is not good.
>
>
>
>Look at these two examples:
>
>1. insert Val1  at real time t1
>2. <del>  at real time t2 > t1
>3. insert  Val2 at real time  t3 > t2
>
>1. insert Val1  with TS=1 at real time t1
>2. <del>  with TS = 2 at real time t2 > t1
>
>3. insert  Val2 with TS = 3 at real time  t3 > t2
>
>
>In both cases Val2 is visible.
>
>If your code sets your own timestamps, you better know what you're
>doing :)
>
>Note that my examples below are confusing even if you know how deletion
>in
>HBase works.
>You have to look at Delete.java to figure out what is happening.
>OK, since there were no objections in two days, I will commit my
>proposed change in HBASE-5205.
>
>
>-- Lars
>
>________________________________
>From: M. C. Srivas <mc...@gmail.com>>
>To: dev@hbase.apache.org<ma...@hbase.apache.org>; lars hofhansl
><lh...@yahoo.com>>
>Sent: Tuesday, January 17, 2012 8:13 AM
>Subject: Re: Delete client API.
>
>
>Delete seems to be confusing in general. Here are some examples that
>make
>me scratch my head (key is same in all the examples):
>
>Example1:
>----------------
>1. insert Val3  with TS=3  at real time t1
>2. insert Val5  with TS=5  at real time t2 > t1
>3. <del>    at real time t3 > t2
>4. insert  Val6  with TS=6  at real time  t4 > t3
>
>What does a read return?  (I would expect  Val6, since it was done
>last).
>But depending upon whether compaction happened or not between steps 3
>and
>4, I get either Val6 or  nothing.
>
>Example 2:
>-----------------
>1. insert Val3  with TS=3  at real time t1
>2. insert Val5  with TS=5  at real time t2 > t1
>3. <del>  TS=6  at real time t3 > t2
>4. insert  Val6  with TS=6  at real time  t4 > t3
>
>Note the difference in step 3 is this time a TS was specified by the
>client.
>
>What does a read return?  Again, I expect Val6 to be returned. But
>depending upon what's going on, I seem to get either Val5 or Val6.
>
>
>
>
>
>On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl
><lh...@yahoo.com>>
>wrote:
>
>There are some confusing parts about the Delete client API:
>1. calling deleteFamily removes all prior column or columns markers
>without checking the TS.
>2. delete{Column|Columns|Family} do not use the timestamp passed to
>Delete at construction time, but instead default to LATEST_TIMESTAMP.
>
>Delete d = new Delete(R,T);
>d.deleteFamily(CF);
>
>Does not do what you expect (won't use T for the family delete, but
>rather the current time).
>
>Neither does
>d.deleteColumns(CF, C1, T2);
>d.deleteFamily(CF, T1); // T1 < T2
>
>
>(the columns marker will be removed)
>
>
>#1 prevents Delete from adding a family marker F for time T1 and a
>column/columns marker for columns of F at T2 even if T2 > T1.
>#2 is just unexpected and different from what Put is doing.
>
>In HBASE-5205 I propose a simple patch to fix this.
>
>Since this is a (slight) API change, please provide feedback.
>
>Thanks.
>
>-- Lars
>
>
>
>
>

Re: Delete client API.

Posted by Ian Varley <iv...@salesforce.com>.

M.C., why would option B be superior to simply letting the native timestamps in HBase do what they were meant to do, and then storing your app-level logical timestamps in the cell itself along with the data? The (admittedly more correct) behavior you want is already the normal behavior when you're not setting application-defined timestamps.

In other words: HBase already has a timestamp that behaves as you describe, and only when you intentionally use it for another purpose does the behavior become non-intuitive. And, other things will become non-intuitive too, like replication.

In the FB messaging case, if I'm not mistaken, the official timestamp value is in use for something that isn't a timestamp at all (message ids, or something along those lines). So in that case, it would make sense that you'd want to also have another timestamp. I'm tempted to assert that that's an unusual use of the timestamp field, but then again, if the biggest use case of a product does something, it's hardly "unusual". :)

At the very least, since it would add overhead to every cell, this should be an opt-in behavior (the ability to say, "I'm setting my own timestamps, so HBase should also keep its own real timestamp"). But then again, what's the argument for doing that rather than storing the timestamps in your cell value? Is it the added abilities the API gives you around time ranges?
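
As a rough sketch of the "store it in the cell value" alternative: the application timestamp can be packed in front of the payload bytes, leaving the HBase timestamp to default to the server time. The class VersionedValue and its layout (8-byte big-endian long followed by UTF-8 payload) are assumptions made up for this example, not an HBase convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical value layout: 8-byte application timestamp, then the payload. */
class VersionedValue {
    /** Packs the application-level timestamp in front of the payload. */
    static byte[] encode(long appTs, String payload) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + data.length)
                .putLong(appTs)
                .put(data)
                .array();
    }

    /** Reads the application-level timestamp back out of a cell value. */
    static long appTimestamp(byte[] cell) {
        return ByteBuffer.wrap(cell).getLong();
    }

    /** Reads the payload that follows the timestamp prefix. */
    static String payload(byte[] cell) {
        return new String(cell, Long.BYTES, cell.length - Long.BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cell = encode(6L, "Val6");
        System.out.println(appTimestamp(cell) + " " + payload(cell));
    }
}
```

The trade-off is the one raised above: with this layout the server cannot apply time-range filters to the application timestamp, since it only sees an opaque value.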

Ian

On Jan 18, 2012, at 1:51 AM, M. C. Srivas wrote:

On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <lh...@yahoo.com>> wrote:

The memstoreTS is used for visibility during an intra-row transaction.
Are you proposing to do this only if the deletes/puts did not use the
current time?

The ability to define timestamps for all operations is crucial to HBase.
o It ensures that HTable.batch works correctly (which reorders Deletes
w.r.t. to Puts at the Region Server).
o It ensures that replication works correctly.
o many other scenarios

If you do not use application defined timestamp the current time is used
and everything works as expected.
If you use application defined timestamps you are asking for a delete to
be either in the future or the past, and you have to understand what that
means.
Maybe we should document the behavior better.


I guess I am saying that I *do* understand the current "delete with TS"
behavior, and I find the current implementation  unstable and
non-deterministic.  Documenting it more thoroughly does not make it less
quirky or more stable.  I propose fixing it along the lines suggested in
option B.  Karthik seems to agree.





-- Lars


----- Original Message -----
From: Karthik Ranganathan <kr...@fb.com>>
To: "dev@hbase.apache.org<ma...@hbase.apache.org>" <de...@hbase.apache.org>>; lars hofhansl <
lhofhansl@yahoo.com<ma...@yahoo.com>>
Cc:
Sent: Tuesday, January 17, 2012 3:27 PM
Subject: Re: Delete client API.


@Srivas - totally agree that B is the correct thing to do.

One way we have talked about implementing this is using the memstore ts.
Every insert of a KV into the memstore is given a memstore-ts. These are
persisted only till they are needed (to ensure read atomicity for
scanners) and then that value is zeroed out on a subsequent compaction
(saves space). If we retained the memstore-ts even beyond these
compactions, we could get a deterministic order for the puts and deletes
(first insert ts < del ts < second insert ts).

Thanks
Karthik


On 1/17/12 2:14 PM, "M. C. Srivas" <mc...@gmail.com>> wrote:

On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com>>
wrote:

Yeah, it's confusing if one expects it to work like in a relational
database.
You can even do worse. If you by accident place a delete in the future
all
current inserts will be hidden until the next major compaction. :)
I got confused about this myself just recently (see my mail on the
dev-list).


In the end this is a pretty powerful feature and core to how HBase works
(not saying that is not confusing though).


If one keeps the following two points in mind it makes more sense:
1. Delete just sets a tomb stone marker at a specific TS (marking
everything older as deleted).
2. Everything is versioned, if no version is specified the current time
(at the regionserver) is used.

In your example1 below t3 > 6, hence the insert is hidden.
In example2 both delete and insert TS are 6, hence the insert is hidden.


Let's consider my example2 for a little longer. Sequence of events

 1.  ins  val1  with TS=6 set by client
 2.  del  entire row at TS=6 set by client
 3.  ins  val2  with TS=6  set by client
 4.  read row

The row returns nothing even though the insert at step 3 happened after
the
delete at step 2. (step 2 masks even future inserts)

Now, the same sequence with a compaction thrown in the middle:

 1.  ins  val1  with TS=6 set by client
 2.  del  entire row at TS=6 set by client
 3.  ---- table is compacted -----
 4.  ins  val2  with TS=6  set by client
 5.  read row

The row returns val2.  (the delete at step2 got lost due to compaction).

So we have different results depending upon whether an internal
re-organization (like a compaction) happened or not. If we want both
sequences to behave exactly the same, then we need to first choose what is
the proper (and deterministic) behavior.

A.  if we think that the first sequence is the correct one, then the
delete
at step 2 needs to be preserved forever.

or,

B. if we think that the second sequence is the correct behavior (ie, a
read
always produces the same results independent of compaction), then the
record needs a second "internal TS" field to allow the RS to distinguish
the real sequence of events, and not rely upon the TS field which is
settable by the client.

My opinion:

We should do B.  It is normal for someone to write code that says  "if old
exists, delete it;  add new". A subsequent read should always reliably
return "new".

The current way of relying on a client-settable TS field to determine
causal order results in quirky behavior, and quirky is not good.



Look at these two examples:

1. insert Val1  at real time t1
2. <del>  at real time t2 > t1
3. insert  Val2 at real time  t3 > t2

1. insert Val1  with TS=1 at real time t1
2. <del>  with TS = 2 at real time t2 > t1

3. insert  Val2 with TS = 3 at real time  t3 > t2


In both cases Val2 is visible.

If your code sets your own timestamps, you better know what you're
doing :)

Note that my examples below are confusing even if you know how deletion
in
HBase works.
You have to look at Delete.java to figure out what is happening.
OK, since there were no objections in two days, I will commit my
proposed change in HBASE-5205.


-- Lars

________________________________
From: M. C. Srivas <mc...@gmail.com>>
To: dev@hbase.apache.org<ma...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>>
Sent: Tuesday, January 17, 2012 8:13 AM
Subject: Re: Delete client API.


Delete seems to be confusing in general. Here are some examples that
make
me scratch my head (key is same in all the examples):

Example1:
----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>    at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

What does a read return?  (I would expect  Val6, since it was done
last).
But depending upon whether compaction happened or not between steps 3
and
4, I get either Val6 or  nothing.

Example 2:
-----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>  TS=6  at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

Note the difference in step 3 is this time a TS was specified by the
client.

What does a read return?  Again, I expect Val6 to be returned. But
depending upon what's going on, I seem to get either Val5 or Val6.





On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com>>
wrote:

There are some confusing parts about the Delete client API:
1. calling deleteFamily removes all prior column or columns markers
without checking the TS.
2. delete{Column|Columns|Family} do not use the timestamp passed to
Delete at construction time, but instead default to LATEST_TIMESTAMP.

Delete d = new Delete(R,T);
d.deleteFamily(CF);

Does not do what you expect (won't use T for the family delete, but
rather the current time).

Neither does
d.deleteColumns(CF, C1, T2);
d.deleteFamily(CF, T1); // T1 < T2


(the columns marker will be removed)


#1 prevents Delete from adding a family marker F for time T1 and a
column/columns marker for columns of F at T2 even if T2 > T1.
#2 is just unexpected and different from what Put is doing.

In HBASE-5205 I propose a simple patch to fix this.

Since this is a (slight) API change, please provide feedback.

Thanks.

-- Lars

Re: Delete client API.

Posted by "M. C. Srivas" <mc...@gmail.com>.

On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <lh...@yahoo.com> wrote:

> The memstoreTS is used for visibility during an intra-row transaction.
> Are you proposing to do this only if the deletes/puts did not use the
> current time?
>
> The ability to define timestamps for all operations is crucial to HBase.
> o It ensures that HTable.batch works correctly (which reorders Deletes
> w.r.t. to Puts at the Region Server).
> o It ensures that replication works correctly.
> o many other scenarios
>
> If you do not use application defined timestamp the current time is used
> and everything works as expected.
> If you use application defined timestamps you are asking for a delete to
> be either in the future or the past, and you have to understand what that
> means.
> Maybe we should document the behavior better.
>

I guess I am saying that I *do* understand the current "delete with TS"
behavior, and I find the current implementation  unstable and
non-deterministic.  Documenting it more thoroughly does not make it less
quirky or more stable.  I propose fixing it along the lines suggested in
option B.  Karthik seems to agree.
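
A minimal sketch of what option B could look like, as a toy single-cell model. Every entry carries the client-settable TS plus an internal sequence number, and a delete only masks puts that arrived before it. The class SequencedCellModel and all of its methods are hypothetical and exist only for this illustration; nothing here is HBase code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of option B: client TS plus an internal arrival sequence. */
class SequencedCellModel {
    static final class Entry {
        final long ts;      // client-settable timestamp
        final long seq;     // internal arrival order, never set by the client
        final String value; // null marks a delete

        Entry(long ts, long seq, String value) {
            this.ts = ts;
            this.seq = seq;
            this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private long nextSeq = 1;

    void put(long ts, String value) { entries.add(new Entry(ts, nextSeq++, value)); }

    void delete(long ts) { entries.add(new Entry(ts, nextSeq++, null)); }

    /** A put is masked only by a delete with ts >= put.ts that arrived later. */
    private boolean masked(Entry p) {
        for (Entry d : entries) {
            if (d.value == null && p.ts <= d.ts && p.seq < d.seq) return true;
        }
        return false;
    }

    /** Returns the visible value with the highest TS (ties: latest arrival), or null. */
    String read() {
        Entry best = null;
        for (Entry e : entries) {
            if (e.value == null || masked(e)) continue;
            if (best == null || e.ts > best.ts || (e.ts == best.ts && e.seq > best.seq)) best = e;
        }
        return best == null ? null : best.value;
    }

    /** Models a major compaction: drop masked puts and all delete markers. */
    void majorCompact() {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (e.value != null && !masked(e)) kept.add(e);
        }
        entries.clear();
        entries.addAll(kept);
    }
}
```

In this model the "ins val1 / del / ins val2" sequence with TS=6 throughout reads back val2 whether or not majorCompact() runs between the delete and the second insert.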




>
> -- Lars
>
>
> ----- Original Message -----
> From: Karthik Ranganathan <kr...@fb.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> lhofhansl@yahoo.com>
> Cc:
> Sent: Tuesday, January 17, 2012 3:27 PM
> Subject: Re: Delete client API.
>
>
> @Srivas - totally agree that B is the correct thing to do.
>
> One way we have talked about implementing this is using the memstore ts.
> Every insert of a KV into the memstore is given a memstore-ts. These are
> persisted only till they are needed (to ensure read atomicity for
> scanners) and then that value is zeroed out on a subsequent compaction
> (saves space). If we retained the memstore-ts even beyond these
> compactions, we could get a deterministic order for the puts and deletes
> (first insert ts < del ts < second insert ts).
>
> Thanks
> Karthik
>
>
> On 1/17/12 2:14 PM, "M. C. Srivas" <mc...@gmail.com> wrote:
>
> >On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com>
> >wrote:
> >
> >> Yeah, it's confusing if one expects it to work like in a relational
> >> database.
> >> You can even do worse. If you by accident place a delete in the future
> >>all
> >> current inserts will be hidden until the next major compaction. :)
> >> I got confused about this myself just recently (see my mail on the
> >> dev-list).
> >>
> >>
> >> In the end this is a pretty powerful feature and core to how HBase works
> >> (not saying that is not confusing though).
> >>
> >>
> >> If one keeps the following two points in mind it makes more sense:
> >> 1. Delete just sets a tomb stone marker at a specific TS (marking
> >> everything older as deleted).
> >> 2. Everything is versioned, if no version is specified the current time
> >> (at the regionserver) is used.
> >>
> >> In your example1 below t3 > 6, hence the insert is hidden.
> >> In example2 both delete and insert TS are 6, hence the insert is hidden.
> >>
> >
> >Let's consider my example2 for a little longer. Sequence of events
> >
> >   1.  ins  val1  with TS=6 set by client
> >   2.  del  entire row at TS=6 set by client
> >   3.  ins  val2  with TS=6  set by client
> >   4.  read row
> >
> >The row returns nothing even though the insert at step 3 happened after
> >the
> >delete at step 2. (step 2 masks even future inserts)
> >
> >Now, the same sequence with a compaction thrown in the middle:
> >
> >   1.  ins  val1  with TS=6 set by client
> >   2.  del  entire row at TS=6 set by client
> >   3.  ---- table is compacted -----
> >   4.  ins  val2  with TS=6  set by client
> >   5.  read row
> >
> >The row returns val2.  (the delete at step2 got lost due to compaction).
> >
> >So we have different results depending upon whether an internal
> >re-organization (like a compaction) happened or not. If we want both
> >sequences to behave exactly the same, then we need to first choose what is
> >the proper (and deterministic) behavior.
> >
> >A.  if we think that the first sequence is the correct one, then the
> >delete
> >at step 2 needs to be preserved forever.
> >
> >or,
> >
> >B. if we think that the second sequence is the correct behavior (ie, a
> >read
> >always produces the same results independent of compaction), then the
> >record needs a second "internal TS" field to allow the RS to distinguish
> >the real sequence of events, and not rely upon the TS field which is
> >settable by the client.
> >
> >My opinion:
> >
> >We should do B.  It is normal for someone to write code that says  "if old
> >exists, delete it;  add new". A subsequent read should always reliably
> >return "new".
> >
> >The current way of relying on a client-settable TS field to determine
> >causal order results in quirky behavior, and quirky is not good.
> >
> >
> >
> >> Look at these two examples:
> >>
> >> 1. insert Val1  at real time t1
> >> 2. <del>  at real time t2 > t1
> >> 3. insert  Val2 at real time  t3 > t2
> >>
> >> 1. insert Val1  with TS=1 at real time t1
> >> 2. <del>  with TS = 2 at real time t2 > t1
> >>
> >> 3. insert  Val2 with TS = 3 at real time  t3 > t2
> >>
> >>
> >> In both cases Val2 is visible.
> >>
> >> If your code sets your own timestamps, you better know what you're
> >> doing :)
> >>
> >> Note that my examples below are confusing even if you know how deletion
> >>in
> >> HBase works.
> >> You have to look at Delete.java to figure out what is happening.
> >> OK, since there were no objections in two days, I will commit my
> >> proposed change in HBASE-5205.
> >>
> >>
> >> -- Lars
> >>
> >> ________________________________
> >> From: M. C. Srivas <mc...@gmail.com>
> >> To: dev@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> >> Sent: Tuesday, January 17, 2012 8:13 AM
> >> Subject: Re: Delete client API.
> >>
> >>
> >> Delete seems to be confusing in general. Here are some examples that
> >>make
> >> me scratch my head (key is same in all the examples):
> >>
> >> Example1:
> >> ----------------
> >> 1. insert Val3  with TS=3  at real time t1
> >> 2. insert Val5  with TS=5  at real time t2 > t1
> >> 3. <del>    at real time t3 > t2
> >> 4. insert  Val6  with TS=6  at real time  t4 > t3
> >>
> >> What does a read return?  (I would expect  Val6, since it was done
> >>last).
> >> But depending upon whether compaction happened or not between steps 3
> >>and
> >> 4, I get either Val6 or  nothing.
> >>
> >> Example 2:
> >> -----------------
> >> 1. insert Val3  with TS=3  at real time t1
> >> 2. insert Val5  with TS=5  at real time t2 > t1
> >> 3. <del>  TS=6  at real time t3 > t2
> >> 4. insert  Val6  with TS=6  at real time  t4 > t3
> >>
> >> Note the difference in step 3 is this time a TS was specified by the
> >> client.
> >>
> >> What does a read return?  Again, I expect Val6 to be returned. But
> >> depending upon what's going on, I seem to get either Val5 or Val6.
> >>
> >>
> >>
> >>
> >>
> >> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com>
> >> wrote:
> >>
> >> There are some confusing parts about the Delete client API:
> >> >1. calling deleteFamily removes all prior column or columns markers
> >> without checking the TS.
> >> >2. delete{Column|Columns|Family} do not use the timestamp passed to
> >> Delete at construction time, but instead default to LATEST_TIMESTAMP.
> >> >
> >> >  Delete d = new Delete(R,T);
> >> >  d.deleteFamily(CF);
> >> >
> >> >Does not do what you expect (won't use T for the family delete, but
> >> rather the current time).
> >> >
> >> >Neither does
> >> >  d.deleteColumns(CF, C1, T2);
> >> >  d.deleteFamily(CF, T1); // T1 < T2
> >> >
> >> >
> >> >(the columns marker will be removed)
> >> >
> >> >
> >> >#1 prevents Delete from adding a family marker F for time T1 and a
> >> column/columns marker for columns of F at T2 even if T2 > T1.
> >> >#2 is just unexpected and different from what Put is doing.
> >> >
> >> >In HBASE-5205 I propose a simple patch to fix this.
> >> >
> >> >Since this is a (slight) API change, please provide feedback.
> >> >
> >> >Thanks.
> >> >
> >> >-- Lars
> >> >
> >> >
> >>
>

Re: Delete client API.

Posted by lars hofhansl <lh...@yahoo.com>.

The memstoreTS is used for visibility during an intra-row transaction.
Are you proposing to do this only if the deletes/puts did not use the current time?

The ability to define timestamps for all operations is crucial to HBase.
o It ensures that HTable.batch works correctly (which reorders Deletes w.r.t. Puts at the Region Server).
o It ensures that replication works correctly.
o many other scenarios

If you do not use application-defined timestamps, the current time is used and everything works as expected.
If you use application-defined timestamps, you are asking for a delete to be either in the future or the past, and you have to understand what that means.
Maybe we should document the behavior better.
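
As a starting point for such documentation, here is a toy single-cell model of the timestamp-only semantics discussed in this thread: a delete marker at TS hides every put with a timestamp at or below TS, regardless of arrival order, until a major compaction drops the marker. The class TombstoneModel and its methods are made up for illustration; this is a deliberately simplified sketch, not HBase code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of timestamp-based tombstones for one cell; illustrative only. */
class TombstoneModel {
    static final class Put {
        final long ts;
        final String value;

        Put(long ts, String value) {
            this.ts = ts;
            this.value = value;
        }
    }

    private final List<Put> puts = new ArrayList<>();
    private final List<Long> deleteMarkers = new ArrayList<>();

    void put(long ts, String value) { puts.add(new Put(ts, value)); }

    void delete(long ts) { deleteMarkers.add(ts); }

    /** A put is hidden if any marker's timestamp is >= the put's timestamp. */
    private boolean masked(Put p) {
        for (long d : deleteMarkers) {
            if (p.ts <= d) return true;
        }
        return false;
    }

    /** Returns the visible value with the highest timestamp, or null if none. */
    String read() {
        Put best = null;
        for (Put p : puts) {
            if (masked(p)) continue;
            if (best == null || p.ts >= best.ts) best = p;
        }
        return best == null ? null : best.value;
    }

    /** Models a major compaction: drop hidden puts, then drop the markers. */
    void majorCompact() {
        puts.removeIf(this::masked);
        deleteMarkers.clear();
    }
}
```

In this model, put(6), delete(6), put(6) reads back nothing, while the same sequence with majorCompact() after the delete reads back the second value, which is the compaction-dependent result raised earlier in the thread.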

-- Lars


----- Original Message -----
From: Karthik Ranganathan <kr...@fb.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Tuesday, January 17, 2012 3:27 PM
Subject: Re: Delete client API.


@Srivas - totally agree that B is the correct thing to do.

One way we have talked about implementing this is using the memstore ts.
Every insert of a KV into the memstore is given a memstore-ts. These are
persisted only till they are needed (to ensure read atomicity for
scanners) and then that value is zeroed out on a subsequent compaction
(saves space). If we retained the memstore-ts even beyond these
compactions, we could get a deterministic order for the puts and deletes
(first insert ts < del ts < second insert ts).

Thanks
Karthik


On 1/17/12 2:14 PM, "M. C. Srivas" <mc...@gmail.com> wrote:

>On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com>
>wrote:
>
>> Yeah, it's confusing if one expects it to work like in a relational
>> database.
>> You can even do worse. If you by accident place a delete in the future
>>all
>> current inserts will be hidden until the next major compaction. :)
>> I got confused about this myself just recently (see my mail on the
>> dev-list).
>>
>>
>> In the end this is a pretty powerful feature and core to how HBase works
>> (not saying that is not confusing though).
>>
>>
>> If one keeps the following two points in mind it makes more sense:
>> 1. Delete just sets a tomb stone marker at a specific TS (marking
>> everything older as deleted).
>> 2. Everything is versioned, if no version is specified the current time
>> (at the regionserver) is used.
>>
>> In your example1 below t3 > 6, hence the insert is hidden.
>> In example2 both delete and insert TS are 6, hence the insert is hidden.
>>
>
>Let's consider my example2 for a little longer. Sequence of events
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ins  val2  with TS=6  set by client
>   4.  read row
>
>The row returns nothing even though the insert at step 3 happened after
>the
>delete at step 2. (step 2 masks even future inserts)
>
>Now, the same sequence with a compaction thrown in the middle:
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ---- table is compacted -----
>   4.  ins  val2  with TS=6  set by client
>   5.  read row
>
>The row returns val2.  (the delete at step2 got lost due to compaction).
>
>So we have different results depending upon whether an internal
>re-organization (like a compaction) happened or not. If we want both
>sequences to behave exactly the same, then we need to first choose what is
>the proper (and deterministic) behavior.
>
>A.  if we think that the first sequence is the correct one, then the
>delete
>at step 2 needs to be preserved forever.
>
>or,
>
>B. if we think that the second sequence is the correct behavior (ie, a
>read
>always produces the same results independent of compaction), then the
>record needs a second "internal TS" field to allow the RS to distinguish
>the real sequence of events, and not rely upon the TS field which is
>settable by the client.
>
>My opinion:
>
>We should do B.  It is normal for someone to write code that says  "if old
>exists, delete it;  add new". A subsequent read should always reliably
>return "new".
>
>The current way of relying on a client-settable TS field to determine
>causal order results in quirky behavior, and quirky is not good.
>
>
>
>> Look at these two examples:
>>
>> 1. insert Val1  at real time t1
>> 2. <del>  at real time t2 > t1
>> 3. insert  Val2 at real time  t3 > t2
>>
>> 1. insert Val1  with TS=1 at real time t1
>> 2. <del>  with TS = 2 at real time t2 > t1
>>
>> 3. insert  Val2 with TS = 3 at real time  t3 > t2
>>
>>
>> In both cases Val2 is visible.
>>
>> If your code sets your own timestamps, you better know what you're
>> doing :)
>>
>> Note that my examples below are confusing even if you know how deletion
>>in
>> HBase works.
>> You have to look at Delete.java to figure out what is happening.
>> OK, since there were no objections in two days, I will commit my
>> proposed change in HBASE-5205.
>>
>>
>> -- Lars
>>
>> ________________________________
>> From: M. C. Srivas <mc...@gmail.com>
>> To: dev@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> Sent: Tuesday, January 17, 2012 8:13 AM
>> Subject: Re: Delete client API.
>>
>>
>> Delete seems to be confusing in general. Here are some examples that
>>make
>> me scratch my head (key is same in all the examples):
>>
>> Example1:
>> ----------------
>> 1. insert Val3  with TS=3  at real time t1
>> 2. insert Val5  with TS=5  at real time t2 > t1
>> 3. <del>    at real time t3 > t2
>> 4. insert  Val6  with TS=6  at real time  t4 > t3
>>
>> What does a read return?  (I would expect  Val6, since it was done
>>last).
>> But depending upon whether compaction happened or not between steps 3
>>and
>> 4, I get either Val6 or  nothing.
>>
>> Example 2:
>> -----------------
>> 1. insert Val3  with TS=3  at real time t1
>> 2. insert Val5  with TS=5  at real time t2 > t1
>> 3. <del>  TS=6  at real time t3 > t2
>> 4. insert  Val6  with TS=6  at real time  t4 > t3
>>
>> Note the difference in step 3 is this time a TS was specified by the
>> client.
>>
>> What does a read return?  Again, I expect Val6 to be returned. But
>> depending upon what's going on, I seem to get either Val5 or Val6.
>>
>>
>>
>>
>>
>> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com>
>> wrote:
>>
>> There are some confusing parts about the Delete client API:
>> >1. calling deleteFamily removes all prior column or columns markers
>> without checking the TS.
>> >2. delete{Column|Columns|Family} do not use the timestamp passed to
>> Delete at construction time, but instead default to LATEST_TIMESTAMP.
>> >
>> >  Delete d = new Delete(R,T);
>> >  d.deleteFamily(CF);
>> >
>> >Does not do what you expect (won't use T for the family delete, but
>> rather the current time).
>> >
>> >Neither does
>> >  d.deleteColumns(CF, C1, T2);
>> >  d.deleteFamily(CF, T1); // T1 < T2
>> >
>> >
>> >(the columns marker will be removed)
>> >
>> >
>> >#1 prevents Delete from adding a family marker F for time T1 and a
>> column/columns marker for columns of F at T2 even if T2 > T1.
>> >#2 is just unexpected and different from what Put is doing.
>> >
>> >In HBASE-5205 I propose a simple patch to fix this.
>> >
>> >Since this is a (slight) API change, please provide feed back.
>> >
>> >Thanks.
>> >
>> >-- Lars
>> >
>> >
>>

Re: Delete client API.

Posted by Karthik Ranganathan <kr...@fb.com>.

@Srivas - totally agree that B is the correct thing to do.

One way we have talked about implementing this is using the memstore-ts.
Every insert of a KV into the memstore is given a memstore-ts. These values
are persisted only as long as they are needed (to ensure read atomicity for
scanners) and are then zeroed out on a subsequent compaction (to save
space). If we retained the memstore-ts even beyond these compactions, we
could get a deterministic order for the puts and deletes
(first insert ts < del ts < second insert ts).
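
To make the idea concrete, here is a minimal sketch in Java of how a retained per-mutation sequence number could order events. It is a toy in-memory model, not HBase code: the class SequencedCellModel, its methods, and the rule "a delete marker masks only puts that arrived before it" are assumptions made for illustration. The thread does not specify how a retained memstore-ts would actually be consulted.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one cell's history (hypothetical, not HBase code).
// Each mutation carries the client-settable timestamp plus a server-assigned
// sequence number that stands in for a retained memstore-ts.
final class SequencedCellModel {
    private static final class Entry {
        final long ts;      // client-settable timestamp
        final long seq;     // arrival order, assigned by the model
        final String value; // null marks a delete marker

        Entry(long ts, long seq, String value) {
            this.ts = ts;
            this.seq = seq;
            this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private long nextSeq = 1;

    void put(long ts, String value) { entries.add(new Entry(ts, nextSeq++, value)); }

    void delete(long ts) { entries.add(new Entry(ts, nextSeq++, null)); }

    // Assumed rule: a marker masks a put only if the put is not newer by
    // timestamp AND arrived before the marker.
    private boolean masked(Entry put) {
        for (Entry e : entries) {
            if (e.value == null && put.ts <= e.ts && put.seq < e.seq) {
                return true;
            }
        }
        return false;
    }

    // Drops markers and masked puts; surviving puts keep their sequence number.
    void majorCompact() {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (e.value != null && !masked(e)) {
                kept.add(e);
            }
        }
        entries.clear();
        entries.addAll(kept);
    }

    // Returns the unmasked put with the highest timestamp; ties go to the later arrival.
    String read() {
        Entry best = null;
        for (Entry e : entries) {
            if (e.value == null || masked(e)) {
                continue;
            }
            if (best == null || e.ts > best.ts || (e.ts == best.ts && e.seq > best.seq)) {
                best = e;
            }
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        SequencedCellModel cell = new SequencedCellModel();
        cell.put(6, "val1");
        cell.delete(6);
        cell.put(6, "val2");
        System.out.println(cell.read()); // val2, with or without a compaction after the delete
    }
}
```

In this sketch the read result no longer depends on whether majorCompact() ran between the delete and the second insert. The price is that a delete with a future timestamp would no longer hide later inserts, which is a change from the behavior described elsewhere in this thread.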

Thanks
Karthik


On 1/17/12 2:14 PM, "M. C. Srivas" <mc...@gmail.com> wrote:

>On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com>
>wrote:
>
>> Yeah, it's confusing if one expects it to work like in a relational
>> database.
>> You can even do worse. If you by accident place a delete in the future, all
>> current inserts will be hidden until the next major compaction. :)
>> I got confused about this myself just recently (see my mail on the
>> dev-list).
>>
>>
>> In the end this is a pretty powerful feature and core to how HBase works
>> (not saying that is not confusing though).
>>
>>
>> If one keeps the following two points in mind it makes more sense:
>> 1. Delete just sets a tombstone marker at a specific TS (marking
>> everything at or before that TS as deleted).
>> 2. Everything is versioned; if no version is specified, the current time
>> (at the regionserver) is used.
>>
>> In your Example 1 below, the delete gets the current time t3 and t3 > 6,
>> hence the insert is hidden.
>> In Example 2, both the delete and the insert have TS=6, hence the insert
>> is hidden.
>>
>
>Let's consider my example 2 for a little longer. Sequence of events:
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ins  val2  with TS=6  set by client
>   4.  read row
>
>The row returns nothing even though the insert at step 3 happened after the
>delete at step 2. (step 2 masks even future inserts)
>
>Now, the same sequence with a compaction thrown in the middle:
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ---- table is compacted -----
>   4.  ins  val2  with TS=6  set by client
>   5.  read row
>
>The row returns val2.  (the delete at step 2 got lost due to compaction).
>
>So we have different results depending upon whether an internal
>re-organization (like a compaction) happened or not. If we want both
>sequences to behave exactly the same, then we need to first choose what is
>the proper (and deterministic) behavior.
>
>A.  if we think that the first sequence is the correct one, then the delete
>at step 2 needs to be preserved forever.
>
>or,
>
>B. if we think that the second sequence is the correct behavior (i.e., a read
>always produces the same results independent of compaction), then the
>record needs a second "internal TS" field to allow the RS to distinguish
>the real sequence of events, and not rely upon the TS field which is
>settable by the client.
>
>My opinion:
>
>We should do B.  It is normal for someone to write code that says  "if old
>exists, delete it;  add new". A subsequent read should always reliably
>return "new".
>
>The current way of relying on a client-settable TS field to determine
>causal order results in quirky behavior, and quirky is not good.

Re: Delete client API.

Posted by "M. C. Srivas" <mc...@gmail.com>.

On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <lh...@yahoo.com> wrote:

> Yeah, it's confusing if one expects it to work like in a relational
> database.
> You can even do worse. If you by accident place a delete in the future, all
> current inserts will be hidden until the next major compaction. :)
> I got confused about this myself just recently (see my mail on the
> dev-list).
>
>
> In the end this is a pretty powerful feature and core to how HBase works
> (not saying that is not confusing though).
>
>
> If one keeps the following two points in mind it makes more sense:
> 1. Delete just sets a tombstone marker at a specific TS (marking
> everything at or before that TS as deleted).
> 2. Everything is versioned; if no version is specified, the current time
> (at the regionserver) is used.
>
> In your Example 1 below, the delete gets the current time t3 and t3 > 6,
> hence the insert is hidden.
> In Example 2, both the delete and the insert have TS=6, hence the insert
> is hidden.
>

Let's consider my example 2 for a little longer. Sequence of events:

   1.  ins  val1  with TS=6 set by client
   2.  del  entire row at TS=6 set by client
   3.  ins  val2  with TS=6  set by client
   4.  read row

The row returns nothing even though the insert at step 3 happened after the
delete at step 2. (step 2 masks even future inserts)

Now, the same sequence with a compaction thrown in the middle:

   1.  ins  val1  with TS=6 set by client
   2.  del  entire row at TS=6 set by client
   3.  ---- table is compacted -----
   4.  ins  val2  with TS=6  set by client
   5.  read row

The row returns val2.  (the delete at step 2 got lost due to compaction).

So we have different results depending upon whether an internal
re-organization (like a compaction) happened or not. If we want both
sequences to behave exactly the same, then we need to first choose what is
the proper (and deterministic) behavior.
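
The two sequences above can be reproduced with a small toy model. The Java sketch below is not HBase code: TombstoneModel and its methods are hypothetical, and it assumes only the behavior described in this thread (a delete marker hides every put with a timestamp at or before its own, and a major compaction drops the marker together with the puts it hides).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one cell's history (hypothetical, not HBase code).
final class TombstoneModel {
    // A stored entry: a put when value != null, a delete marker when value == null.
    private static final class Entry {
        final long ts;
        final String value;

        Entry(long ts, String value) {
            this.ts = ts;
            this.value = value;
        }

        boolean isDelete() { return value == null; }
    }

    private final List<Entry> entries = new ArrayList<>();

    void put(long ts, String value) { entries.add(new Entry(ts, value)); }

    void delete(long ts) { entries.add(new Entry(ts, null)); }

    // Highest timestamp of any delete marker, or Long.MIN_VALUE if there is none.
    private long maxDeleteTs() {
        long max = Long.MIN_VALUE;
        for (Entry e : entries) {
            if (e.isDelete()) {
                max = Math.max(max, e.ts);
            }
        }
        return max;
    }

    // Assumed compaction behavior: markers and the puts they hide are removed.
    void majorCompact() {
        long maxDeleteTs = maxDeleteTs();
        entries.removeIf(e -> e.isDelete() || e.ts <= maxDeleteTs);
    }

    // Returns the newest put that is not hidden by a marker, or null.
    String read() {
        long maxDeleteTs = maxDeleteTs();
        Entry best = null;
        for (Entry e : entries) {
            if (e.isDelete() || e.ts <= maxDeleteTs) {
                continue;
            }
            if (best == null || e.ts >= best.ts) {
                best = e;
            }
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        TombstoneModel first = new TombstoneModel();
        first.put(6, "val1");
        first.delete(6);
        first.put(6, "val2");
        System.out.println(first.read()); // null: the marker still hides val2

        TombstoneModel second = new TombstoneModel();
        second.put(6, "val1");
        second.delete(6);
        second.majorCompact();
        second.put(6, "val2");
        System.out.println(second.read()); // val2: the marker is gone
    }
}
```

The only difference between the two runs is the call to majorCompact(), which is the non-determinism being discussed.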

A.  if we think that the first sequence is the correct one, then the delete
at step 2 needs to be preserved forever.

or,

B. if we think that the second sequence is the correct behavior (i.e., a read
always produces the same results independent of compaction), then the
record needs a second "internal TS" field to allow the RS to distinguish
the real sequence of events, and not rely upon the TS field which is
settable by the client.

My opinion:

We should do B.  It is normal for someone to write code that says  "if old
exists, delete it;  add new". A subsequent read should always reliably
return "new".

The current way of relying on a client-settable TS field to determine
causal order results in quirky behavior, and quirky is not good.


Re: Delete client API.

Posted by lars hofhansl <lh...@yahoo.com>.

Yeah, it's confusing if one expects it to work like in a relational database.
You can even do worse. If you by accident place a delete in the future, all current inserts will be hidden until the next major compaction. :)
I got confused about this myself just recently (see my mail on the dev-list).

In the end this is a pretty powerful feature and core to how HBase works (not saying that it is not confusing, though).

If one keeps the following two points in mind it makes more sense:
1. Delete just sets a tombstone marker at a specific TS (marking everything at or before that TS as deleted).
2. Everything is versioned; if no version is specified, the current time (at the regionserver) is used.

In your Example 1 below, the delete gets the current time t3 and t3 > 6, hence the insert is hidden.
In Example 2, both the delete and the insert have TS=6, hence the insert is hidden.

Look at these two examples:

1. insert Val1  at real time t1
2. <del>  at real time t2 > t1
3. insert  Val2 at real time  t3 > t2

1. insert Val1  with TS=1 at real time t1
2. <del>  with TS = 2 at real time t2 > t1
3. insert  Val2 with TS = 3 at real time  t3 > t2

In both cases Val2 is visible.
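
As a rough illustration, the two rules can be written down as a pair of functions. This is a sketch only: DeleteVisibility, effectiveTs and isHidden are hypothetical names, not part of the HBase API, and the class encodes nothing beyond the two points listed above.

```java
// Toy helper (hypothetical, not HBase code) encoding the two rules above.
final class DeleteVisibility {
    private DeleteVisibility() {}

    // Rule 2: an operation without an explicit timestamp gets the server's current time.
    static long effectiveTs(Long clientTs, long serverNowMillis) {
        return clientTs != null ? clientTs : serverNowMillis;
    }

    // Rule 1: a marker at deleteTs hides every put whose timestamp is <= deleteTs.
    static boolean isHidden(long putTs, long deleteTs) {
        return putTs <= deleteTs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        // Example 1: delete without a TS (so it gets "now"), then a put with TS=6.
        System.out.println(isHidden(6, effectiveTs(null, now))); // true

        // Example 2: delete and put both carry TS=6.
        System.out.println(isHidden(6, effectiveTs(6L, now)));   // true

        // Client-set TS 1, 2, 3: the put at TS=3 is newer than the delete at TS=2.
        System.out.println(isHidden(3, effectiveTs(2L, now)));   // false
    }
}
```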

If your code sets its own timestamps, you had better know what you're doing :)

Note that my examples below are confusing even if you know how deletion in HBase works.
You have to look at Delete.java to figure out what is happening.
OK, since there were no objections in two days, I will commit my proposed change in HBASE-5205.

-- Lars

________________________________
From: M. C. Srivas <mc...@gmail.com>
To: dev@hbase.apache.org; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, January 17, 2012 8:13 AM
Subject: Re: Delete client API.

Delete seems to be confusing in general. Here are some examples that make me scratch my head (the key is the same in all the examples):

Example 1:
----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>    at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

What does a read return?  (I would expect  Val6, since it was done last). But depending upon whether compaction happened or not between steps 3 and 4, I get either Val6 or  nothing.

Example 2:
-----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>  TS=6  at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

Note that the difference in step 3 is that this time a TS was specified by the client.

What does a read return?  Again, I expect Val6 to be returned. But depending upon what's going on, I seem to get either Val5 or Val6.

On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com> wrote:

>There are some confusing parts about the Delete client API:
>1. calling deleteFamily removes all prior column or columns markers without checking the TS.
>2. delete{Column|Columns|Family} do not use the timestamp passed to Delete at construction time, but instead default to LATEST_TIMESTAMP.
>
>  Delete d = new Delete(R,T);
>  d.deleteFamily(CF);
>
>Does not do what you expect (won't use T for the family delete, but rather the current time).
>
>Neither does
>  d.deleteColumns(CF, C1, T2);
>  d.deleteFamily(CF, T1); // T1 < T2
>
>
>(the columns marker will be removed)
>
>
>#1 prevents Delete from adding a family marker F for time T1 and a column/columns marker for columns of F at T2 even if T2 > T1.
>#2 is just unexpected and different from what Put is doing.
>
>In HBASE-5205 I propose a simple patch to fix this.
>
>Since this is a (slight) API change, please provide feedback.
>
>Thanks.
>
>-- Lars
>
>

Re: Delete client API.

Posted by "M. C. Srivas" <mc...@gmail.com>.

Delete seems to be confusing in general. Here are some examples that make
me scratch my head (the key is the same in all the examples):

Example 1:
----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>    at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

What does a read return?  (I would expect  Val6, since it was done last).
But depending upon whether compaction happened or not between steps 3 and
4, I get either Val6 or  nothing.

Example 2:
-----------------
1. insert Val3  with TS=3  at real time t1
2. insert Val5  with TS=5  at real time t2 > t1
3. <del>  TS=6  at real time t3 > t2
4. insert  Val6  with TS=6  at real time  t4 > t3

Note that the difference in step 3 is that this time a TS was specified by the client.

What does a read return?  Again, I expect Val6 to be returned. But
depending upon what's going on, I seem to get either Val5 or Val6.

On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <lh...@yahoo.com> wrote:

> There are some confusing parts about the Delete client API:
> 1. calling deleteFamily removes all prior column or columns markers
> without checking the TS.
> 2. delete{Column|Columns|Family} do not use the timestamp passed to Delete
> at construction time, but instead default to LATEST_TIMESTAMP.
>
>   Delete d = new Delete(R,T);
>   d.deleteFamily(CF);
>
> Does not do what you expect (won't use T for the family delete, but rather
> the current time).
>
> Neither does
>   d.deleteColumns(CF, C1, T2);
>   d.deleteFamily(CF, T1); // T1 < T2
>
>
> (the columns marker will be removed)
>
>
> #1 prevents Delete from adding a family marker F for time T1 and a
> column/columns marker for columns of F at T2 even if T2 > T1.
> #2 is just unexpected and different from what Put is doing.
>
> In HBASE-5205 I propose a simple patch to fix this.
>
> Since this is a (slight) API change, please provide feedback.
>
> Thanks.
>
> -- Lars
>
>