Posted to user@hbase.apache.org by Ferdy <fe...@kalooga.com> on 2010/03/08 13:53:50 UTC

Best way to do a clean update of a row

Hi,

Sometimes we wish to do a clean update of a row, that is: make sure that
any old column values that are not in the new Put are removed.

This is how we're doing this now (HBase 0.20.3):

// delRow and putRow refer to the same row,
// but the row may currently contain columns that are not redefined in putRow
HTable htable = new HTable("tablename");
htable.delete(delRow);
htable.put(putRow);

We just call these sequentially (single-threaded). However, could it be 
possible that the delete is issued somehow AFTER the put? The htable 
object has default settings (in other words there is no ). The reason 
why I'm asking is that we are probably experiencing missing row issues.

If so, is there a better way to update a row while discarding old
column values?

Regards,
Ferdy

Re: Best way to do a clean update of a row

Posted by Ferdy <fe...@kalooga.com>.
Hey,

The column names indeed change between versions. For now I will adopt
solution B, and accept the fact that in very rare cases old columns may
not be deleted (which could happen when a client does a put while its
clock is running ahead). That shouldn't occur very often, since our system
clocks are pretty accurate and updates will not happen more frequently
than once every hour.

Ferdy

Jonathan Gray wrote:
> Ferdy,
>
> Another strategy might be to not issue the delete and just insert a new
> version on top of the old one.
>
> Whether this makes sense or not depends on whether the columns for that row
> change between versions.  If it's always the same columns then you can just
> re-insert and when you grab the latest version you will only see the new
> one.  If they change, you would need to follow one of your other strategies.
>
> I would probably not use solution A just because there's not really a need
> to introduce a client-side pause.  I would opt for grabbing now() and
> incrementing the Put stamp by 1.
>
> This issue is currently under discussion and we'd really like to get this
> kind of unexpected (but understandable) behavior to be a little more user
> friendly so that if you put after a delete you would actually see it.
> There's no estimated time for it but until then you can try the workarounds.
>
> JG

RE: Best way to do a clean update of a row

Posted by Jonathan Gray <jl...@streamy.com>.
Ferdy,

Another strategy might be to not issue the delete and just insert a new
version on top of the old one.

Whether this makes sense or not depends on whether the columns for that row
change between versions.  If it's always the same columns then you can just
re-insert and when you grab the latest version you will only see the new
one.  If they change, you would need to follow one of your other strategies.

I would probably not use solution A just because there's not really a need
to introduce a client-side pause.  I would opt for grabbing now() and
incrementing the Put stamp by 1.

This issue is currently under discussion and we'd really like to get this
kind of unexpected (but understandable) behavior to be a little more user
friendly so that if you put after a delete you would actually see it.
There's no estimated time for it but until then you can try the workarounds.

JG
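
The "insert a new version on top" strategy can be illustrated with a small sketch. It is plain Java, not the HBase API: the class OverwriteWithoutDelete and the method latestView are hypothetical names used only for illustration, and the model assumes that the new Put carries a newer timestamp than everything already stored and that only the latest version of each column is read.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of reading the latest version of a row after a Put without a Delete. */
final class OverwriteWithoutDelete {
    /**
     * Returns what a "latest version" read would show after newPut is written
     * on top of stored. Columns present in newPut replace the stored values;
     * columns missing from newPut are left untouched (assumption: newPut is newer).
     */
    static Map<String, String> latestView(Map<String, String> stored,
                                          Map<String, String> newPut) {
        Map<String, String> view = new TreeMap<>(stored);
        view.putAll(newPut);
        return view;
    }
}
```

If the stored row and the new Put have the same column names, the view equals the new Put. If the new Put drops a column, the old value of that column is still visible, which is why this strategy only works when the columns do not change between versions.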

Re: Best way to do a clean update of a row

Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!

On Mon, Mar 8, 2010 at 8:45 AM, Ferdy <fe...@kalooga.com> wrote:

> Hey,
>
> Great! That is exactly what I meant. So that implies that firing a Delete
> and a Put right after each other is a pretty bad practice if you want the
> Put to persist. Please note, I only need one version. (All my families are
> VERSIONS => '1'.)
>
> I guess I have the following choice of solutions:
>
> // Solution A: Issue a client-side pause
> htable.delete(delete);
> try {Thread.sleep(10);} catch (InterruptedException e) {}
> htable.put(put);
>
> But wait, the javadoc for Delete states that if no timestamp is specified,
> the SERVER will use the 'now' time. This means that the Delete and the
> Put can still end up with the same timestamp.
>
Not really sure why they would still get the same timestamp if you wait 10
millis on the client; the resolution should be the same on the server, right?


>
> // Solution B: specify timestamps
> long deleteTS = System.currentTimeMillis();
> long putTS = deleteTS+1;
> Delete delete = new Delete(row, deleteTS, null);
> htable.delete(delete);
> Put put = new Put(row);
> put.add(family, column, putTS, value);
> htable.put(put);
>
> How about this solution? I'm guessing the only disadvantage to this one is:
> a row written by a client machine with an incorrectly set system time (let's
> say a few days ahead) cannot be removed by another machine (with a correct
> system time) shortly after, because the deleteTS of the correct client will
> be smaller than the timestamp in the table.
>
This is why it can be tricky to use your own client timestamps, and why
letting the server set the timestamps is the better option.

But it seems like you have a good understanding of the consequences, so good
luck!


>
> Regards,
> Ferdy
>
>
> Erik Holstad wrote:
>
>> Hey Ferdy!
>> Not really sure what you are asking now. But if you do a deleteRow and then
>> a put in the same millisecond, the put will be "shadowed" by the delete so
>> that it will not show up when you look for it later, if that makes sense?
>> The reason for this is that deletes are sorted before puts for the same
>> timestamp, so for a put to be viewable it needs to have a newer timestamp
>> than the delete.
>>
>>
>>
>>
>


-- 
Regards Erik

Re: Best way to do a clean update of a row

Posted by Ferdy <fe...@kalooga.com>.
Hey,

Great! That is exactly what I meant. So that implies that firing a
Delete and a Put right after each other is a pretty bad practice if you
want the Put to persist. Please note, I only need one version. (All my
families are VERSIONS => '1'.)

I guess I have the following choice of solutions:

// Solution A: Issue a client-side pause
htable.delete(delete);
try {Thread.sleep(10);} catch (InterruptedException e) {}
htable.put(put);

But wait, the javadoc for Delete states that if no timestamp is
specified, the SERVER will use the 'now' time. This means that the
Delete and the Put can still end up with the same timestamp.

// Solution B: specify timestamps
long deleteTS = System.currentTimeMillis();
long putTS = deleteTS + 1; // strictly newer than the delete, so the put is not shadowed
Delete delete = new Delete(row, deleteTS, null); // null: no row lock
htable.delete(delete);
Put put = new Put(row);
put.add(family, column, putTS, value);
htable.put(put);

How about this solution? I'm guessing the only disadvantage to this one
is: a row written by a client machine with an incorrectly set system time
(let's say a few days ahead) cannot be removed by another machine (with a
correct system time) shortly after, because the deleteTS of the correct
client will be smaller than the timestamp in the table.

Regards,
Ferdy
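
The clock-skew drawback of Solution B comes down to simple arithmetic on timestamps, which the following sketch tries to make concrete. It is plain Java and does not touch HBase: ClockSkewSketch, putTimestampFor and maskedByDelete are hypothetical names, and the rule "a delete at deleteTs masks every cell with a timestamp less than or equal to deleteTs" is taken from the explanation in this thread.

```java
/** Timestamp arithmetic behind Solution B; an illustration, not HBase code. */
final class ClockSkewSketch {
    /** Solution B stamps the put one millisecond after the delete. */
    static long putTimestampFor(long deleteTs) {
        return deleteTs + 1;
    }

    /** A delete at deleteTs masks every cell whose timestamp is <= deleteTs. */
    static boolean maskedByDelete(long cellTs, long deleteTs) {
        return cellTs <= deleteTs;
    }
}
```

As a usage example, suppose one client's clock runs two days ahead and it writes with putTimestampFor(skewedNow). A second client with a correct clock that deletes a minute later uses a deleteTs far below that cell timestamp, so maskedByDelete returns false and the stale cell stays visible. With both clocks correct, the same delete masks the earlier put as intended.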

Erik Holstad wrote:
> Hey Ferdy!
> Not really sure what you are asking now. But if you do a deleteRow and then
> a put in the same millisecond, the put will be "shadowed" by the delete so
> that it will not show up when you look for it later, if that makes sense?
> The reason for this is that deletes are sorted before puts for the same
> timestamp, so for a put to be viewable it needs to have a newer timestamp
> than the delete.
>
>
>   

Re: Best way to do a clean update of a row

Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!
Not really sure what you are asking now. But if you do a deleteRow and then
a put in the same millisecond, the put will be "shadowed" by the delete so
that it will not show up when you look for it later, if that makes sense?
The reason for this is that deletes are sorted before puts for the same
timestamp, so for a put to be viewable it needs to have a newer timestamp
than the delete.


-- 
Regards Erik
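
The shadowing rule described in this message can be sketched with a toy in-memory model. This is plain Java (8 or later), not the HBase API: the class ShadowedRowModel and its methods are hypothetical names made up for illustration, and the model assumes one version per column and a single row-level delete marker.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of one row; it is NOT how HBase is implemented. */
final class ShadowedRowModel {
    // Column name -> timestamp of the newest stored cell for that column.
    private final Map<String, Long> cellTimestamps = new HashMap<>();
    // Timestamp of the newest row-level delete marker, or Long.MIN_VALUE if none.
    private long deleteMarkerTs = Long.MIN_VALUE;

    /** Records a put; only the newest timestamp per column is kept. */
    void put(String column, long ts) {
        cellTimestamps.merge(column, ts, Long::max);
    }

    /** Records a row-level delete at the given timestamp. */
    void deleteRow(long ts) {
        deleteMarkerTs = Math.max(deleteMarkerTs, ts);
    }

    /** A cell is readable only if it is strictly newer than the delete marker. */
    boolean isVisible(String column) {
        Long ts = cellTimestamps.get(column);
        return ts != null && ts > deleteMarkerTs;
    }
}
```

In this model, deleteRow(1000) followed by put("a", 1000) leaves isVisible("a") false, because the put has the same timestamp as the delete. A put at 1001 makes the column visible again.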

Re: Best way to do a clean update of a row

Posted by Ferdy <fe...@kalooga.com>.
Hey Erik,

Thanks for replying.

Do you mean a delete and a put in the same milli? Otherwise I don't
think I fully understand what you're saying.

Ferdy.



Erik Holstad wrote:
> Hey Ferdy!
> There has been a lot of talk about this lately. HBase has a resolution of
> milliseconds, so if you do a put and a get in the same milli the put will
> not be shown. There are a couple of solutions to this problem: waiting one
> millisecond with the put, setting the timestamps yourself, or doing some
> kind of swap between two rows.
>
> Erik
>
>   

Re: Best way to do a clean update of a row

Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!
There has been a lot of talk about this lately. HBase has a resolution of
milliseconds, so if you do a put and a get in the same milli the put will
not be shown. There are a couple of solutions to this problem: waiting one
millisecond with the put, setting the timestamps yourself, or doing some
kind of swap between two rows.

Erik
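
For the "setting the timestamps yourself" option, one possible client-side helper is a clock that never returns the same millisecond twice, so that a delete and the put that follows it always get different stamps. The sketch below is plain Java (8 or later) and makes no HBase calls; StrictlyIncreasingClock is a hypothetical name, and the helper only orders operations issued through the same instance in one JVM. It does nothing about clock skew between different client machines.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hands out strictly increasing millisecond timestamps within one JVM. */
final class StrictlyIncreasingClock {
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    /** Returns the wall-clock time, bumped forward if it would repeat or go back. */
    long next() {
        return next(System.currentTimeMillis());
    }

    /** Same as next(), but takes the wall-clock reading as a parameter for testing. */
    long next(long wallClockMillis) {
        // Take the wall clock unless it is not newer than the previous stamp.
        return last.updateAndGet(prev -> Math.max(wallClockMillis, prev + 1));
    }
}
```

A caller would take one value for the delete and the next value for the put. Even if both calls happen in the same millisecond, the put's stamp ends up one higher than the delete's.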