Posted to user@hbase.apache.org by Ferdy <fe...@kalooga.com> on 2010/03/08 13:53:50 UTC
Best way to do a clean update of a row
Hi,
Sometimes we wish to do a clean update of a row, that is: Make sure any
old column values are removed that are not in the new Put.
This is how we're doing this now (hbase 0.20.3):
// delRow and putRow address the same row, but the row may currently
// contain columns that are not redefined in putRow
HTable htable = new HTable("tablename");
htable.delete(delRow);
htable.put(putRow);
We just call these sequentially (single-threaded). However, could it be
possible that the delete is issued somehow AFTER the put? The htable
object has default settings (in other words there is no ). The reason
why I'm asking is that we are probably experiencing missing row issues.
If so, is there a better way to update a row while discarding old
column values?
Regards,
Ferdy
Re: Best way to do a clean update of a row
Posted by Ferdy <fe...@kalooga.com>.
Hey,
The column names do indeed change between versions. For now I will adopt
solution B and accept that in very rare cases old columns may not be
deleted (which could happen when a client whose clock runs ahead does a
put). That shouldn't occur often, since our system times are pretty
accurate and updates will not happen more frequently than once every
hour.
Ferdy
Jonathan Gray wrote:
> Ferdy,
>
> Another strategy might be to not issue the delete and just insert a new
> version on top of the old one.
>
> Whether this makes sense or not depends on whether the columns for that row
> change between versions. If it's always the same columns then you can just
> re-insert and when you grab the latest version you will only see the new
> one. If they change, you would need to follow one of your other strategies.
>
> I would probably not use solution A just because there's not really a need
> to introduce a client-side pause. I would opt for grabbing now() and
> incrementing the Put stamp by 1.
>
> This issue is currently under discussion and we'd really like to get this
> kind of unexpected (but understandable) behavior to be a little more user
> friendly so that if you put after a delete you would actually see it.
> There's no estimated time for it but until then you can try the workarounds.
>
> JG
>
> -----Original Message-----
> From: Erik Holstad [mailto:erikholstad@gmail.com]
> Sent: Monday, March 08, 2010 8:58 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Best way to do a clean update of a row
>
> Hey Ferdy!
>
> On Mon, Mar 8, 2010 at 8:45 AM, Ferdy <fe...@kalooga.com> wrote:
>
>
>> Hey,
>>
>> Great! That is exactly what I meant. So that implies that firing a Delete
>> and a Put right after eachother is a pretty bad practise, if you want the
>> Put to persist. Please note, I only need one version. (All my families are
>> VERSIONS => '1') .
>>
>> I guess I have the following choice of solutions:
>>
>> // Solution A: Issue a client-side pause
>> htable.delete(delete);
>> try {Thread.sleep(10);} catch (InterruptedException e) {}
>> htable.put(put);
>>
>> But wait, the javadoc for Delete states that if no timestamp is specified,
>> the SERVER will use the 'now' time. This means that if the Delete and the
>> Put can still be determined to have the same timestamp.
>>
>>
> Not, really sure why they would still get the same timestamp if you wait 10
> millis on the client, should be the same resolution on the server, right?
>
>
>
>> // Solution B: specify timestamps
>> long deleteTS = System.currentTimeMillis();
>> long putTS = deleteTS+1;
>> Delete delete = new Delete(row, deleteTS, null);
>> htable.delete(delete);
>> Put put = new Put(row);
>> put.add(family, column, putTS, value);
>> htable.put(put);
>>
>> How about this solution? I'm guessing the only disadvantage to this one is:
>
>> A client machine with an incorrectly set systemtime (let's say a few days
>> ahead) will not be able to be removed by another machine (with a correct
>> systemtime) shortly after, because the deleteTS of the correct client will
>> be smaller than the timestamp in the table.
>>
>>
> This is the reason that it might be tricky to use your own client timestamp
> and makes server setting of timestamps a better option.
>
> But is seems like you have a good understanding of the consequences, so good
> luck!
>
>
>
>> Regards,
>> Ferdy
>>
>>
>> Erik Holstad wrote:
>>
>>
>>> Hey Ferdy!
>>> Not really sure what you are asking now. But if you do a deleteRow and
>>> then
>>> a put in the same
>>> milli second the put will be "shadowed" by the delete so that it will not
>>> show up when you look
>>> for later, if that makes sense? The reason for this is that deletes are
>>> sorted before puts for the
>>> same timestamp, so for a put to be viewable it needs to have a newer
>>> timestamp than the delete.
>>>
>>>
>>>
>>>
>>>
>
>
>
RE: Best way to do a clean update of a row
Posted by Jonathan Gray <jl...@streamy.com>.
Ferdy,
Another strategy might be to not issue the delete and just insert a new
version on top of the old one.
Whether this makes sense or not depends on whether the columns for that row
change between versions. If it's always the same columns then you can just
re-insert and when you grab the latest version you will only see the new
one. If they change, you would need to follow one of your other strategies.
I would probably not use solution A just because there's not really a need
to introduce a client-side pause. I would opt for grabbing now() and
incrementing the Put stamp by 1.
This issue is currently under discussion and we'd really like to get this
kind of unexpected (but understandable) behavior to be a little more user
friendly so that if you put after a delete you would actually see it.
There's no estimated time for it but until then you can try the workarounds.
JG
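
To illustrate why re-inserting without a delete only works when the column set stays the same, here is a minimal sketch. It is a toy model in plain Java, not HBase code: the class OverwriteModel and its helpers are hypothetical, and the row is assumed to be a simple map from column name to its newest value, which is what a reader sees with VERSIONS => '1'.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only, not HBase code: a row is a map from column name to its
// newest value, which is all a reader sees with VERSIONS => '1'.
final class OverwriteModel {
    /** Applies a put without a preceding delete. */
    static void put(Map<String, String> row, Map<String, String> cells) {
        row.putAll(cells); // columns missing from 'cells' are left untouched
    }

    /** Builds a cell map from alternating column/value arguments. */
    static Map<String, String> cells(String... kv) {
        Map<String, String> m = new TreeMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> row = new TreeMap<>();
        put(row, cells("f:a", "old", "f:b", "old"));
        put(row, cells("f:a", "new", "f:c", "new"));
        System.out.println(row); // f:b from the first put is still present
    }
}
```

In this model the second put replaces f:a but leaves f:b behind, which is the case where one of the delete-based strategies is still needed.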
-----Original Message-----
From: Erik Holstad [mailto:erikholstad@gmail.com]
Sent: Monday, March 08, 2010 8:58 AM
To: hbase-user@hadoop.apache.org
Subject: Re: Best way to do a clean update of a row
Hey Ferdy!
On Mon, Mar 8, 2010 at 8:45 AM, Ferdy <fe...@kalooga.com> wrote:
> Hey,
>
> Great! That is exactly what I meant. So that implies that firing a Delete
> and a Put right after eachother is a pretty bad practise, if you want the
> Put to persist. Please note, I only need one version. (All my families are
> VERSIONS => '1') .
>
> I guess I have the following choice of solutions:
>
> // Solution A: Issue a client-side pause
> htable.delete(delete);
> try {Thread.sleep(10);} catch (InterruptedException e) {}
> htable.put(put);
>
> But wait, the javadoc for Delete states that if no timestamp is specified,
> the SERVER will use the 'now' time. This means that if the Delete and the
> Put can still be determined to have the same timestamp.
>
Not, really sure why they would still get the same timestamp if you wait 10
millis on the client, should be the same resolution on the server, right?
>
> // Solution B: specify timestamps
> long deleteTS = System.currentTimeMillis();
> long putTS = deleteTS+1;
> Delete delete = new Delete(row, deleteTS, null);
> htable.delete(delete);
> Put put = new Put(row);
> put.add(family, column, putTS, value);
> htable.put(put);
>
> How about this solution? I'm guessing the only disadvantage to this one is:
> A client machine with an incorrectly set systemtime (let's say a few days
> ahead) will not be able to be removed by another machine (with a correct
> systemtime) shortly after, because the deleteTS of the correct client will
> be smaller than the timestamp in the table.
>
This is the reason that it might be tricky to use your own client timestamp
and makes server setting of timestamps a better option.
But is seems like you have a good understanding of the consequences, so good
luck!
>
> Regards,
> Ferdy
>
>
> Erik Holstad wrote:
>
>> Hey Ferdy!
>> Not really sure what you are asking now. But if you do a deleteRow and
>> then
>> a put in the same
>> milli second the put will be "shadowed" by the delete so that it will not
>> show up when you look
>> for later, if that makes sense? The reason for this is that deletes are
>> sorted before puts for the
>> same timestamp, so for a put to be viewable it needs to have a newer
>> timestamp than the delete.
>>
>>
>>
>>
>
--
Regards Erik
Re: Best way to do a clean update of a row
Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!
On Mon, Mar 8, 2010 at 8:45 AM, Ferdy <fe...@kalooga.com> wrote:
> Hey,
>
> Great! That is exactly what I meant. So that implies that firing a Delete
> and a Put right after eachother is a pretty bad practise, if you want the
> Put to persist. Please note, I only need one version. (All my families are
> VERSIONS => '1') .
>
> I guess I have the following choice of solutions:
>
> // Solution A: Issue a client-side pause
> htable.delete(delete);
> try {Thread.sleep(10);} catch (InterruptedException e) {}
> htable.put(put);
>
> But wait, the javadoc for Delete states that if no timestamp is specified,
> the SERVER will use the 'now' time. This means that if the Delete and the
> Put can still be determined to have the same timestamp.
>
Not really sure why they would still get the same timestamp if you wait 10
millis on the client; the resolution should be the same on the server, right?
>
> // Solution B: specify timestamps
> long deleteTS = System.currentTimeMillis();
> long putTS = deleteTS+1;
> Delete delete = new Delete(row, deleteTS, null);
> htable.delete(delete);
> Put put = new Put(row);
> put.add(family, column, putTS, value);
> htable.put(put);
>
> How about this solution? I'm guessing the only disadvantage to this one is:
> A client machine with an incorrectly set systemtime (let's say a few days
> ahead) will not be able to be removed by another machine (with a correct
> systemtime) shortly after, because the deleteTS of the correct client will
> be smaller than the timestamp in the table.
>
This is the reason that it might be tricky to use your own client timestamp,
and it makes server-set timestamps a better option.
But it seems like you have a good understanding of the consequences, so good
luck!
>
> Regards,
> Ferdy
>
>
> Erik Holstad wrote:
>
>> Hey Ferdy!
>> Not really sure what you are asking now. But if you do a deleteRow and
>> then
>> a put in the same
>> milli second the put will be "shadowed" by the delete so that it will not
>> show up when you look
>> for later, if that makes sense? The reason for this is that deletes are
>> sorted before puts for the
>> same timestamp, so for a put to be viewable it needs to have a newer
>> timestamp than the delete.
>>
>>
>>
>>
>
--
Regards Erik
Re: Best way to do a clean update of a row
Posted by Ferdy <fe...@kalooga.com>.
Hey,
Great! That is exactly what I meant. So that implies that firing a
Delete and a Put right after each other is a pretty bad practice, if you
want the Put to persist. Please note, I only need one version. (All my
families are VERSIONS => '1').
I guess I have the following choice of solutions:
// Solution A: Issue a client-side pause
htable.delete(delete);
try { Thread.sleep(10); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
htable.put(put);
But wait, the javadoc for Delete states that if no timestamp is
specified, the SERVER will use the 'now' time. This means that the
Delete and the Put could still end up with the same timestamp.
// Solution B: specify timestamps
long deleteTS = System.currentTimeMillis();
long putTS = deleteTS+1;
Delete delete = new Delete(row, deleteTS, null);
htable.delete(delete);
Put put = new Put(row);
put.add(family, column, putTS, value);
htable.put(put);
How about this solution? I'm guessing the only disadvantage to this one
is: data written by a client machine with an incorrectly set system time
(let's say a few days ahead) cannot be removed by another machine (with a
correct system time) shortly after, because the deleteTS of the correct
client will be smaller than the timestamp in the table.
Regards,
Ferdy
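
The clock-skew disadvantage of Solution B can be sketched in a few lines. This is a toy model in plain Java, not HBase code: the class ClientTimestamps and both of its methods are hypothetical names, and the masking rule (a delete hides cells stamped at or before it) is an assumption taken from the explanation elsewhere in this thread.

```java
// Toy model only, not HBase code. It assumes a row delete stamped T hides
// every cell of that row whose timestamp is <= T.
final class ClientTimestamps {
    /** Returns {deleteTs, putTs} for one clean update, as in Solution B. */
    static long[] stampsFor(long clientNowMillis) {
        return new long[] { clientNowMillis, clientNowMillis + 1 };
    }

    /** Assumed masking rule: true if the delete hides the cell. */
    static boolean masks(long deleteTs, long cellTs) {
        return cellTs <= deleteTs;
    }

    public static void main(String[] args) {
        long trueNow = System.currentTimeMillis();
        long twoDays = 2L * 24 * 60 * 60 * 1000;

        long[] skewed = stampsFor(trueNow + twoDays); // client whose clock runs ahead
        long[] correct = stampsFor(trueNow + 1000);   // correct client, one second later

        // The correct client's delete is older than the skewed client's put,
        // so in this model the stale cell is not hidden.
        System.out.println("stale cell masked: " + masks(correct[0], skewed[1]));
    }
}
```

Within a single client the put is always one millisecond newer than its own delete, so it survives; the problem only shows up across clients whose clocks disagree.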
Erik Holstad wrote:
> Hey Ferdy!
> Not really sure what you are asking now. But if you do a deleteRow and then
> a put in the same
> milli second the put will be "shadowed" by the delete so that it will not
> show up when you look
> for later, if that makes sense? The reason for this is that deletes are
> sorted before puts for the
> same timestamp, so for a put to be viewable it needs to have a newer
> timestamp than the delete.
>
>
>
Re: Best way to do a clean update of a row
Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!
Not really sure what you are asking now. But if you do a deleteRow and then
a put in the same millisecond, the put will be "shadowed" by the delete, so it
will not show up when you look for it later, if that makes sense. The reason
for this is that deletes are sorted before puts for the same timestamp, so for
a put to be viewable it needs to have a newer timestamp than the delete.
--
Regards Erik
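
The shadowing rule described above can be written down as a one-line predicate. This is a toy model in plain Java, not HBase code; the class DeleteShadowModel and the method isVisible are hypothetical names used only for illustration.

```java
// Toy model only, not HBase code. A row-level delete marker stamped T is
// assumed to mask every cell of that row whose timestamp is <= T.
final class DeleteShadowModel {
    /** True if a cell written at putTs survives a row delete stamped deleteTs. */
    static boolean isVisible(long putTs, long deleteTs) {
        return putTs > deleteTs;
    }

    public static void main(String[] args) {
        long deleteTs = System.currentTimeMillis();
        System.out.println(isVisible(deleteTs, deleteTs));     // same millisecond: shadowed
        System.out.println(isVisible(deleteTs + 1, deleteTs)); // one ms newer: visible
    }
}
```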
Re: Best way to do a clean update of a row
Posted by Ferdy <fe...@kalooga.com>.
Hey Erik,
Thanks for replying.
Do you mean a delete and a put in the same milli? Otherwise I don't
think I fully understand what you're saying.
Ferdy.
Erik Holstad wrote:
> Hey Ferdy!
> There has been a lot of talk about this lately. HBase has a resolution of
> milli seconds so
> if you do a put and a get in the same milli the put will not be shown.
> There are a couple of solutions to this problem. Waiting one milli second
> with the put,
> setting the timestamps yourself or doing some kinda of swap between two
> rows.
>
> Erik
>
>
Re: Best way to do a clean update of a row
Posted by Erik Holstad <er...@gmail.com>.
Hey Ferdy!
There has been a lot of talk about this lately. HBase has a resolution of
milliseconds, so if you do a put and a get in the same milli the put will not
be shown. There are a couple of solutions to this problem: waiting one
millisecond before the put, setting the timestamps yourself, or doing some
kind of swap between two rows.
Erik
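
For the "setting the timestamps yourself" option, one possible approach is a small helper that never hands out the same millisecond twice. This is a minimal sketch in plain Java with no HBase dependency; MonotonicStamp is a hypothetical class, and it only guarantees ordering within a single client process, not across machines.

```java
// Hypothetical helper, not part of HBase: hands out strictly increasing
// millisecond timestamps within one client process.
final class MonotonicStamp {
    private long last = Long.MIN_VALUE;

    /** Returns a timestamp strictly greater than any returned before. */
    synchronized long next(long nowMillis) {
        last = Math.max(nowMillis, last + 1);
        return last;
    }

    public static void main(String[] args) {
        MonotonicStamp stamps = new MonotonicStamp();
        long now = System.currentTimeMillis();
        long deleteTs = stamps.next(now);
        long putTs = stamps.next(now); // same millisecond, still newer than deleteTs
        System.out.println(putTs > deleteTs);
    }
}
```

The two values would then be passed as the explicit timestamps of the delete and the put, so no client-side sleep is needed.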