Posted to user@hbase.apache.org by "Ryan J. McDonough" <ry...@damnhandy.com> on 2009/06/02 03:10:16 UTC

Clarifying the role of HBase Versions

I'm trying to get some clarity on the role of versions in HBase. Our
table design is such that an object can have multiple property
values for a given property name. For example, we could have a
nickname property that a given person is known by. In the current
setup, if a person has 3 nicknames, only the last one gets stored. We
have considered using the column versions as an added data dimension,
but that just doesn't feel quite right. Columns have a limit on how
many versions they can store; granted, it's quite large, but it's
still a limit nonetheless.

From what I gather from reading the BigTable doc, versions
could be considered a form of optimistic locking so that concurrent
writes don't conflict. Is that understanding correct? If not, is using
versions as an added data dimension a good idea?

Ryan-

Re: Clarifying the role of HBase Versions

Posted by Jonathan Gray <jl...@streamy.com>.

I don't see anything inherently wrong with your design.

On Tue, June 2, 2009 4:16 am, Ryan J. McDonough wrote:
>

> On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:
>
>
>> Ryan,
>>
>>
>> You are currently only storing the latest nickname, not all 3?  I'm
>> trying to understand your use case exactly.
>
> Yes, the multiple values are being stored, in fact far more than 3.
> We've defined the tables to use the max number of versions. We
> currently can store something to the effect of:
>
> user123=>props:nickname:1243940086:Ryan
> user123=>props:nickname:1243940087:Ryan McDonough
> user123=>props:nickname:1243940088:Some guy asking questions
> user123=>props:nickname:1243940089:Ryan
> user123=>props:nickname:1243940090:Ryan
> user123=>props:nickname:1243940091:
> user123=>props:nickname:1243940092:Ryan McDonough
>
>
> Where "props" is the column family. One thing that is challenging is
> that because the versions are keyed by timestamp, you don't have a
> mechanism to handle duplicate values, so it's possible to have the same
> value repeated multiple times. Also, you don't have insight into whether
> a value was the result of an insert, an accidental dupe, or a
> deletion. Additionally, we can only evaluate a row filter against the most
> recent column value, but IIRC, that's fixed in 0.20.
>
>>
>> Whether you want to use versions or not depends on what you want to do
>> with these multiple values.
>>
>> Versions are intended for versioning, as in, multiple values for the
>> same column that are timestamped and sorted with most recent first.
>
> Yes, I understand that part. But what I'm trying to clarify is why
> versions are keyed only by timestamp and not by another arbitrary value.
> As I mentioned in my initial question, I'm starting to see
> versions as a means of providing optimistic locking. To quote
> the BigTable paper:
>
> "Applications that need to avoid collisions must generate unique
> timestamps themselves. Different versions of a cell are stored in
> decreasing timestamp order, so that the most recent versions can be read
> first.  To make the management of versioned data less onerous, we support
> two per-column-family settings that tell Bigtable to garbage-collect cell
> versions automatically. The client can specify either that only the last n
> versions of a cell be kept, or that only new-enough versions be kept
> (e.g., only keep values that were written in
> the last seven days)."
>
> With that said, I'm just trying to get some clarity on how HBase
> utilizes versions internally and whether there's any chance of seeing
> unintended consequences from using versions for something other than
> versioning. For example, does having multiple versions add
> overhead at compaction time or when region splits occur?
>
> To put it another way: based on my current understanding of HBase
> versions, I could equate it to using an audit schema in an RDBMS to join
> multiple values. While it's possible, it's not what you'd use an audit
> schema for.
>
>> It seems from what you said that versions will work nicely.  With
>> the new API in the upcoming 0.20, there is much better support dealing
>> with multiple versions.
>
> Yes, it does work quite nicely; however, I just feel like something's
> wrong with our design. Thanks for the response.
>
> Ryan-

Re: Clarifying the role of HBase Versions

Posted by Erik Holstad <er...@gmail.com>.

Hi Ryan!

In previous versions of HBase, the versions returned by a query were
stored in a TreeMap, which added complexity and made the query somewhat
slower. With 0.20, data returned to the client is just an array of KeyValues,
which is the new storage format. When it comes to splits, regions are
split by rows, so it doesn't really matter whether you have many qualifiers or
versions in that case. When it comes to compactions, there should be no
difference either compared to qualifiers.

If we didn't use the timestamp as the major key for versions, what do you have
in mind? You can set your own timestamp client-side if you wish, but I must
warn you that this might give you unexpected results if you don't fully
understand how to use this feature.
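
To illustrate the kind of surprise a client-set timestamp can cause, here
is a minimal sketch in plain Python. The Cell class, its method names, and
the timestamps 100 and 200 are hypothetical and are not the HBase client
API; the sketch only assumes that reads return the version with the highest
timestamp, and that a second write at the same timestamp replaces the first
(an assumption of this toy model).

```python
class Cell:
    """Toy model of one versioned cell; not the HBase client API."""

    def __init__(self):
        # Versions are keyed by timestamp only.
        self._versions = {}

    def put(self, value, timestamp):
        # A write at an existing timestamp replaces that version (model assumption).
        self._versions[timestamp] = value

    def get_latest(self):
        # "Latest" means highest timestamp, not most recently written.
        ts = max(self._versions)
        return ts, self._versions[ts]

    def version_count(self):
        return len(self._versions)


cell = Cell()
cell.put("Ryan", timestamp=200)
# Written second, but with an older client-supplied timestamp.
cell.put("Ryan McDonough", timestamp=100)
latest = cell.get_latest()  # still the timestamp-200 value

# Reusing timestamp 200 silently replaces the earlier value in this model.
cell.put("Some guy", timestamp=200)
```

In this model the second write never becomes the value a plain read returns,
and the third write leaves no trace of the value it replaced.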

Regards Erik

Re: Clarifying the role of HBase Versions

Posted by "Ryan J. McDonough" <ry...@damnhandy.com>.

On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:

> Ryan,
>
> You are currently only storing the latest nickname, not all 3?  I'm  
> trying
> to understand your use case exactly.

Yes, the multiple values are being stored, in fact far more than 3.  
We've defined the tables to use the max number of versions. We  
currently can store something to the effect of:

user123=>props:nickname:1243940086:Ryan
user123=>props:nickname:1243940087:Ryan McDonough
user123=>props:nickname:1243940088:Some guy asking questions
user123=>props:nickname:1243940089:Ryan
user123=>props:nickname:1243940090:Ryan
user123=>props:nickname:1243940091:
user123=>props:nickname:1243940092:Ryan McDonough

Where "props" is the column family. One thing that is challenging is
that because the versions are keyed by timestamp, you don't have a
mechanism to handle duplicate values, so it's possible to have the
same value repeated multiple times. Also, you don't have insight into
whether a value was the result of an insert, an accidental dupe, or a
deletion. Additionally, we can only evaluate a row filter against the
most recent column value, but IIRC, that's fixed in 0.20.
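
The duplicate-value issue can be sketched in plain Python using the example
rows above. This is a toy model, not HBase code: the dict stands in for a
single cell, and treating the empty value as "no nickname" when
de-duplicating is an assumption made only for this illustration.

```python
# The example writes for user123 / props:nickname, as (timestamp, value).
writes = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]

# Toy model of one cell: versions are keyed by timestamp only.
cell = {}
for ts, value in writes:
    cell[ts] = value

# Versions are read back newest first.
newest_first = sorted(cell.items(), reverse=True)

# Every write is kept, so repeated values appear as separate versions.
# Treating the versions as a set of nicknames needs a client-side pass;
# dict.fromkeys drops repeats while preserving newest-first order.
distinct = list(dict.fromkeys(v for _, v in newest_first if v))
```

All seven versions survive in the cell, while the client-side pass reduces
them to three distinct nicknames.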

>
> Whether you want to use versions or not depends on what you want to do
> with these multiple values.
>
> Versions are intended for versioning, as in, multiple values for the  
> same
> column that are timestamped and sorted with most recent first.

Yes, I understand that part. But what I'm trying to clarify is why
versions are keyed only by timestamp and not by another arbitrary
value. As I mentioned in my initial question, I'm starting to see
versions as a means of providing optimistic locking. To
quote the BigTable paper:

"Applications that need to avoid collisions must generate unique  
timestamps themselves. Different versions of a cell are stored in  
decreasing timestamp order, so that the most recent versions can be  
read first.  To make the management of versioned data less onerous, we
support two per-column-family settings that tell Bigtable to
garbage-collect cell versions automatically. The client can specify either
that only the last n versions of a cell be kept, or that only
new-enough versions be kept (e.g., only keep values that were written in
the last seven days)."
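
The two settings in the quoted passage can be sketched as follows. This is
plain Python under stated assumptions, not Bigtable or HBase code: the
function names keep_last_n and keep_newer_than, the sample timestamps, and
the "now" value are all hypothetical.

```python
def keep_last_n(versions, n):
    """Keep only the n newest (timestamp, value) pairs, newest first."""
    return sorted(versions, reverse=True)[:n]


def keep_newer_than(versions, now, max_age_seconds):
    """Keep only versions written within max_age_seconds of now, newest first."""
    cutoff = now - max_age_seconds
    return sorted(((ts, v) for ts, v in versions if ts >= cutoff), reverse=True)


versions = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
]

# Setting 1: only the last n versions of a cell are kept.
last_two = keep_last_n(versions, 2)

# Setting 2: only new-enough versions are kept (here, at most 2 seconds old).
recent = keep_newer_than(versions, now=1243940090, max_age_seconds=2)
```

Either setting discards older versions automatically, which is why values
stored as versions are subject to a limit.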

With that said, I'm just trying to get some clarity on how HBase
utilizes versions internally and whether there's any chance of seeing
unintended consequences from using versions for something other than
versioning. For example, does having multiple versions add
overhead at compaction time or when region splits occur?

To put it another way: based on my current understanding of HBase
versions, I could equate it to using an audit schema in an RDBMS to
join multiple values. While it's possible, it's not what you'd use an
audit schema for.

> It seems from what you said that versions will work nicely.  With  
> the new
> API in the upcoming 0.20, there is much better support dealing with
> multiple versions.

Yes, it does work quite nicely; however, I just feel like something's
wrong with our design. Thanks for the response.

Ryan-


Re: Clarifying the role of HBase Versions

Posted by Jonathan Gray <jl...@streamy.com>.

Ryan,

You are currently only storing the latest nickname, not all 3?  I'm trying
to understand your use case exactly.

Whether you want to use versions or not depends on what you want to do
with these multiple values.

Versions are intended for versioning, as in, multiple values for the same
column that are timestamped and sorted with most recent first.

It seems from what you said that versions will work nicely.  With the new
API in the upcoming 0.20, there is much better support dealing with
multiple versions.

JG
