Posted to user@hbase.apache.org by Mark <st...@gmail.com> on 2011/08/17 17:53:01 UTC

Versioning

I'm trying to fully understand everything HBase has to offer, but I can't
determine a valid use case for multiple versions. Can someone please
explain some real-life use cases for this?

Also, at what point are there "too many versions"? For example, to store
all the queries a user has performed, couldn't we create a column family
with max versions set to something really high (1M)? Using this method,
we could then ask for the last X queries by setting the max versions to
X. It seems this could also be accomplished by creating a separate row
for each query, but I'm not sure why one strategy would be better than
the other.

Please help me understand. Thanks!

Re: Versioning

Posted by Doug Meil <do...@explorysmedical.com>.
Good observation, Bill. I'll add it.



On 8/26/11 12:27 PM, "Bill Graham" <bi...@gmail.com> wrote:

>This issue is a common pitfall for those new to HBase, and I think it would
>be a good thing to have in the HBase book. Once someone realizes that you
>can store multiple values for the same cell, each with a timestamp, there
>can be a natural tendency to think "hey, I can store a one-to-many using
>multiple versions of a cell". That's not the intent of versioned cell
>values.
>
>Versioned cell values can be thought of as a way to keep a history of
>changes for a single entity that has only one value at any given time,
>like keeping track of a state change over time. For a one-to-many
>relationship (e.g., a user with many events), favor either multiple rows
>or multiple columns instead.
>
>Bill


Re: Versioning

Posted by Bill Graham <bi...@gmail.com>.
This issue is a common pitfall for those new to HBase, and I think it would
be a good thing to have in the HBase book. Once someone realizes that you
can store multiple values for the same cell, each with a timestamp, there
can be a natural tendency to think "hey, I can store a one-to-many using
multiple versions of a cell". That's not the intent of versioned cell
values.

Versioned cell values can be thought of as a way to keep a history of
changes for a single entity that has only one value at any given time,
like keeping track of a state change over time. For a one-to-many
relationship (e.g., a user with many events), favor either multiple rows
or multiple columns instead.
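
To make the multiple-rows option concrete, here is a minimal sketch in plain
Python. It only imitates a sorted key space and is not the HBase client API.
The "<user>#<reversed timestamp>" key layout and the names make_row_key,
last_queries and MAX_TS are assumptions made up for this illustration.

```python
# Hypothetical sketch: one row per user query, with the newest row sorting first.
# A plain dict stands in for a table; none of these names are HBase APIs.

MAX_TS = 2**63 - 1  # largest signed 64-bit value, used to reverse the timestamp


def make_row_key(user_id: str, ts_millis: int) -> str:
    """Build '<user>#<reversed timestamp>' so newer events sort first."""
    reversed_ts = MAX_TS - ts_millis
    # Zero-pad to 19 digits so lexicographic order matches numeric order.
    return f"{user_id}#{reversed_ts:019d}"


def last_queries(table: dict, user_id: str, limit: int) -> list:
    """Emulate a prefix scan: walk keys in sorted order and stop after `limit`."""
    prefix = f"{user_id}#"
    matching = sorted(key for key in table if key.startswith(prefix))
    return [table[key] for key in matching[:limit]]


table = {}
for ts, query in [(1000, "hbase versions"), (2000, "hbase ttl"), (3000, "hbase scan")]:
    table[make_row_key("mark", ts)] = query
table[make_row_key("sean", 2500)] = "facebook messages"

print(last_queries(table, "mark", 2))
```

With this layout, "the last X queries" is a short scan over one user's key
prefix, and old events can be deleted row by row without touching the rest.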

Bill


On Fri, Aug 26, 2011 at 9:16 AM, Buttler, David <bu...@llnl.gov> wrote:

> Physically, you will be storing the same data.  HBase stores everything as
> key-value pairs, and each cell is identified by "row key, column family,
> column qualifier, timestamp".
>
> However, storing items in different rows makes it more convenient to query
> and delete old values.  By default, you only get the most recent version of
> a column during a scan.
>
> One way to think about it: versions are for when you don't want to forget
> previous values but typically only want the most recent one.  If you want
> to access old versions continuously, you would be better off putting them
> in separate rows.
>
> Dave

RE: Versioning

Posted by "Buttler, David" <bu...@llnl.gov>.
Physically, you will be storing the same data.  HBase stores everything as key-value pairs, and each cell is identified by "row key, column family, column qualifier, timestamp".

However, storing items in different rows makes it more convenient to query and delete old values.  By default, you only get the most recent version of a column during a scan.

One way to think about it: versions are for when you don't want to forget previous values but typically only want the most recent one.  If you want to access old versions continuously, you would be better off putting them in separate rows.
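
A toy model may help picture the cell coordinates and the "newest version by default" behaviour. This is a sketch in plain Python, not the HBase client API: the class ToyStore, its put/get methods and the max_versions argument are names invented here, and the model trims old versions eagerly for simplicity, which is an assumption rather than a description of when HBase actually discards them.

```python
# Toy in-memory model of versioned cells. All names are hypothetical.


class ToyStore:
    def __init__(self, max_versions: int = 3):
        # Assumed to be >= 1 in this sketch.
        self.max_versions = max_versions
        # (row, family, qualifier) -> {timestamp: value}
        self.cells = {}

    def put(self, row, family, qualifier, timestamp, value):
        versions = self.cells.setdefault((row, family, qualifier), {})
        # Writing the same timestamp again replaces that version.
        versions[timestamp] = value
        # Keep only the newest `max_versions` timestamps (trimmed eagerly here).
        for old_ts in sorted(versions)[:-self.max_versions]:
            del versions[old_ts]

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        stored = self.cells.get((row, family, qualifier), {})
        return sorted(stored.items(), reverse=True)[:versions]


store = ToyStore(max_versions=3)
for ts, state in [(1, "new"), (2, "paid"), (3, "shipped"), (4, "delivered")]:
    store.put("order1", "d", "status", ts, state)

print(store.get("order1", "d", "status"))              # newest version only
print(store.get("order1", "d", "status", versions=3))  # recent history
```

The single "status" cell only ever has one current value, and the older versions are its history. That is the case versions fit; a list of unrelated events does not look like this.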

Dave

-----Original Message-----
From: Sheng Chen [mailto:chensheng2010@gmail.com] 
Sent: Friday, August 26, 2011 1:38 AM
To: user@hbase.apache.org
Subject: Re: Versioning

Hi, I just saw your recent update to the HBase book on the version number
question, and I'm also confused about it.
As the book says (HBASE-4251), setting the number of versions to an
exceedingly high level (e.g., hundreds or more) is not recommended unless
those old values are very dear to you, because this will greatly increase
StoreFile size.

But sometimes we do need to save multiple versions of values, such as
logging events or Facebook messages. In these cases, what is the trade-off
between saving them in different rows and in different versions of one
row?

Thank you.
Sean



RE: Versioning

Posted by Michael Segel <mi...@hotmail.com>.
Sean,
You wrote the following:
> But sometimes we do need to save multiple versions of values, such as
> logging events or Facebook messages. In these cases, what is the trade-off
> between saving them in different rows and in different versions of one
> row?

You're not updating logging events, so why would you consider versioning? Each log event is unique, so you'd store them as separate rows.
Think of versioning as allowing one to roll back an update in a transactional system. (Note: HBase doesn't have transactions or 'updates'. I'm just trying to translate the concept.)

HTH

-Mike



Re: Versioning

Posted by Sheng Chen <ch...@gmail.com>.
Hi, I just saw your recent update to the HBase book on the version number
question, and I'm also confused about it.
As the book says (HBASE-4251), setting the number of versions to an
exceedingly high level (e.g., hundreds or more) is not recommended unless
those old values are very dear to you, because this will greatly increase
StoreFile size.

But sometimes we do need to save multiple versions of values, such as
logging events or Facebook messages. In these cases, what is the trade-off
between saving them in different rows and in different versions of one
row?

Thank you.
Sean


2011/8/18 Doug Meil <do...@explorysmedical.com>

>
> Versioning can be used to see the previous state of a record.  Some people
> need this feature, others don't.
>
> One thing that may be worth a review is this...
>
> http://hbase.apache.org/book.html#keysize
>
> ... and specifically the fact about all the values being freighted with
> timestamp (aka version) too.  I don't know your use case, and I'm not sure
> I have the time to understand it, but 1 million versions seems like a lot.
>  You're going to use a lot of space doing that.

Re: Versioning

Posted by Doug Meil <do...@explorysmedical.com>.
Versioning can be used to see the previous state of a record.  Some people
need this feature, others don't.

One thing that may be worth a review is this...

http://hbase.apache.org/book.html#keysize

... and specifically the fact about all the values being freighted with
timestamp (aka version) too.  I don't know your use case, and I'm not sure
I have the time to understand it, but 1 million versions seems like a lot.
You're going to use a lot of space doing that.
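
For a rough feel of the space cost, here is a back-of-the-envelope sketch in
Python. It assumes a per-cell layout of key length (4 bytes), value length
(4), row length (2), family length (1), timestamp (8) and key type (1) around
the row, family, qualifier and value bytes. That layout, the helper name
cell_size and the sample names are assumptions for illustration; the estimate
ignores compression, block encoding and index overhead.

```python
# Rough estimate of stored bytes when every version repeats the full cell key.
# The fixed-overhead figure below is an assumed layout, not a measured number.

FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1  # lengths, timestamp and key type


def cell_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate bytes for one stored version of one cell."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


per_cell = cell_size(b"user12345", b"q", b"query", b"hbase versioning")
total = per_cell * 1_000_000          # one million versions of the same cell
value_share = len(b"hbase versioning") / per_cell

print(per_cell, "bytes per version")
print(total, "bytes for 1M versions")
print(f"{value_share:.0%} of each version is the actual value")
```

Under these assumptions most of each stored version is key material rather
than data, which is why short row, family and qualifier names matter more as
the number of versions grows.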




On 8/17/11 11:53 AM, "Mark" <st...@gmail.com> wrote:

>I'm trying to fully understand everything HBase has to offer, but I can't
>determine a valid use case for multiple versions. Can someone please
>explain some real-life use cases for this?
>
>Also, at what point are there "too many versions"? For example, to store
>all the queries a user has performed, couldn't we create a column family
>with max versions set to something really high (1M)? Using this method,
>we could then ask for the last X queries by setting the max versions to
>X. It seems this could also be accomplished by creating a separate row
>for each query, but I'm not sure why one strategy would be better than
>the other.
>
>Please help me understand. Thanks!