
Posted to dev@hbase.apache.org by Michael Segel <ms...@hotmail.com> on 2013/12/08 18:35:53 UTC

Purpose of versions in HBase...

Hi, 

In a different thread, we were discussing good and better schema designs. 
In order to really understand why one should or should not do something, it's kind of important to understand the underlying reasons why HBase was designed the way it was.

So since we have a bunch of committers here, and cc'ing the Dev list, 

I'd like to explore why HBase has cell versioning. What is its purpose? How is it implemented, and why?

This may seem a bit esoteric, but it would go a long way in educating many of the users on the hbase mailing list. 

Also it may be a good couple of paragraphs to add to the online reference...

-Mike

Re: Purpose of versions in HBase...

Posted by Stack <st...@duboce.net>.

On Mon, Dec 9, 2013 at 3:17 PM, Michael Segel <mi...@hotmail.com> wrote:

> I believe there's a bit more to it...
>
>
Such as?



> Which is why I am asking.
>


> As to #3... What happens to a column when you put a tombstone marker on it?
>
>
We have this in the doc.  If it does not answer your question, let's fix it.

Thanks Michael.
St.Ack



> On Dec 9, 2013, at 11:56 AM, Sergey Shelukhin <se...@hortonworks.com>
> wrote:
>
> > I suspect the honest answer would be "because BigTable paper had it" :P
> >
> > There are several aspects to cell versioning (I may be missing some).
> > First (not the most important), due to the way HBase stores things
> > (write-once files), it comes very cheaply - very little runtime cost, and
> > not so much code needs to be written to have it.
> > Second, internally, versioning allows for snapshot isolation (within a
> > server) to work - with multiple versions present, scanners can read old
> > ones to get a consistent view (that's MVCC).
> > Third, user-visible, timestamp-based cell versioning is there so that users
> > can control the order of things (e.g. delete all cells before...), either
> > through fabricated timestamps, or using external timestamps, e.g. from
> > external logs. In fact, with the current HBase implementation of auto-ts
> > (there are JIRAs to fix it), that's the only "bulletproof" way to use HBase;
> > internal HBase versioning relies on server clocks, which is fraught with
> > peril (granted, most systems will rarely hit these problems, and may be ok
> > with some reordering anyway).
> > Fourth, multi-versions as such could be used for some application-specific
> > scenarios; the Percolator paper is a good example.
> >
> >
> >
> > On Sun, Dec 8, 2013 at 9:35 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> >
> >>
> >> Hi,
> >>
> >> In a different thread, we were discussing good and better schema designs.
> >> In order to really understand why one should or should not do something,
> >> it's kind of important to understand the underlying reasons why HBase was
> >> designed the way it was.
> >>
> >> So since we have a bunch of committers here, and cc'ing the Dev list,
> >>
> >> I'd like to explore why HBase has cell versioning. What is its
> >> purpose? How is it implemented, and why?
> >>
> >> This may seem a bit esoteric, but it would go a long way in educating many
> >> of the users on the hbase mailing list.
> >>
> >> Also it may be a good couple of paragraphs to add to the online
> >> reference...
> >>
> >> -Mike
> >>
> >>
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com

Re: Purpose of versions in HBase...

Posted by Michael Segel <mi...@hotmail.com>.

I believe there's a bit more to it... 

Which is why I am asking.

As to #3... What happens to a column when you put a tombstone marker on it? 

On Dec 9, 2013, at 11:56 AM, Sergey Shelukhin <se...@hortonworks.com> wrote:

> I suspect the honest answer would be "because BigTable paper had it" :P
> 
> There are several aspects to cell versioning (I may be missing some).
> First (not the most important), due to the way HBase stores things
> (write-once files), it comes very cheaply - very little runtime cost, and
> not so much code needs to be written to have it.
> Second, internally, versioning allows for snapshot isolation (within a
> server) to work - with multiple versions present, scanners can read old
> ones to get a consistent view (that's MVCC).
> Third, user-visible, timestamp-based cell versioning is there so that users
> can control the order of things (e.g. delete all cells before...), either
> through fabricated timestamps, or using external timestamps, e.g. from
> external logs. In fact, with the current HBase implementation of auto-ts
> (there are JIRAs to fix it), that's the only "bulletproof" way to use HBase;
> internal HBase versioning relies on server clocks, which is fraught with
> peril (granted, most systems will rarely hit these problems, and may be ok
> with some reordering anyway).
> Fourth, multi-versions as such could be used for some application-specific
> scenarios; the Percolator paper is a good example.
> 
> 
> 
> On Sun, Dec 8, 2013 at 9:35 AM, Michael Segel <ms...@hotmail.com> wrote:
> 
>> 
>> Hi,
>> 
>> In a different thread, we were discussing good and better schema designs.
>> In order to really understand why one should or should not do something,
>> it's kind of important to understand the underlying reasons why HBase was
>> designed the way it was.
>> 
>> So since we have a bunch of committers here, and cc'ing the Dev list,
>> 
>> I'd like to explore why HBase has cell versioning. What is its
>> purpose? How is it implemented, and why?
>> 
>> This may seem a bit esoteric, but it would go a long way in educating many
>> of the users on the hbase mailing list.
>> 
>> Also it may be a good couple of paragraphs to add to the online
>> reference...
>> 
>> -Mike
>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com

Re: Purpose of versions in HBase...

Posted by Sergey Shelukhin <se...@hortonworks.com>.

I suspect the honest answer would be "because BigTable paper had it" :P

There are several aspects to cell versioning (I may be missing some).
First (not the most important), due to the way HBase stores things
(write-once files), it comes very cheaply - very little runtime cost, and
not so much code needs to be written to have it.
Second, internally, versioning allows for snapshot isolation (within a
server) to work - with multiple versions present, scanners can read old
ones to get a consistent view (that's MVCC).
Third, user-visible, timestamp-based cell versioning is there so that users
can control the order of things (e.g. delete all cells before...), either
through fabricated timestamps, or using external timestamps, e.g. from
external logs. In fact, with the current HBase implementation of auto-ts
(there are JIRAs to fix it), that's the only "bulletproof" way to use HBase;
internal HBase versioning relies on server clocks, which is fraught with
peril (granted, most systems will rarely hit these problems, and may be ok
with some reordering anyway).
Fourth, multi-versions as such could be used for some application-specific
scenarios; the Percolator paper is a good example.
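
To make the third point (and the tombstone question raised elsewhere in this thread) a bit more concrete, here is a rough toy model in Python. It is only a sketch and not HBase code: the ToyColumn class and its put, delete_up_to and get methods are made-up names for illustration, and it only assumes the commonly documented behavior that versions are ordered by timestamp rather than arrival order, and that a delete marker masks every version at or below its timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class ToyColumn:
    """Toy model of one versioned column. Hypothetical, not HBase code."""
    max_versions: int = 3
    puts: dict = field(default_factory=dict)  # timestamp -> value
    delete_before: int = -1                   # highest tombstone timestamp seen

    def put(self, ts, value):
        # Nothing is overwritten in place; each timestamp is its own version.
        self.puts[ts] = value

    def delete_up_to(self, ts):
        # A tombstone masks every version with timestamp <= ts.
        self.delete_before = max(self.delete_before, ts)

    def get(self, n=1):
        # Return up to n visible versions, newest timestamp first.
        visible = sorted((t for t in self.puts if t > self.delete_before),
                         reverse=True)
        return [(t, self.puts[t]) for t in visible[:min(n, self.max_versions)]]


col = ToyColumn()
col.put(100, "a")
col.put(300, "c")
col.put(200, "b")       # arrives last, but carries an earlier timestamp
print(col.get(3))       # [(300, 'c'), (200, 'b'), (100, 'a')]

col.delete_up_to(200)   # "delete all cells before..."
print(col.get(3))       # [(300, 'c')]

col.put(150, "late")    # timestamp at or below the tombstone: stays masked
print(col.get(3))       # [(300, 'c')]
```

The last step is the part that tends to surprise people: in this model, as in the documented HBase behavior, a write whose timestamp is at or below an existing delete marker is not visible to readers. In real HBase that lasts until a major compaction removes the marker; the toy model never compacts.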

On Sun, Dec 8, 2013 at 9:35 AM, Michael Segel <ms...@hotmail.com> wrote:

>
> Hi,
>
> In a different thread, we were discussing good and better schema designs.
> In order to really understand why one should or should not do something,
> it's kind of important to understand the underlying reasons why HBase was
> designed the way it was.
>
> So since we have a bunch of committers here, and cc'ing the Dev list,
>
> I'd like to explore why HBase has cell versioning. What is its
> purpose? How is it implemented, and why?
>
> This may seem a bit esoteric, but it would go a long way in educating many
> of the users on the hbase mailing list.
>
> Also it may be a good couple of paragraphs to add to the online
> reference...
>
> -Mike
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.