Posted to user@hbase.apache.org by Serkan Uzunbaz <uz...@gmail.com> on 2019/03/30 05:13:42 UTC

Difference of n columns with 1 version vs 1 column with n versions

Hi all,
I have a question regarding the difference between storing a set of data as:
*a) n columns with 1 version each*
*b) 1 column with n versions*

Since the storage unit in HBase is a cell (rowkey, column family, column
qualifier, timestamp), is there a difference between the above two storage
options in terms of read/write performance, compaction/GC time, etc.?

I know it is not recommended to use a high number of versions if you do not
really need them. However, if those n versions of data are really needed
for reading, will it cause any problem to store the data in a single
column with n versions? Also, even if max versions is set to 1 for a column
(option a), new values are still stored as a new cell and the old cell is
deleted at compaction time. So the two options look identical to me
compaction-wise.
I wonder if there is anything that makes one option superior to the other.

*Example*: To clarify, say the data to be stored is a set of URLs
visited in certain time ranges, and we want to keep the last 100 hours of
URL sets:

*a) store each hour as a column name with one URL set in it (column names
will be reused cyclically: data for hour 101 will be written into
column 1)*
column_qualifier: value
---------------------------
urls_hour1: <abc.com, xyz.com, ...>
urls_hour2: <urls>
urls_hour3: <urls>
...
urls_hour100: <urls>
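
The cyclic reuse of column names in option a can be sketched in a few lines of plain Python. This is only an illustration of the naming scheme, not HBase client code; the function name `qualifier_for_hour` and the 1-based hour counter are assumptions made for the example.

```python
def qualifier_for_hour(hour: int, window: int = 100) -> str:
    """Map a 1-based hour counter onto one of `window` reusable column names.

    Hypothetical helper: hours 1..100 map to urls_hour1..urls_hour100,
    hour 101 wraps around to urls_hour1, and so on.
    """
    if hour < 1:
        raise ValueError("hour must be >= 1")
    slot = (hour - 1) % window + 1
    return f"urls_hour{slot}"


print(qualifier_for_hour(1))    # urls_hour1
print(qualifier_for_hour(100))  # urls_hour100
print(qualifier_for_hour(101))  # urls_hour1
```

With this scheme the writer, not HBase, decides which old value gets overwritten, so the application has to keep the hour counter consistent across writers.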


*b) store in a single column with 100 versions, one per hour (max
versions for the column will be 100, and HBase will drop older versions
automatically at compaction)*
column_qualifier: value @ timestamp
---------------------------
urls: <abc.com, xyz.com, ...> @ ts_hour1, <urls> @ ts_hour2, <urls> @ ts_hour3, ..., <urls> @ ts_hour100
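
The retention behaviour that option b relies on can be modelled roughly as follows. This is a toy sketch in plain Python, not the HBase implementation: the class `VersionedColumn` and its methods are hypothetical, timestamps are simplified to hour numbers, and details such as TTL, minimum versions, and delete markers are ignored.

```python
class VersionedColumn:
    """Toy stand-in for one column that retains at most `max_versions` cells."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self.cells = {}  # timestamp -> value

    def put(self, timestamp: int, value: str) -> None:
        # A write never edits an older version; it adds a cell at a new timestamp.
        # (Writing the same timestamp twice replaces that one entry in this model.)
        self.cells[timestamp] = value

    def get_versions(self, limit=None):
        # Reads return newest first and never expose more than max_versions.
        newest = sorted(self.cells, reverse=True)[: self.max_versions]
        if limit is not None:
            newest = newest[:limit]
        return [(ts, self.cells[ts]) for ts in newest]

    def compact(self) -> None:
        # Compaction physically drops everything beyond the newest max_versions.
        newest = sorted(self.cells, reverse=True)[: self.max_versions]
        self.cells = {ts: self.cells[ts] for ts in newest}


column = VersionedColumn(max_versions=100)
for hour in range(1, 106):          # 105 hourly writes
    column.put(hour, f"<urls {hour}>")

print(len(column.cells))            # 105 cells stored before compaction
column.compact()
print(len(column.cells))            # 100 cells after compaction
print(min(column.cells))            # 6, the oldest hour still retained
```

In this model the application only writes with increasing timestamps and lets the version limit decide what expires, which is the main practical difference from the cyclic column names of option a.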

Thanks,
-Serkan

Re: Difference of n columns with 1 version vs 1 column with n versions

Posted by Serkan Uzunbaz <uz...@gmail.com>.
@Jean-Marc, could you please clarify what you mean by "pagination over
versions", and why versions can cause it but multiple columns cannot?

I understand that a tall table design is different from a wide table design.
However, among the wide table design options, what is the internal
difference between 1 column with n versions and n columns with 1 version
each? My understanding is that the storage unit in HBase is a cell,
which is defined by `rowkey:cf:cq:timestamp`, so internally multiple
versions (1 column) are the same as multiple columns (1 version).
Also, even in the one-version case HBase does not update in place: it
appends a new version of the cell, and the old version is removed at major
compaction. So compaction-wise they also look the same to me. Is there
anything I am missing?
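
The reasoning above can be made concrete with a small plain-Python model of cells. It is a sketch under simplifying assumptions, not HBase code: cells are tuples, the row key `user1` and family `cf` are made up, timestamps are hour numbers, and the sort key only mirrors the row, family, qualifier, newest-timestamp-first ordering that HBase uses for cells.

```python
from itertools import groupby


def sort_key(cell):
    # Order by row, family, qualifier, then newest timestamp first.
    row, family, qualifier, timestamp, _value = cell
    return (row, family, qualifier, -timestamp)


def compact(cells, max_versions):
    """Keep only the newest `max_versions` cells of every column (toy model)."""
    kept = []
    ordered = sorted(cells, key=sort_key)
    for _column, versions in groupby(ordered, key=lambda c: c[:3]):
        kept.extend(list(versions)[:max_versions])
    return kept


# Option a: 100 qualifiers, one cell each. Option b: one qualifier, 100 cells.
wide = [("user1", "cf", f"urls_hour{h}", h, f"<urls {h}>") for h in range(1, 101)]
deep = [("user1", "cf", "urls", h, f"<urls {h}>") for h in range(1, 101)]

# Hour 101: option a reuses urls_hour1, option b just adds another version.
wide.append(("user1", "cf", "urls_hour1", 101, "<urls 101>"))
deep.append(("user1", "cf", "urls", 101, "<urls 101>"))
print(len(wide), len(deep))        # 101 101: both layouts append one cell

wide = compact(wide, max_versions=1)
deep = compact(deep, max_versions=100)
print(len(wide), len(deep))        # 100 100: both drop the hour-1 cell
```

In this model the two layouts store and discard the same number of cells, which matches the intuition in the question. It says nothing about client-side differences such as how versions are requested or paged through on reads.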

Is there a benchmark that I can try and test the performance of these
options in terms of read/write performance and compaction effect?

Thanks,
-Serkan

On Sun, Mar 31, 2019 at 2:20 PM Wellington Chevreuil <
wellington.chevreuil@gmail.com> wrote:

> I would agree with JMS that it is ideal to avoid wide tables. Plus, there
> is still some inconsistent behaviour in the versions feature (see
> HBASE-21596, for example). I would also favour option "a" over "b", as it
> seems to give more flexibility in the way you can access/delete these
> columns.

Re: Difference of n columns with 1 version vs 1 column with n versions

Posted by Wellington Chevreuil <we...@gmail.com>.
I would agree with JMS that it is ideal to avoid wide tables. Plus, there
is still some inconsistent behaviour in the versions feature (see
HBASE-21596, for example). I would also favour option "a" over "b", as it
seems to give more flexibility in the way you can access/delete these
columns.

On Sun, Mar 31, 2019 at 00:12, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Serkan,
>
> This is my personal opinion and some might not share it ;)
>
> I tried the deep-versions approach on one project and ran into issues
> with some of the calls (pagination over versions, for example). So if
> both approaches (deep versions and wide columns) are the same for you,
> I would say it is better to go with wide columns.
>
> Also, why not go with a tall table instead of a wide one?
>
> JMS

Re: Difference of n columns with 1 version vs 1 column with n versions

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Serkan,

This is my personal opinion and some might not share it ;)

I tried the deep-versions approach on one project and ran into issues
with some of the calls (pagination over versions, for example). So if
both approaches (deep versions and wide columns) are the same for you,
I would say it is better to go with wide columns.

Also, why not go with a tall table instead of a wide one?

JMS
