Posted to user@hbase.apache.org by yonghu <yo...@gmail.com> on 2015/01/19 20:28:14 UTC

multiple data versions vs. multiple rows?

Dear all,

I want to record user history data. I know there are two options: one is to
store user events in a single row with multiple data versions, and the other
is to use multiple rows. Which one is better for performance?

Thanks!

Yong

Re: multiple data versions vs. multiple rows?

Posted by yonghu <yo...@gmail.com>.

I think we need to look at two different situations.

1. One column is updated frequently and the others are not. If we use the
row representation, every new row repeats the unchanged values of the other
columns. This can cause a lot of data redundancy, which I think explains why
the multiple-version approach was faster than the multiple-row approach in
my test.

2. All columns are updated evenly. In this case there is not much
difference in data volume between the two, as each data version is actually
stored as a key-value pair, so the performance difference between the two
approaches will not be significant.
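
The two situations can be illustrated with a small counting model. This is
only a sketch under assumptions that are not from the thread: the column
names, the event lists, and the idea that the row layout rewrites every
column for each event are all made up for illustration, and the model counts
stored cells only, not bytes or scan time.

```python
def cells_versions_layout(events):
    """Single row with versions: each event stores only the columns it changed."""
    return sum(len(changed) for changed in events)


def cells_rows_layout(events, all_columns):
    """One full row per event: every column is written again, changed or not."""
    return len(events) * len(all_columns)


# Hypothetical schema and workloads, for illustration only.
columns = ["name", "city", "status"]

# Situation 1: an initial full write, then nine updates to a single column.
skewed = [set(columns)] + [{"status"} for _ in range(9)]

# Situation 2: every event updates every column.
even = [set(columns) for _ in range(10)]

for label, events in (("skewed", skewed), ("even", even)):
    print(label,
          "versions:", cells_versions_layout(events),
          "rows:", cells_rows_layout(events, columns))
```

In this model the skewed workload stores 12 cells with versions and 30 with
rows, while the even workload stores 30 cells either way.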

Yong

On Tue, Jan 20, 2015 at 8:16 AM, Serega Sheypak <se...@gmail.com>
wrote:

> Should performance differ significantly if the row value size is small and
> we don't have too many versions?
> Assume that a pack of versions for a key is smaller than the recommended
> HFile block size (8KB to 1MB, see
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html),
> which is the minimal "read unit". Should we see any difference at all?
> Am I right?

Re: multiple data versions vs. multiple rows?

Posted by Serega Sheypak <se...@gmail.com>.

Should performance differ significantly if the row value size is small and
we don't have too many versions?
Assume that a pack of versions for a key is smaller than the recommended
HFile block size (8KB to 1MB, see
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html),
which is the minimal "read unit". Should we see any difference at all?
Am I right?
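
A back-of-the-envelope check of that assumption might look like the sketch
below. The byte counts follow the classic KeyValue layout as I understand it
and ignore tags, block encoding and compression. The 64 KB figure is, as far
as I know, the default column family block size. The row key, family,
qualifier and value sizes are invented for illustration.

```python
HFILE_DEFAULT_BLOCK_SIZE = 64 * 1024  # assumed default BLOCKSIZE, in bytes


def approx_cell_size(row_len, family_len, qualifier_len, value_len):
    """Approximate uncompressed size in bytes of one serialized cell.

    Assumed layout: 4-byte key length, 4-byte value length, 2-byte row
    length, row, 1-byte family length, family, qualifier, 8-byte timestamp,
    1-byte type, value.
    """
    key_len = 2 + row_len + 1 + family_len + qualifier_len + 8 + 1
    return 4 + 4 + key_len + value_len


# Hypothetical sizes: 16-byte row key, 1-byte family, 5-byte qualifier,
# 100-byte value, 10 versions per key.
versions_pack = 10 * approx_cell_size(16, 1, 5, 100)

# Multiple-row layout: same data, but each row key carries an extra
# 14-byte suffix (for example a separator plus a 13-digit timestamp).
rows_pack = 10 * approx_cell_size(16 + 14, 1, 5, 100)

for label, size in (("versions", versions_pack), ("rows", rows_pack)):
    print(label, size, "bytes, fits in one block:",
          size <= HFILE_DEFAULT_BLOCK_SIZE)
```

With these made-up sizes both packs are far below one block, so either
layout could be served from a single block read. Whether they actually land
in the same block depends on the neighbouring keys in the file.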


2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari <je...@spaggiari.org>:

> Hi Yong,
>
> If you want to compare the performance, you need to run much bigger and
> longer tests. Don't run them in parallel. Run each of them at least 10
> times to make sure you see a consistent trend. Is the difference between
> the two significant? It should not be.
>
> JM

Re: multiple data versions vs. multiple rows?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Yong,

If you want to compare the performance, you need to run much bigger and
longer tests. Don't run them in parallel. Run each of them at least 10
times to make sure you see a consistent trend. Is the difference between
the two significant? It should not be.
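
A minimal harness for that kind of comparison could look like the sketch
below. It runs each job sequentially, repeats it, and reports the spread of
the timings. The names scan_t1 and scan_t2 are placeholders invented here.
In a real test they would launch the two MapReduce scans and wait for them
to finish; the dummy loops only keep the example runnable.

```python
import statistics
import time


def benchmark(label, job, runs=10):
    """Run job() `runs` times, one after another, and summarise wall-clock times."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compute a standard deviation")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return {
        "label": label,
        "runs": runs,
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings),
        "min": min(timings),
        "max": max(timings),
    }


# Placeholder jobs standing in for the real scans of t1 and t2.
def scan_t1():
    sum(range(100_000))


def scan_t2():
    sum(range(120_000))


# Run the two benchmarks one after the other, never concurrently.
for label, job in (("t1 (versions)", scan_t1), ("t2 (rows)", scan_t2)):
    stats = benchmark(label, job, runs=10)
    print(f"{stats['label']}: mean={stats['mean']:.6f}s "
          f"stdev={stats['stdev']:.6f}s "
          f"min={stats['min']:.6f}s max={stats['max']:.6f}s")
```

If the gap between the two means is of the same order as the standard
deviations, the runs do not show a clear trend.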

JM

2015-01-19 15:17 GMT-05:00 yonghu <yo...@gmail.com>:

> Hi,
>
> Thanks for your suggestion. I have already considered the first issue,
> that one row cannot be split between two regions.
>
> However, I ran a small scan test with MapReduce. I first created a table
> t1 with 1 million rows and allowed each column to store 10 data versions.
> Then I translated t1 into t2, turning the multiple data versions in t1
> into multiple rows in t2. I wrote two MapReduce programs to scan t1 and
> t2 individually. The result was that scanning t1 took less time than
> scanning t2. So I think that, for performance reasons, multiple data
> versions may be a better option than multiple rows.
>
> But just as you said, which approach to use depends on how many
> historical events you want to keep.
>
> regards!
>
> Yong

Re: multiple data versions vs. multiple rows?

Posted by yonghu <yo...@gmail.com>.

Hi,

Thanks for your suggestion. I have already considered the first issue,
that one row cannot be split between two regions.

However, I ran a small scan test with MapReduce. I first created a table
t1 with 1 million rows and allowed each column to store 10 data versions.
Then I translated t1 into t2, turning the multiple data versions in t1
into multiple rows in t2. I wrote two MapReduce programs to scan t1 and
t2 individually. The result was that scanning t1 took less time than
scanning t2. So I think that, for performance reasons, multiple data
versions may be a better option than multiple rows.

But just as you said, which approach to use depends on how many
historical events you want to keep.

regards!

Yong

On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Yong,
>
> A row will not be split between two regions. If you plan to have thousands
> of versions, then depending on the size of your data, you might end up
> with a row bigger than your preferred region size.
>
> If you plan to keep just a few versions of the history to look at, I would
> say go with versions. If you plan to have a million versions because you
> want to keep the whole event history, go with the row approach.
>
> You can also consider the column qualifier approach. It has the same
> constraint as versions regarding the split between two regions, but it
> might be easier to manage and still gives you the consistency of being
> within a row.
>
> JM

Re: multiple data versions vs. multiple rows?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Yong,

A row will not be split between two regions. If you plan to have thousands
of versions, then depending on the size of your data, you might end up
with a row bigger than your preferred region size.

If you plan to keep just a few versions of the history to look at, I would
say go with versions. If you plan to have a million versions because you
want to keep the whole event history, go with the row approach.

You can also consider the column qualifier approach. It has the same
constraint as versions regarding the split between two regions, but it
might be easier to manage and still gives you the consistency of being
within a row.
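
To make the three layouts concrete, the sketch below shows where one user
event would end up in each of them, written as plain
(row key, column, timestamp, value) tuples rather than real client calls.
The family name "e", the qualifier "event", the "#" separator and the
zero-padded millisecond timestamp are all assumptions chosen for
illustration.

```python
def as_version(user_id, ts_ms, payload):
    """Versions: one row per user, one fixed column, event time as cell timestamp."""
    return (user_id.encode(), b"e:event", ts_ms, payload)


def as_row(user_id, ts_ms, payload):
    """Rows: one row per event; the padded timestamp keeps a user's rows sorted."""
    row_key = f"{user_id}#{ts_ms:013d}".encode()
    return (row_key, b"e:event", ts_ms, payload)


def as_qualifier(user_id, ts_ms, payload):
    """Column qualifier: one row per user, one column per event."""
    column = f"e:{ts_ms:013d}".encode()
    return (user_id.encode(), column, ts_ms, payload)


# Hypothetical event.
event = ("u42", 1421695694000, b"login")

for layout in (as_version, as_row, as_qualifier):
    print(layout.__name__, layout(*event))
```

In the versions and column qualifier layouts every event of a user shares
one row key, so the whole history stays in one region. In the row layout the
history is spread over many row keys and can be split. The versions layout
additionally needs the column family to be configured to retain as many
versions as you want to keep.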

JM

2015-01-19 14:28 GMT-05:00 yonghu <yo...@gmail.com>:

> Dear all,
>
> I want to record user history data. I know there are two options: one is
> to store user events in a single row with multiple data versions, and the
> other is to use multiple rows. Which one is better for performance?
>
> Thanks!
>
> Yong
>