Posted to user@hbase.apache.org by yutoo yanio <yu...@gmail.com> on 2012/10/10 17:24:57 UTC

key design

hi
I have a question about key and column design.
In my application we have 3,000,000,000 records every day.
Each record contains: user-id, timestamp, content (max 1 KB).
We need to store records for one year, which means we will have about
1,000,000,000,000 records after one year.
We only search by user-id over a range of timestamps.
The table can be designed in two ways:
1. key=userid-timestamp and column:=content
2. key=userid-yyyyMMdd and column:HHmmss=content

In the first design we have a tall-narrow table with a very large number of
rows; in the second design we have a flat-wide table.
Which of them has better performance?

thanks.
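
For concreteness, the two layouts translate roughly into the following row keys and Puts with the HBase Java client. This is only a sketch: the column family name "cf", the use of the post-1.0 Put.addColumn call, and the date formatting are assumptions for illustration, not details from the thread.

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class KeyDesignSketch {
        static final byte[] CF = Bytes.toBytes("cf");   // illustrative family name

        // Design 1: tall-narrow. One row per record; the fixed-width 8-byte
        // timestamp keeps each user's rows in chronological order.
        static Put design1(String userId, long tsMillis, byte[] content) {
            byte[] rowKey = Bytes.add(Bytes.toBytes(userId + "-"), Bytes.toBytes(tsMillis));
            Put p = new Put(rowKey);
            p.addColumn(CF, Bytes.toBytes(""), content);   // empty qualifier, as in "column:=content"
            return p;
        }

        // Design 2: flat-wide. One row per user per day; one column per second.
        static Put design2(String userId, long tsMillis, byte[] content) {
            String day  = new SimpleDateFormat("yyyyMMdd").format(new Date(tsMillis));
            String time = new SimpleDateFormat("HHmmss").format(new Date(tsMillis));
            Put p = new Put(Bytes.toBytes(userId + "-" + day));
            p.addColumn(CF, Bytes.toBytes(time), content);
            return p;
        }
    }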

RE: key design

Posted by Anoop Sam John <an...@huawei.com>.
>we only search by user-id over a range of timestamps
In that case you can go with your 1st approach IMO:
"1. key=userid-timestamp and column:=content"

>we have 200,000,000 user-ids and I think user-id is good for the lead position of the key. Is it OK?
Yes, it is...

-Anoop-
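
With the first layout, "search a user-id over a range of timestamps" becomes a plain row-key scan over the user's prefix. A minimal sketch, continuing the assumptions above (Connection/Table setup omitted; setStartRow/setStopRow are the classic Scan calls):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeQuerySketch {
        // All records of one user between startTs and endTs (millis, inclusive),
        // relying on the userid-timestamp row keys of design 1.
        static Scan userRangeScan(String userId, long startTs, long endTs) {
            byte[] prefix = Bytes.toBytes(userId + "-");
            Scan scan = new Scan();
            scan.setStartRow(Bytes.add(prefix, Bytes.toBytes(startTs)));    // start row is inclusive
            scan.setStopRow(Bytes.add(prefix, Bytes.toBytes(endTs + 1)));   // stop row is exclusive
            return scan;
        }
    }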

Re: key design

Posted by yutoo yanio <yu...@gmail.com>.
we have 200,000,000 user-ids and I think user-id is good for the lead position
of the key. Is it OK?

What about search performance? Which approach gives better results?


Re: key design

Posted by Shumin Wu <sh...@gmail.com>.
The Definitive Guide has a good discussion in Chapter 9 of tall-narrow vs.
flat-wide tables. The suggested style is to design the table tall-narrow to
make splitting easy.

Also, in approach 2, why do you need the "-yyyyMMdd" part? If you want to
keep a creation time, I think it's better to create a column to store it.
Just consider that every row would carry the storage overhead of this
trailing part.

Shumin


Re: key design

Posted by Jerry Lam <ch...@gmail.com>.
That's true. Then there would be at most 86,400 records per day per user-id.
That is about 100 MB per day. I don't see much difference between the two
approaches from a storage perspective.
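
For reference, the rough arithmetic behind those figures, assuming at most one cell per second per user (which is all the HHmmss qualifier can distinguish) and about 1 KB of content per record:

    86,400 seconds/day x ~1 KB/record ≈ 86 MB per user per day, i.e. on the order of 100 MB.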


Re: key design

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

Given the fact that the userid is in the lead position of the key in both
approaches, I'm not sure that he'd have a region hotspotting problem
because the userid should be able to offer some spread.





Re: key design

Posted by Jerry Lam <ch...@gmail.com>.
Hi:

So you are saying you have ~3 TB of data stored per day (3,000,000,000
records x ~1 KB each)?

Using the second approach, all of a user's data for one day goes into a
single row, so it will sit on only one regionserver no matter what you do,
because HBase doesn't split within a row.

Using the first approach, data will spread across regionservers, but writes
can still hotspot individual regionservers since this is a time-series
workload.

Best Regards,

Jerry
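
To make the single-row point concrete: with the second layout, reading a time range means fetching one fat row per day and restricting the HHmmss qualifiers, for example with a ColumnRangeFilter. A sketch under the same assumptions as the earlier ones (family name made up, connection setup omitted):

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlatWideReadSketch {
        // One user's records for part of one day in design 2
        // (row = userid-yyyyMMdd, column qualifiers = HHmmss).
        static Get dayRangeGet(String userId, String yyyyMMdd, String fromHHmmss, String toHHmmss) {
            Get get = new Get(Bytes.toBytes(userId + "-" + yyyyMMdd));
            get.setFilter(new ColumnRangeFilter(
                    Bytes.toBytes(fromHHmmss), true,    // minimum qualifier, inclusive
                    Bytes.toBytes(toHHmmss), true));    // maximum qualifier, inclusive
            return get;
        }
    }

All cells of that one row still come from a single region, and therefore a single regionserver.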
