
Posted to user@hbase.apache.org by Mark <st...@gmail.com> on 2011/08/21 17:59:11 UTC

Number of tables

We are logging all user actions into HBase. These actions include
searches, product views and clicks.

We are currently storing them in one table with row keys of the form
"#{type}/#{user}/#{time}", where type is one of click, search or view and
user is the currently logged-in user. This layout leads to region hot
spotting because the start of each key is fairly static. That got me
thinking about alternative ways to model this type of data, and I was
hoping to get some suggestions from the community.

Which would be more advisable?

1) Keep the current pattern described above, with all logs going to one table.
2) Keep all logs in one table, but swap the type and user fields in the
row key, which would lead to more randomized keys and so fewer hot spots.
3) Create a separate table for each type of log we are saving, i.e. a
search table, a click table and a view table.
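
To make options 1 and 2 concrete, here is a rough Ruby sketch of the two key layouts. The helper names and sample users are made up for illustration. It only shows how the leading bytes of the key change; how much that actually spreads writes depends on how user ids are distributed and where regions split.

```ruby
# Hypothetical helpers; the names are illustrative, not from the thread.

# Option 1 layout: type first, so every write of one type shares a prefix.
def type_first_key(type, user, time)
  "#{type}/#{user}/#{time}"
end

# Option 2 layout: user first, so the leading bytes vary with the user.
def user_first_key(type, user, time)
  "#{user}/#{type}/#{time}"
end

# Made-up sample events: [type, user, epoch seconds].
events = [
  ["search", "alice", 1313949551],
  ["search", "bob",   1313949552],
  ["search", "carol", 1313949553],
]

type_first = events.map { |t, u, ts| type_first_key(t, u, ts) }
user_first = events.map { |t, u, ts| user_first_key(t, u, ts) }

# Distinct first characters, as a crude proxy for how keys spread out.
type_first_prefixes = type_first.map { |k| k[0] }.uniq
user_first_prefixes = user_first.map { |k| k[0] }.uniq

puts type_first_prefixes.inspect
puts user_first_prefixes.inspect

# A per-user, per-type lookup is still a prefix match in the user-first layout.
still_prefix_scannable = user_first.first.start_with?("alice/search/")
puts still_prefix_scannable
```

With the user-first layout, "all searches by foo" would start at "foo/search/" instead of "search/foo/".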

Our use case does not require searching across multiple types, so I'm
leaning towards #3, but I was wondering whether there are any cons to
this approach. Is it worse to have more tables rather than fewer?

Thanks for the help

-M

Re: Number of tables

Posted by Mark <st...@gmail.com>.

As far as we are concerned, a user can only search once per second, view
a product once per second, etc., so the keys are unique. If we wanted to
be extra paranoid, I suppose we could use epoch milliseconds instead of
seconds to enforce this constraint.
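
A minimal sketch of the millisecond idea, in Ruby. The zero padding and the app-server suffix are assumptions added here as one possible tie-breaker for the multi-server case raised in the question; they are not something we actually run. A second write to an existing row key and column does not create a new row, which is why collisions matter.

```ruby
# Hypothetical key builder, not from the thread: epoch milliseconds,
# zero-padded to a fixed width, plus an app-server id as a tie-breaker.
def action_key(type, user, time, server_id)
  millis = time.to_i * 1000 + time.usec / 1000
  format("%s/%s/%013d/%s", type, user, millis, server_id)
end

# Two searches by the same user within the same second (made-up times).
t1 = Time.at(1313949551, 250_000) # second argument is microseconds
t2 = Time.at(1313949551, 900_000)

second_keys = [t1, t2].map { |t| "search/foo/#{t.to_i}" }
milli_keys  = [t1, t2].map { |t| action_key("search", "foo", t, "app1") }

# The same millisecond reported by two different app servers.
server_keys = %w[app1 app2].map { |s| action_key("search", "foo", t1, s) }

puts second_keys.uniq.length # both events collapse onto one row key
puts milli_keys.uniq.length
puts server_keys.uniq.length
```

The fixed-width padding keeps the string sort order of the keys the same as their time order.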

On 8/26/11 3:40 AM, Sheng Chen wrote:
> Hi Mark, just following up on your question.
> How do you ensure the uniqueness of the row key #{type}/#{user}/#{time}?
> If the action logs are generated by different app servers, it is possible
> to have several actions with the same type, user and timestamp.
>
> Thanks.
> Sean

Re: Number of tables

Posted by Sheng Chen <ch...@gmail.com>.

Hi Mark, just following up on your question.
How do you ensure the uniqueness of the row key #{type}/#{user}/#{time}?
If the action logs are generated by different app servers, it is possible
to have several actions with the same type, user and timestamp.

Thanks.
Sean


Re: Number of tables

Posted by Jean-Daniel Cryans <jd...@apache.org>.

> Is there a disadvantage to creating more tables?

Not on the HBase side; in the end it's all regions.

J-D

Re: Number of tables

Posted by Michel Segel <mi...@hotmail.com>.

Mark,
Looks like you have your key set up correctly... What happens if you make the user the first element in the key?

You can go with multiple tables. This may also help improve performance.



Mike Segel

On Aug 21, 2011, at 1:34 PM, Mark <st...@gmail.com> wrote:

> About a million rows per day per table.
> 
> Is there a disadvantage to creating more tables?
> 

Re: Number of tables

Posted by Mark <st...@gmail.com>.

About a million rows per day per table.

Is there a disadvantage to creating more tables?

On 8/21/11 10:49 AM, Sonal Goyal wrote:
> If your data size is big enough to warrant 3 tables, go for it. This would
> be the case when there are a great many entries per user#type.

Re: Number of tables

Posted by Sonal Goyal <so...@gmail.com>.

If your data size is big enough to warrant 3 tables, go for it. This would
be the case when there are a great many entries per user#type.

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>





On Sun, Aug 21, 2011 at 11:09 PM, Mark <st...@gmail.com> wrote:

> Almost all use cases require the type, i.e.:
>
> Retrieve all searches performed by user 'foo':
>   scan "history", {STARTROW => "search/foo"}
> Retrieve all product views performed by user 'foo':
>   scan "history", {STARTROW => "view/foo"}

Re: Number of tables

Posted by Mark <st...@gmail.com>.

Almost all use cases require the type, i.e.:

Retrieve all searches performed by user 'foo':
  scan "history", {STARTROW => "search/foo"}
Retrieve all product views performed by user 'foo':
  scan "history", {STARTROW => "view/foo"}
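
One caveat on the scans above: a start row on its own only says where the scan begins, so it keeps going past user 'foo' unless a stop row or limit ends it. Here is a small plain-Ruby sketch of bounding the range. The prefix_stop_row helper and the sample keys are hypothetical, and Ruby string comparison stands in for the row key ordering (both are byte-wise for ASCII keys like these).

```ruby
# Hypothetical helper (not from the thread): returns the smallest key that
# sorts after every key beginning with `prefix`, by bumping the last byte.
# Assumes the prefix is non-empty and its last byte is not 0xFF.
def prefix_stop_row(prefix)
  bytes = prefix.bytes
  bytes[-1] += 1
  bytes.pack("C*").force_encoding(prefix.encoding)
end

# Made-up sample keys in the "#{type}/#{user}/#{time}" layout.
keys = [
  "search/foo/1313949551",
  "search/foobar/1313949552",
  "search/fop/1313949553",
  "view/foo/1313949554",
]

# Start row only: everything at or after the start row is returned.
open_ended = keys.select { |k| k >= "search/foo" }

# Start row with a trailing "/" plus a computed stop row: only user 'foo'.
start_row = "search/foo/"
stop_row  = prefix_stop_row(start_row)
bounded   = keys.select { |k| k >= start_row && k < stop_row }

puts stop_row
puts open_ended.length
puts bounded.inspect
```

In the shell that would correspond to something like scan "history", {STARTROW => "search/foo/", STOPROW => "search/foo0"}, since "0" is the byte right after "/". The trailing "/" also keeps user 'foobar' out of the results.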

On 8/21/11 10:25 AM, Sonal Goyal wrote:
> Hi Mark,
>
> When you say that your use case does not require searching across multiple
> types, what do you mean? Do you have cases where you search by type?

Re: Number of tables

Posted by Sonal Goyal <so...@gmail.com>.

Hi Mark,

When you say that your use case does not require searching across multiple
types, what do you mean? Do you have cases where you search by type?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




