
Posted to user@hbase.apache.org by Miguel Costa <mi...@telecom.pt> on 2011/04/04 18:12:08 UTC

HBase design schema

Hi,

 

I need some help with a schema design on HBase.

 

I have 5 dimensions (Time, Site, Referrer, Keyword, Country).

My row key is Site+Time.

 

Now I want to answer questions like: what is the top Referrer by Keyword
for a site over a period of time?

Basically I want to cross all the dimensions that I have. And what if I have
30 dimensions?

 

What is the best schema design?

 

Please let me know if this isn't the right mailing list.

 

Thank you for your time.

 

Miguel

Re: HBase design schema

Posted by Ted Dunning <td...@maprtech.com>.

Take a look at OpenTSDB.

I think you will be impressed with the speed.

Regarding the exponential explosion.  Yes.  That is a risk in theory.  But
what happens in practice is that you only create the alternative forms of
the file where the simpler key forms are unacceptable due to volume of data.
 With OpenTSDB, the speed without any of these special versions of the data
file is quite impressive even with pretty large data.

On Mon, Apr 4, 2011 at 9:37 AM, Miguel Costa <mi...@telecom.pt> wrote:

> In Hive it is possible to run these queries if I have these dimensions as
> columns, but the problem is that I need results in 3 seconds.
>
>

RE: HBase design schema

Posted by Miguel Costa <mi...@telecom.pt>.

Ted thanks for your help.

 

I considered the last option that you mentioned, "pushing one of your
dimensions into the key".

 

With that I can have results for that single dimension. For example, key:
Time+Site+Referrer

But what if I now want the top Keywords (where top can be by any metric) for
that key? Should I have another table with this key:
Time+Site+Referrer+Keyword?

And if I have 30 more dimensions and I want to cross all of them, the
number of tables will grow exponentially (dimension * the number of available
dimensions to cross). And this can go several levels deep, for example to
level 5: Time+Site+Referrer+Keyword+Dim4+Dim5.

Updating all those tables may also take a lot of time.
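
To put rough numbers on this concern, here is a minimal Python sketch. It
assumes Time+Site stays fixed at the front of every key and counts how many
distinct key layouts (one table each) would be needed to cover every cross of
up to a given depth. The function names are made up for illustration, and
whether the order of dimensions in the key matters depends on the queries.

```python
from math import comb, perm


def ordered_key_layouts(n_dims: int, max_depth: int) -> int:
    """Count key layouts when the order of dimensions in the key matters."""
    return sum(perm(n_dims, depth) for depth in range(1, max_depth + 1))


def unordered_key_layouts(n_dims: int, max_depth: int) -> int:
    """Count key layouts when only the set of dimensions matters."""
    return sum(comb(n_dims, depth) for depth in range(1, max_depth + 1))


if __name__ == "__main__":
    # 30 cross dimensions appended after the fixed Time+Site prefix.
    for depth in (1, 2, 3):
        print(depth,
              ordered_key_layouts(30, depth),
              unordered_key_layouts(30, depth))
```

With 30 dimensions and crosses up to depth 3, this gives 25260 ordered layouts
or 4525 unordered ones, which is why materializing every combination up front
is unattractive.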

 

In Hive it is possible to run these queries if I have these dimensions as
columns, but the problem is that I need results in 3 seconds.

 

Another option that I thought of was to have the cross dimensions as
column families. For example, key: Time+Site+Referrer and column family
Keyword: MyKeyword, where the value could be the metrics that I need
separated by "\t".
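
As a rough illustration of this layout, the Python sketch below models a
single row keyed by Time+Site+Referrer, with a dict standing in for a
"Keyword" column family that holds one column per keyword. The sample values,
the two metrics (hits, visitors) and the helper names are all assumptions made
for the example; it uses no HBase client API.

```python
def pack_metrics(metrics):
    """Join numeric metrics into one tab-separated cell value."""
    return "\t".join(str(m) for m in metrics)


def unpack_metrics(cell):
    """Split a tab-separated cell value back into integers."""
    return [int(field) for field in cell.split("\t")]


# One row, keyed by Time+Site+Referrer (hypothetical sample values).
row_key = "2011-04-04|example.pt|google.com"

# Stand-in for the "Keyword" family: column name -> packed (hits, visitors).
keyword_family = {
    "hbase": pack_metrics([120, 45]),
    "hadoop": pack_metrics([300, 80]),
    "hive": pack_metrics([75, 60]),
}


def top_keywords(family, metric_index, limit):
    """Rank the keywords of one row by the chosen metric, highest first."""
    ranked = sorted(
        family.items(),
        key=lambda item: unpack_metrics(item[1])[metric_index],
        reverse=True,
    )
    return [keyword for keyword, _ in ranked[:limit]]


if __name__ == "__main__":
    print(row_key, top_keywords(keyword_family, 0, 2))
```

Note that the ranking happens on the client after the whole row is read, so
this helps only while the number of keywords per row stays manageable.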


 

What do you think is the best approach?

 

Thanks,

 




  


Miguel Costa


DTS - Sapo Technology Department 
Web Analytics 
Tm: +351 92 672 60 54
miguel.costa@telecom.pt

 

 

 

 

From: Ted Dunning [mailto:tdunning@maprtech.com] 
Sent: Monday, April 4, 2011 17:25
To: user@hbase.apache.org
Cc: Miguel Costa
Subject: Re: HBase design schema

 

 

Miguel,

 

One option is to use the simplest design and use the key you have.  Scanning
for a particular period of time will give you all the data in that time
period which you can reduce in any way that you like.

 

If that becomes too inefficient, a common trick is to build a secondary file
that contains aggregated data at lower time resolution.

 

Another trick is to copy your original table pushing one of your dimensions
into the key.  That will help by preventing you from scanning through data
you don't care about.  The space consumed is not so far off what an index in
a conventional database would consume.

 

In general, it is important to keep in mind that HBase doesn't have
conventional relational indexes so lots of the design considerations that
motivate star schemas don't really apply.


Re: HBase design schema

Posted by Ted Dunning <td...@maprtech.com>.

Miguel,

One option is to use the simplest design and use the key you have.  Scanning
for a particular period of time will give you all the data in that time
period which you can reduce in any way that you like.
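
A minimal Python sketch of this scan-then-reduce idea, assuming row keys of
the form "site|YYYY-MM-DDTHH" kept in sorted order. A sorted in-memory list
stands in for the table, and the sample rows and function names are made up
for illustration.

```python
from bisect import bisect_left
from collections import Counter

# (row_key, referrer, hits), sorted by row key like rows in a table.
ROWS = sorted([
    ("siteA|2011-04-01T10", "google.com", 5),
    ("siteA|2011-04-02T09", "bing.com", 3),
    ("siteA|2011-04-02T15", "google.com", 7),
    ("siteA|2011-04-05T08", "yahoo.com", 2),
    ("siteB|2011-04-02T11", "google.com", 9),
])
KEYS = [row[0] for row in ROWS]


def scan(start_key, stop_key):
    """Return rows with start_key <= key < stop_key."""
    lo = bisect_left(KEYS, start_key)
    hi = bisect_left(KEYS, stop_key)
    return ROWS[lo:hi]


def top_referrers(site, start_time, stop_time):
    """Scan one site's time range and reduce it to hits per referrer."""
    totals = Counter()
    for _, referrer, hits in scan(f"{site}|{start_time}",
                                  f"{site}|{stop_time}"):
        totals[referrer] += hits
    return totals.most_common()


if __name__ == "__main__":
    print(top_referrers("siteA", "2011-04-02", "2011-04-03"))
```

The reduce step is free-form: swapping the Counter for another aggregation
answers a different question over the same scan.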

If that becomes too inefficient, a common trick is to build a secondary file
that contains aggregated data at lower time resolution.
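
As a sketch of such a roll-up, the Python below sums hourly hit counts into
daily ones. The key format "site|YYYY-MM-DDTHH" and the sample data are
assumptions for the example; in practice the result would be written to a
second table.

```python
from collections import defaultdict


def roll_up_to_daily(hourly_rows):
    """Sum hits from 'site|YYYY-MM-DDTHH' keys into 'site|YYYY-MM-DD' keys."""
    daily = defaultdict(int)
    for key, hits in hourly_rows.items():
        site, hour_stamp = key.split("|", 1)
        day = hour_stamp[:10]  # keep only the YYYY-MM-DD part
        daily[f"{site}|{day}"] += hits
    return dict(daily)


hourly = {
    "siteA|2011-04-02T09": 3,
    "siteA|2011-04-02T15": 7,
    "siteA|2011-04-03T08": 2,
}
daily = roll_up_to_daily(hourly)

if __name__ == "__main__":
    print(daily)
```

A query over several months can then scan the daily data and touch far fewer
rows than the hourly data would require.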

Another trick is to copy your original table pushing one of your dimensions
into the key.  That will help by preventing you from scanning through data
you don't care about.  The space consumed is not so far off what an index in
a conventional database would consume.
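
A small Python sketch of the copy, assuming the original rows are keyed by
"site|time" and hold one hit count per referrer. The key layouts and names
are hypothetical, and the prefix filter stands in for what would be a range
scan on a real sorted table.

```python
def rekey_with_referrer(rows):
    """Build a copy keyed by site|referrer|time from rows keyed by site|time."""
    copy = {}
    for key, cells in rows.items():
        site, time = key.split("|", 1)
        for referrer, hits in cells.items():
            copy[f"{site}|{referrer}|{time}"] = hits
    return copy


def prefix_scan(table, prefix):
    """Return the rows whose key starts with prefix, in key order."""
    return {k: v for k, v in sorted(table.items()) if k.startswith(prefix)}


original = {
    "siteA|2011-04-02T09": {"bing.com": 3, "google.com": 4},
    "siteA|2011-04-02T15": {"google.com": 7},
}
by_referrer = rekey_with_referrer(original)

if __name__ == "__main__":
    # Only google.com rows are read; bing.com data is never touched.
    print(prefix_scan(by_referrer, "siteA|google.com|"))
```

The copy holds the same cells under different keys, which is where the
comparison with the size of a conventional index comes from.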

In general, it is important to keep in mind that HBase doesn't have
conventional relational indexes so lots of the design considerations that
motivate star schemas don't really apply.


RE: HBase design schema

Posted by Miguel Costa <mi...@telecom.pt>.

Thanks for all your help.

I will try your solutions. I also saw this link
http://static.last.fm/johan/huguk-20090414/fredrik-hypercubes-in-hbase.pdf.

I will try OpenTSDB and maybe Zohmg.


  
Miguel 





-----Original Message-----
From: Peter Haidinyak [mailto:phaidinyak@local.com] 
Sent: Monday, April 4, 2011 19:24
To: user@hbase.apache.org
Subject: RE: HBase design schema

I've done almost the same thing at my work. Since I'm running on a VERY
small number of servers (2), I pre-aggregate my data into tables in the
format...

[YYYY-MM-DD]|[Keyword]|[Referrer]  for the row key

And then for the data column I store the hit count for that referrer. This
approach has a problem during inserts: with the date at the front of the key,
writes usually all go to one server. The upside is that during a client
scan you can set the start and end rows, such as startRow =
'2011-03-05|hospital| ' and endRow = '2011-03-05|hospital|~'.
This will return all of the referrers for the keyword hospital for the date
2011-03-05.

YMMV

-Pete


Re: HBase design schema

Posted by tsuna <ts...@gmail.com>.

On Mon, Apr 4, 2011 at 3:30 PM, Ted Dunning <td...@maprtech.com> wrote:
> OpenTSDB does an interesting thing where they put a primary key in front of
> the date.  This limits some of the hot-spotting on inserts.  Each different
> kind of query goes to a different machine as well.  The query balancing
> won't be as good as the insert balancing since some queries are much more
> popular.

I believe this leads to better use of the blockcache though.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: HBase design schema

Posted by Ted Dunning <td...@maprtech.com>.

OpenTSDB does an interesting thing where they put a primary key in front of
the date.  This limits some of the hot-spotting on inserts.  Each different
kind of query goes to a different machine as well.  The query balancing
won't be as good as the insert balancing since some queries are much more
popular.
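
The effect on inserts can be sketched in Python by assigning keys to key
ranges, as a stand-in for how a sorted table is divided among servers. The
split points, site names and key layouts below are hypothetical; the sketch
only contrasts a date-first key with a key that puts another field first, and
is not OpenTSDB's actual key format.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_points):
    """Return the index of the key range (region) that holds the key."""
    return bisect_right(split_points, key)


def write_spread(keys, split_points):
    """Count how many writes land in each region."""
    return Counter(region_for(key, split_points) for key in keys)


sites = ["alpha", "kilo", "sierra", "zulu"]
day = "2011-04-04"

# The same day's writes under the two key layouts.
date_first = [f"{day}|{site}" for site in sites]
site_first = [f"{site}|{day}" for site in sites]

# Hypothetical split points for each table.
date_first_splits = ["2011-03-01", "2011-04-01", "2011-05-01"]
site_first_splits = ["g", "n", "t"]

if __name__ == "__main__":
    print(write_spread(date_first, date_first_splits))
    print(write_spread(site_first, site_first_splits))
```

With the date first, every write for the current day falls in one range; with
the site first, the same writes are spread over as many ranges as there are
site prefixes.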

On Mon, Apr 4, 2011 at 11:23 AM, Peter Haidinyak <ph...@local.com> wrote:

> I've done almost the same thing at my work. Since I'm running on a VERY
> small number of servers (2), I pre-aggregate my data into tables in the
> format...
>
> [YYYY-MM-DD]|[Keyword]|[Referrer]  for the row key
>
> And then for the data column I store the hit count for that referrer. This
> approach has a problem during inserts: with the date at the front of the key,
> writes usually all go to one server. The upside is that during a client
> scan you can set the start and end rows, such as startRow =
> '2011-03-05|hospital| ' and endRow = '2011-03-05|hospital|~'.
> This will return all of the referrers for the keyword hospital for the date
> 2011-03-05.
>
> YMMV
>
> -Pete

RE: HBase design schema

Posted by Peter Haidinyak <ph...@local.com>.

I've done almost the same thing at my work. Since I'm running on a VERY small number of servers (2), I pre-aggregate my data into tables in the format...

[YYYY-MM-DD]|[Keyword]|[Referrer]  for the row key

And then for the data column I store the hit count for that referrer. This approach has a problem during inserts: with the date at the front of the key, writes usually all go to one server. The upside is that during a client scan you can set the start and end rows, such as startRow = '2011-03-05|hospital| ' and endRow = '2011-03-05|hospital|~'. This will return all of the referrers for the keyword hospital for the date 2011-03-05.
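
The start/end row trick can be sketched in Python with a sorted list of keys
standing in for the table. The sample keys are made up; the space and "~"
sentinels work here because they sort below and above the other printable
ASCII characters, so a referrer starting with a byte above "~" would be
missed.

```python
from bisect import bisect_left

# Row keys in [YYYY-MM-DD]|[Keyword]|[Referrer] form, kept sorted.
KEYS = sorted([
    "2011-03-05|hospital|bing.com",
    "2011-03-05|hospital|google.com",
    "2011-03-05|hospitality|google.com",
    "2011-03-05|hotel|google.com",
    "2011-03-06|hospital|google.com",
])


def scan_keys(start_row, end_row):
    """Return keys with start_row <= key < end_row."""
    lo = bisect_left(KEYS, start_row)
    hi = bisect_left(KEYS, end_row)
    return KEYS[lo:hi]


# ' ' sorts before and '~' after any printable ASCII referrer name.
matches = scan_keys("2011-03-05|hospital| ", "2011-03-05|hospital|~")

if __name__ == "__main__":
    for key in matches:
        print(key)
```

Because the "|" separator is part of both bounds, the keyword "hospitality"
on the same date is excluded from the result.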

YMMV

-Pete
