You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Alwin Roosen <dr...@gmail.com> on 2011/12/16 17:26:05 UTC

new to HBase/NoSQL, need some help with database design

Hello,


I have been suggested to use HBase for a project, but after reading
some manuals/guidelines, I am still not sure how to design the
database and getting more confused by the minute. I am new to any form
of NoSQL database and having a hard time figuring this one out.

I am hoping that someone can suggest a HBase design to me based on the
info below. It would also be nice to guide me to some of the HBase
classes/methods I need to use to get the results I need.

I basically have two tables, a category table and a keyword table. The
category table only contains a few hundred records, but the keyword
table could contain millions over time (hence the HBase suggestion).

The project is a bit more complex then this, but if I can get started
and understand the NoSQL concept for this example, I hope I am able to
figure out the rest by myself.

SQL approach:

===================================
category
===================================
id		name		parent
-----------------------------------
1		cat1		NULL
2		cat2		NULL
3		cat1-1		1
4		cat3		NULL
5		cat3-1		4
6		cat3-2		4
...

===================================
keyword
===================================
name	category	score
-----------------------------------
book	1			23
house	4			14
cup		5			75
shoe	2			3
phone	1			58
tablet	1			NULL
...

I need to be able to query HBase with the following example SQL scenarios:

1) Get the root categories

   SELECT * FROM `category` WHERE `parent` IS NULL

2) Get the child categories for a certain root category (one level)

   SELECT * FROM `category` WHERE `parent`=4

3) Get a list of root categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC

4) Get a list of child categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC

5) Get a list of keywords that do not have a score yet

   SELECT * FROM `keyword` WHERE `score` IS NULL

6) Get the total number of categories:

   SELECT COUNT(`id`) FROM `category`

7) Get the total number of root categories:

   SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL

8) Get the total number of keywords:

   SELECT COUNT(`name`) FROM `keyword`

9) Get the total number of keywords without a score:

   SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL


HBase approach (what I have so far but is most certainly very wrong):

===================================
category
===================================
key = category name
columns:
 - info:id (id of the category)
 - relation:parent (id of the parent category)

I am able to parse a category-tree in java (using table.getScanner())
by just selecting all the rows (small table, no big deal) and creating
a tree.

Since the table is sorted by key, I get a nice category-tree
alphabetically sorted.

===================================
keyword
===================================
key = keyword name
columns:
 - info:name (name of the keyword)
 - info:score (score is available, otherwise not set)
 - relation:category (id of the category)

As I understand, you cannot sort on column values (info:score for
example), so how should I approach this? It would be possible to
process this in java, but I have a feeling this is not the correct
approach and the database design should be very different.


Any help would be much appreciated!

Re: new to HBase/NoSQL, need some help with database design

Posted by Bryan Beaudreault <bb...@hubspot.com>.

Disclaimer: I'm not a master at HBase schema design, so someone more
knowledgable feel free to refute the below.

I'd say the first thing you should do is take a step back; jumping into
HBase is not as easy as jumping into MySQL.  If you are going to be working
at scale, the efficiency of your design matters greatly, such as row key
size, column qualifier size, number of columns, separation of columns by
column family, etc.  A lot of what MySQL abstracts away is put right in
your face with HBase. This is great in a lot of ways, but provides for a
steep learning curve for someone looking to get in on the scene.  Before
contemplating such a project, I'd recommend taking a few days studying the
various documentation out there: http://hbase.apache.org/book.html and
http://ofps.oreilly.com/titles/9781449396107/ are a good place to start.
 Learning the inner architecture of how KeyValues work and how data is
retrieved by HBase is very important.

One of the first things you'll hopefully learn is that HBase/NoSQL is not
relational.  It doesn't make sense to have a category_id "foreign key"
column like you have in your current approach.  There is no "third normal
form" and such for NoSQL like there is for relation databases.  To that
point, compiling a list of SQL queries as a starting point for a NoSQL
project is probably not the best starting point.  If your data is highly
relational, while its certainly possible to make it work in a
non-relational system, it may not be recommended without some real
expertise or time to learn the new paradigms.

A couple of tips:

1) You mention sorting on columns.  You're right, this is not provided by
HBase.  In HBase there is a single primary key, and that is the row key.
 This is sorted lexicographically, as you have already found out.  Keep in
mind that when you sort in MySQL, it takes your data set and loads it into
memory to be sorted.  HBase doesn't do this for you, but you could easily
do it from your client, once you have the data you want.  You will see this
pattern a lot: MySQL does things for you that you now need to handle
yourself, and realize that there isn't much magical about how MySQL is
doing it.

2) There are filters that can do some of what you want, such as returning
only rows where a column is empty.  There are also coprocessors, depending
on the release you are using, which can do some extra work on the region
server before sending over the wire (such as more complex filtering or data
manipulation that might be expensive to do locally).

--

I'll take a very quick and naive stab at your specific example.  There
should be no id columns.  If you want to have a list of categories and
categories can have multiple children, maybe each child category would be a
column on the parent category's row.  e.g. rowkey = category name; columns
= one column (qualifer = 0x00 byte array) for the main category data, and
child categories are extra columns where qualifiers are the name.  The
value could be a protobuf or avro message with whatever fields are
important for a category.  If all keywords are linked to a category, you
might have that be part of the protobuf/avro message.  So for protobuf your
message would be:

message Category {
     repeated Keyword keyword = 1;
     optional string some_other_per_category_field = 2;
}

message Keyword {
     optional string name = 1;
     optional int32 score = 2;
}

Like I said, this was quick and naive.  Some of the queries you mentioned
above would be expensive with this approach.  You could always keep another
table (or even rowkey schema within the same table, or another column
family) to keep separate incremented counters for particular statistics you
are interested in.  Just an untested idea.

Hope this helps,

Bryan

On Fri, Dec 16, 2011 at 11:26 AM, Alwin Roosen <dr...@gmail.com> wrote:

> Hello,
>
>
> I have been suggested to use HBase for a project, but after reading
> some manuals/guidelines, I am still not sure how to design the
> database and getting more confused by the minute. I am new to any form
> of NoSQL database and having a hard time figuring this one out.
>
> I am hoping that someone can suggest a HBase design to me based on the
> info below. It would also be nice to guide me to some of the HBase
> classes/methods I need to use to get the results I need.
>
> I basically have two tables, a category table and a keyword table. The
> category table only contains a few hundred records, but the keyword
> table could contain millions over time (hence the HBase suggestion).
>
> The project is a bit more complex then this, but if I can get started
> and understand the NoSQL concept for this example, I hope I am able to
> figure out the rest by myself.
>
> SQL approach:
>
> ===================================
> category
> ===================================
> id              name            parent
> -----------------------------------
> 1               cat1            NULL
> 2               cat2            NULL
> 3               cat1-1          1
> 4               cat3            NULL
> 5               cat3-1          4
> 6               cat3-2          4
> ...
>
> ===================================
> keyword
> ===================================
> name    category        score
> -----------------------------------
> book    1                       23
> house   4                       14
> cup             5                       75
> shoe    2                       3
> phone   1                       58
> tablet  1                       NULL
> ...
>
> I need to be able to query HBase with the following example SQL scenarios:
>
> 1) Get the root categories
>
>   SELECT * FROM `category` WHERE `parent` IS NULL
>
> 2) Get the child categories for a certain root category (one level)
>
>   SELECT * FROM `category` WHERE `parent`=4
>
> 3) Get a list of root categories sorted by the total highest score
> from their keywords
>
>   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
> `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
> `parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC
>
> 4) Get a list of child categories sorted by the total highest score
> from their keywords
>
>   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
> `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
> `parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC
>
> 5) Get a list of keywords that do not have a score yet
>
>   SELECT * FROM `keyword` WHERE `score` IS NULL
>
> 6) Get the total number of categories:
>
>   SELECT COUNT(`id`) FROM `category`
>
> 7) Get the total number of root categories:
>
>   SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL
>
> 8) Get the total number of keywords:
>
>   SELECT COUNT(`name`) FROM `keyword`
>
> 9) Get the total number of keywords without a score:
>
>   SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL
>
>
> HBase approach (what I have so far but is most certainly very wrong):
>
> ===================================
> category
> ===================================
> key = category name
> columns:
>  - info:id (id of the category)
>  - relation:parent (id of the parent category)
>
> I am able to parse a category-tree in java (using table.getScanner())
> by just selecting all the rows (small table, no big deal) and creating
> a tree.
>
> Since the table is sorted by key, I get a nice category-tree
> alphabetically sorted.
>
> ===================================
> keyword
> ===================================
> key = keyword name
> columns:
>  - info:name (name of the keyword)
>  - info:score (score is available, otherwise not set)
>  - relation:category (id of the category)
>
> As I understand, you cannot sort on column values (info:score for
> example), so how should I approach this? It would be possible to
> process this in java, but I have a feeling this is not the correct
> approach and the database design should be very different.
>
>
> Any help would be much appreciated!
>