Posted to user@hbase.apache.org by N Kapshoo <nk...@gmail.com> on 2010/06/19 05:52:30 UTC

Fwd: data redundancy in hbase tables for read performance

I never heard from anyone. I would appreciate it if anyone has any insight
on this...

---------- Forwarded message ----------
From: N Kapshoo <nk...@gmail.com>
Date: Wed, May 12, 2010 at 2:21 PM
Subject: data redundancy in hbase tables for read performance
To: hbase-user@hadoop.apache.org


For the model I am designing, read speed is the highest priority. That being
said, I have a Customers table with information about Claims.

Here is the design today:

Table: Customers
RowId: CustomerId
Family: Claims
Column: ClaimId
Value: JSON(ClaimId, Status, Description, From)

I am storing the ClaimsInfo as a JSON object. This JSON object will be
displayed in a tabular format after querying.

Now I get an additional requirement to sort claims by status.

I resolve this by adding a new Family called 'ClaimStatus'. (Denormalization +
Redundancy)

Table: Customers
RowId: CustomerId
Family: ClaimStatus
Column: ClaimId
Value: *String*
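
A minimal sketch of this two-family layout, using plain Python dictionaries in place of a real HBase table. No HBase client is involved; the helper names `put_claim` and `claims_sorted_by_status` and the sample values are made up for illustration. It shows that the status is stored twice, so a status change has to touch both families:

```python
import json

# One row of the Customers table: family -> {column (ClaimId) -> value}.
# Plain dicts stand in for HBase here; nothing below calls a real client.
row = {"Claims": {}, "ClaimStatus": {}}

def put_claim(row, claim_id, status, description, from_):
    """Write the full JSON cell and the redundant status cell together."""
    row["Claims"][claim_id] = json.dumps({
        "ClaimId": claim_id,
        "Status": status,
        "Description": description,
        "From": from_,
    })
    row["ClaimStatus"][claim_id] = status

def claims_sorted_by_status(row):
    """Read only the small ClaimStatus family and sort on the client."""
    return sorted(row["ClaimStatus"].items(), key=lambda kv: (kv[1], kv[0]))

# Hypothetical sample data.
put_claim(row, "c2", "Open", "Water damage", "Alice")
put_claim(row, "c1", "Closed", "Broken window", "Bob")
put_claim(row, "c3", "Open", "Theft", "Carol")

print(claims_sorted_by_status(row))
```

The sort only reads the short status strings, which is the point of the extra family; the cost is that every write has to keep the two copies in step.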


My concern is, do I continue down this path when more query requirements are
added to the system? For example, when they want to retrieve by 'From', then
I add another family called 'From'?

Should I be creating a new table in that case to support the new family?
Admittedly, the data in these columns is not huge, but I am worried about
doing multiple 'Puts' when the value changes.
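
To make the write cost concrete, here is a rough sketch that lists every cell one claim change would have to write if each queryable field gets its own family. The `INDEXED_FIELDS` mapping and the `cells_for_claim` helper are hypothetical, and the code only builds tuples; it does not talk to HBase. Because all of these cells live in the same row (the CustomerId), they could in principle travel in a single row-level write. HBase documents single-row mutations as atomic across column families, but that is worth checking against the version in use.

```python
import json

# Hypothetical mapping: JSON field -> redundant column family that mirrors it.
INDEXED_FIELDS = {"Status": "ClaimStatus", "From": "From"}

def cells_for_claim(claim):
    """Return every (family, column, value) cell one claim change must write."""
    cells = [("Claims", claim["ClaimId"], json.dumps(claim))]
    for field, family in INDEXED_FIELDS.items():
        cells.append((family, claim["ClaimId"], claim[field]))
    return cells

# Hypothetical sample claim.
claim = {"ClaimId": "c1", "Status": "Open", "Description": "Theft", "From": "Carol"}
cells = cells_for_claim(claim)

# One full JSON cell plus one cell per indexed field.
print(len(cells))
```

Each new query requirement adds one entry to the mapping and one more cell per write, so the write amplification grows linearly with the number of redundant families.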

Am I on the right track by adding redundancy to keep up with read
performance?

Thanks.

RE: data redundancy in hbase tables for read performance

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.
Reading how, exactly? 

I think (but I am no expert) HBase is very good at sequential table scans but not quite so good at random reads. To help speed things up you can use a pathing technique in secondary index keys. See here:

http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/

So for example you might have

Customer
--------
CustomerId
...

Claim
-----
ClaimId
Status

IxClaim_ClaimIdCustomerId_Asc
-----
ClaimIdCustomerId

IxClaim_StatusClaimId_Asc
-----
StatusClaimId


Have a read of the article; it explains it better than I can here.
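
A rough sketch of the composite-key idea behind an index table such as IxClaim_StatusClaimId_Asc, written in plain Python with no HBase client. The zero-byte separator, the `SortedIndex` class, and the sample claims are assumptions made for illustration; the linked article covers the byte encodings properly. Keys are kept in byte order, as HBase orders row keys, so all claims with one status form a contiguous range that a prefix scan can read sequentially:

```python
import bisect

SEP = b"\x00"  # assumed separator; must never occur inside a status or an id

def index_key(status, claim_id):
    """Build a Status + ClaimId composite key as bytes."""
    return status.encode("utf-8") + SEP + claim_id.encode("utf-8")

class SortedIndex:
    """In-memory stand-in for an index table: keys kept in byte order."""

    def __init__(self):
        self._keys = []

    def add(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def remove(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def prefix_scan(self, prefix):
        """Return all keys starting with prefix, in sorted order."""
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        return matches

ix = SortedIndex()
for status, claim_id in [("Open", "c2"), ("Closed", "c1"), ("Open", "c3")]:
    ix.add(index_key(status, claim_id))

# Sequential read of every claim whose status is "Open".
open_claims = [k.split(SEP, 1)[1].decode("utf-8")
               for k in ix.prefix_scan(b"Open" + SEP)]
print(open_claims)
```

Note that a status change means removing the old index key and adding a new one, so this approach trades extra write work for sequential reads, much like the redundant families in the original question.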