
Posted to user@hbase.apache.org by Steven Wu <wu...@gmail.com> on 2013/12/11 00:35:02 UTC

hbase schema design

 

 

Hi,

I am very new to HBase; I am still learning it on my own and building a POC
for our current project. I have a question about row key design.

I have created a big table (called the asset table) with more than 50M
records. Each asset has a unique key (let's call it asset_key).

This table receives continuous updates from an upstream system (around 100
updates per minute). The clients would like to receive real-time updates from
us. In the current system, we have two indexed columns (asset_key, update_ts)
on the asset DB table, so clients can query the table by update_ts for the
latest updates. However, the DB has now become a bottleneck.

So we are wondering how we could achieve the same functionality in HBase. I
don't want to use a scan filter on the column, as it will trigger a full table
scan (correct me if I am wrong on this).

 

The best thing I could think of is to build the timestamp into the row key.
However, we still have a requirement that clients be able to query data by
the unique asset_key.

 

The use case is that the system has to support more than 1000 concurrent
users querying the latest updates from this table at the lowest possible
latency. Clients would also like to query by the unique asset_key to retrieve
records from our system.

 

 

I would really appreciate your thoughts on this.

 

 

 

Regards,

 

 

Steven

Re: hbase schema design

Posted by Silvio Di gregorio <si...@gmail.com>.

Hi,
This is characteristic time-series data. You must prefix the row key to
avoid sending the whole workload to a single region server:
<some non-monotonic value>_timestamp.
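As a rough illustration of the prefixing idea above, here is a minimal sketch in plain Python (no HBase client involved). The bucket count, the use of an MD5 hash of asset_key as the non-monotonic prefix, and the byte layout are all assumptions made for the example; they are not something stated in this thread.

```python
import hashlib
import struct

# Assumption: number of salt buckets, e.g. on the order of the number of region servers.
NUM_BUCKETS = 16


def salt_for(asset_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Derive a stable bucket from the asset key (not from the timestamp)."""
    digest = hashlib.md5(asset_key.encode("utf-8")).digest()
    return digest[0] % num_buckets


def salted_row_key(asset_key: str, update_ts_ms: int) -> bytes:
    """Hypothetical layout: 1-byte salt | 8-byte big-endian timestamp | asset key."""
    salt = salt_for(asset_key)
    return struct.pack(">BQ", salt, update_ts_ms) + asset_key.encode("utf-8")


def scan_ranges(since_ms: int, num_buckets: int = NUM_BUCKETS):
    """One (start, stop) key range per bucket for 'all updates since since_ms'."""
    ranges = []
    for salt in range(num_buckets):
        start = struct.pack(">BQ", salt, since_ms)
        stop = struct.pack(">B", salt + 1)  # first key of the next bucket
        ranges.append((start, stop))
    return ranges
```

The trade-off is on the read side: writes spread over all buckets, but a query for the latest updates has to issue one range scan per bucket and merge the results.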

RE: hbase schema design

Posted by Vladimir Rodionov <vr...@carrieriq.com>.

100 writes/updates per minute is a very low number, and HBase is, of course, able to sustain roughly 1.7 updates/sec (unless each update is gigabytes in size).
1000 concurrent users with minimal query latency is probably possible, but we do not have enough information:
 What is the SLA? What are the requests-per-second and latency requirements? How large is the typical result set?

You will definitely need to keep your hot data set in RAM. If you can afford to store the data twice, and ACID transactions
are not a must-have feature:

Have two rows per asset item:
rowkey1: asset_key + update_time
rowkey2: update_time + asset_key

This basically gives you two covering indexes, by asset_key and by update_time. Because you duplicate the data,
you replace the many random lookups (as with a simple index) by one scan operation over the corresponding
row keys.

On each asset update, insert two rows into the table (you can keep them in the same table) and make sure you have enough RAM
(cache) to keep everything in memory.
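The two-row layout can be sketched as follows. This is plain Python with an in-memory sorted list standing in for an HBase table, only to show how each query becomes a single range scan. The one-byte "a"/"t" prefixes that keep both key families apart in one table, the zero-byte separator, and all function names are assumptions made for the example.

```python
import bisect
import struct

SEP = b"\x00"  # assumption: asset keys never contain a zero byte


def key_by_asset(asset_key: str, update_ts_ms: int) -> bytes:
    """rowkey1: 'a' | asset_key | 0x00 | 8-byte big-endian update time."""
    return b"a" + asset_key.encode("utf-8") + SEP + struct.pack(">Q", update_ts_ms)


def key_by_time(asset_key: str, update_ts_ms: int) -> bytes:
    """rowkey2: 't' | 8-byte big-endian update time | asset_key."""
    return b"t" + struct.pack(">Q", update_ts_ms) + asset_key.encode("utf-8")


class SortedTable:
    """Stand-in for a table whose rows are kept sorted by row key bytes."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key: bytes, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: bytes, stop: bytes):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def write_update(table: SortedTable, asset_key: str, update_ts_ms: int, payload):
    # Two independent puts: the data is stored twice and the pair is not atomic.
    table.put(key_by_asset(asset_key, update_ts_ms), payload)
    table.put(key_by_time(asset_key, update_ts_ms), payload)


def updates_since(table: SortedTable, since_ms: int):
    """All updates with update time >= since_ms, oldest first."""
    start = b"t" + struct.pack(">Q", since_ms)
    return [v for _, v in table.scan(start, b"u")]  # b"u" is the first key past the 't' prefix


def history_of(table: SortedTable, asset_key: str):
    """All stored versions of one asset, oldest first."""
    base = b"a" + asset_key.encode("utf-8")
    return [v for _, v in table.scan(base + SEP, base + b"\x01")]
```

For example, after `write_update(table, "A1", 1000, "v1")` and `write_update(table, "A1", 2000, "v2")`, `history_of(table, "A1")` scans only the rows of that asset, and `updates_since(table, 1500)` scans only the time-ordered rows from that point on.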


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com
