Posted to user@hbase.apache.org by kishore g <g....@gmail.com> on 2010/01/08 00:00:28 UTC

Insert streamed data into hbase

Hi,

I see that inserting into HBase is not very efficient for large data.
For event logging I see the solution explained in
http://www.mail-archive.com/hbase-user@hadoop.apache.org/msg06010.html

If my understanding is correct, this is applicable only if the key is a
timestamp. Is there a solution to achieve the following?

--> We get a stream of events and want to insert them into a table, but
our key will be something different from a timestamp.

Is there any way to achieve this efficiently, apart from inserting every
event individually using the HBase APIs?

thanks
Kg

Re: Insert streamed data into hbase

Posted by Andrew Purtell <ap...@apache.org>.
No, you misunderstand. *Sequential* inserting into HBase is not very
efficient for large data. Basically, if you are inserting a high volume of
data with row keys that are all adjacent to each other, this will focus
all of the load on one region server only. If the keys for the data being
inserted are well distributed over the key space, then the load will be
well distributed over the region servers as well.

If your data is keyed by timestamp and/or you are doing bulk uploading,
then the solution you refer to is appropriate. 

Since you say you are not keying by timestamp, perhaps your keying
strategy will already be fine.

For example, in a Web crawling application of mine, I key the retrieved
content by the SHA-1 hash of the content itself. Due to the properties of
the hash function, all inserts are well distributed in the key space.
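
Roughly, that looks like this (an untested sketch against the 0.20-era
client API; the family and qualifier names are made up):

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ContentStore {
        // Insert one retrieved page, keyed by the SHA-1 of its bytes.
        // Hash keys scatter writes uniformly over the key space, so no
        // single region server takes all of the load.
        public static void store(HTable table, byte[] content)
                throws Exception {
            byte[] rowKey = MessageDigest.getInstance("SHA-1")
                                         .digest(content);
            Put put = new Put(rowKey);
            // "content"/"raw" are placeholder family/qualifier names.
            put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
                    content);
            table.put(put);
        }
    }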

Another example: if you are importing data via a MapReduce job, you can
write a trivial partitioner that randomly distributes keys to the set of
reducers, which will thus store values into HBase in parallel in random
order.
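
A sketch of such a partitioner (untested, against the
org.apache.hadoop.mapreduce API; the class name is mine):

    import java.util.Random;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends each record to a uniformly random reduce task so parallel
    // HBase writes are spread evenly rather than ordered by key. This
    // gives up grouping equal keys in one reducer, which is fine when
    // each record becomes an independent put.
    public class RandomPartitioner<K, V> extends Partitioner<K, V> {
        private final Random random = new Random();

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return random.nextInt(numReduceTasks);
        }
    }

Wire it into the job with job.setPartitionerClass(RandomPartitioner.class).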

   - Andy


