
Posted to user@hbase.apache.org by Leo Alekseyev <dn...@gmail.com> on 2010/11/01 10:28:31 UTC

Best strategy for row updates

We are populating some HBase tables from daily data streams that are
stored in Hive.  When we see a row key that's already in the table,
the data should be appended to that row's record.  What is the best
way to achieve this?  Should we be using the Java API?  Rely on
HBase cell timestamping?  Create compound keys (row_id+date) and
periodically run a separate MR job to coalesce all the data belonging
to the same row_id?

Any pointers greatly appreciated!

--Leo

Re: Best strategy for row updates

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Leo,

Maybe HBaseHUT can help, although you say "append", not "update" or "combine"...

See:
http://blog.sematext.com/2010/12/16/deferring-processing-updates-to-increase-hbase-write-performance/


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Best strategy for row updates

Posted by Leo Alekseyev <dn...@gmail.com>.

The data will be accessed both by MR jobs (if possible, via Hive,
using HBaseStorageHandler), and randomly via the REST API.  The rows
won't be too big.

Ideally, I would like to store lists of attributes for every row key
(example: store lists of visitors to a set of URLs, URL being the row
key).  Thus, one option is to create an insertion scheme where for
every row key, new data are appended to the existing list.  This makes
retrievals straightforward.
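
A minimal sketch of this first option, using a plain Python dict as a
stand-in for the table. Nothing here is the HBase client API; the column
name "visits:list" and the comma-joined encoding are assumptions made only
for illustration. It shows that every insertion becomes a read followed by
a write, while retrieval stays a single-row lookup.

```python
# In-memory stand-in for a table: row key -> {column: value}.
# NOT the HBase client API; it only models the read-modify-write pattern.
append_table = {}

def append_visitors(table, url, new_visitors, column="visits:list"):
    """Read the current list for `url`, extend it, and write it back."""
    row = table.setdefault(url, {})
    current = row.get(column, "")
    existing = current.split(",") if current else []
    row[column] = ",".join(existing + list(new_visitors))

def get_visitors(table, url, column="visits:list"):
    """A single-row lookup returns the whole accumulated list."""
    value = table.get(url, {}).get(column, "")
    return value.split(",") if value else []

append_visitors(append_table, "example.com/a", ["alice", "bob"])  # day 1 load
append_visitors(append_table, "example.com/a", ["carol"])         # day 2 load
print(get_visitors(append_table, "example.com/a"))
```

The cost is on the write side: each daily load has to fetch the existing
value before it can store the new one, and two concurrent writers to the
same row would need some form of coordination that this sketch leaves out.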

The second option is to store new data in separate rows by making
timestamp part of the row key, and scan through a set of rows on
retrieval.  This makes insertions easy, but would row scans be fast
enough for random accesses via REST API?
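
The second option can be sketched the same way. The toy table below keeps
its row keys sorted, which is the property that makes a prefix scan over
url+timestamp keys possible. The class, the helper names, and the "\x00"
separator are all hypothetical; this is a model of the key layout, not of
HBase itself.

```python
import bisect

class SortedTable:
    """Toy model of a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) for every row whose key starts with `prefix`."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1

SEP = "\x00"  # hypothetical separator, assumed never to occur inside a URL

def compound_key(url, date):
    """Build a row key of the form url + separator + date."""
    return f"{url}{SEP}{date}"

def visitors_for(table, url):
    """Collect every daily batch stored under rows that belong to `url`."""
    out = []
    for _, batch in table.scan_prefix(url + SEP):
        out.extend(batch)
    return out

scan_table = SortedTable()
scan_table.put(compound_key("example.com/a", "20101101"), ["alice", "bob"])
scan_table.put(compound_key("example.com/a", "20101102"), ["carol"])
scan_table.put(compound_key("example.com/ab", "20101101"), ["dave"])
print(visitors_for(scan_table, "example.com/a"))
```

Insertion is a blind write with no prior read. Retrieval touches one row
per stored batch, so its cost grows with the number of days kept for a
key. The separator keeps a scan for "example.com/a" from also matching
"example.com/ab".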

The third option is to store new data in a different column, i.e. making
the timestamp part of the column qualifier within a family.  I'm not sure
what drawbacks that entails...
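
A sketch of the third option, again as a plain Python model rather than
HBase code. It assumes one column per daily batch inside a single row,
with the date as the column name; the family name "v" and the function
names are made up for the example.

```python
from collections import defaultdict

# row key -> {column name -> value}; one column per daily batch.
# The family name "v" is hypothetical.
wide_table = defaultdict(dict)

def add_batch(table, url, date, visitors):
    """Blind write: store the batch under a column named after its date."""
    table[url][f"v:{date}"] = list(visitors)

def all_visitors(table, url):
    """Single-row read: merge every dated column in date order."""
    row = table.get(url, {})
    out = []
    for column in sorted(row):
        out.extend(row[column])
    return out

add_batch(wide_table, "example.com/a", "20101101", ["alice", "bob"])
add_batch(wide_table, "example.com/a", "20101102", ["carol"])
print(all_visitors(wide_table, "example.com/a"))
```

In this layout both the insert and the lookup address exactly one row; the
row simply gets wider over time, so the trade-off to watch is how many
columns a single row ends up holding.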

Retrieving data that's been accumulating over time seems like a pretty
common use pattern; I'm a little surprised that I couldn't easily find
guidelines or descriptions of possible trade-offs...

--Leo

On Mon, Nov 1, 2010 at 7:17 AM, Michael Segel <mi...@hotmail.com> wrote:
>
> Best? That's pretty subjective.
>
> How are you planning on accessing the data?
> Since you don't want to overwrite the data you can't really rely on the timestamps.
> (Or is the updated data a replacement?)
>
> Depending on the data size and structure, you could append to the same column family and column (record), or you could create a new column and insert the data there.
>
> Not sure which would be best; it would depend on how you want to access the data.
>

RE: Best strategy for row updates

Posted by Michael Segel <mi...@hotmail.com>.

Best? That's pretty subjective.

How are you planning on accessing the data? 
Since you don't want to overwrite the data you can't really rely on the timestamps.
(Or is the updated data a replacement?)
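
To illustrate the point about timestamps, here is a simplified Python
model of a single versioned cell. It assumes a limit of 3 retained
versions; in a real table that limit is a per-family schema setting, and
old versions are pruned by the store rather than eagerly on each write as
done here. The class and method names are hypothetical.

```python
MAX_VERSIONS = 3  # assumed limit; the real value is a schema setting

class VersionedCell:
    """Toy model of one cell that keeps only the newest few versions."""

    def __init__(self, max_versions=MAX_VERSIONS):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        self._versions[timestamp] = value
        # Drop everything older than the newest `max_versions` timestamps.
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self):
        """Default read: only the newest version is returned."""
        return self._versions[max(self._versions)]

    def get_all(self):
        """Read every retained version, newest first."""
        return [self._versions[ts]
                for ts in sorted(self._versions, reverse=True)]

cell = VersionedCell()
for day, batch in enumerate(["alice", "bob", "carol", "dave"], start=1):
    cell.put(day, batch)
print(cell.get(), cell.get_all())
```

After four daily writes the first batch is no longer retrievable, and a
plain read sees only the latest one. Accumulating data through versions
therefore works only if the version limit is known to exceed the number of
updates a cell will ever receive.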

Depending on the data size and structure, you could append to the same column family and column (record), or you could create a new column and insert the data there.

Not sure which would be best; it would depend on how you want to access the data.
