Posted to user@hbase.apache.org by Steinmaurer Thomas <Th...@scch.at> on 2011/11/28 10:55:00 UTC

Strategies for aggregating data in an HBase table

Hello,

 

this has already been discussed a bit in the past, but I'm reviving the
thread because this is an important design issue in our HBase
evaluation.

 

Basically, the result of our evaluation was that we will be happy with
what Hadoop/HBase offers for managing our measurement/sensor data. One
crucial requirement for backend analysis tasks, though, is very fast
access to aggregated data. The idea is to run a MapReduce job and store
the daily aggregates in an RDBMS, which lets us access aggregated data
more easily via different tools (BI frontends etc.). Monthly and yearly
aggregates are then handled with RDBMS concepts like materialized views
and partitioning.
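
To make this concrete, here is a rough sketch of how such a nightly job
might look (all table, family and qualifier names below are
placeholders, and the export to the RDBMS is only indicated by a
comment):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DailyAggregateJob {

  // Emits (sensorId|day) -> measurement so a reducer can sum per day.
  static class AggMapper extends TableMapper<Text, DoubleWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result r, Context ctx)
        throws IOException, InterruptedException {
      KeyValue kv = r.getColumnLatest(Bytes.toBytes("m"), Bytes.toBytes("value"));
      if (kv == null) return;
      String sensor = Bytes.toString(row.get(), row.getOffset(), row.getLength());
      long day = kv.getTimestamp() / 86400000L;  // bucket the cell by day
      ctx.write(new Text(sensor + "|" + day),
                new DoubleWritable(Bytes.toDouble(kv.getValue())));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-sensor-aggregates");
    job.setJarByClass(DailyAggregateJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for MR
    scan.setCacheBlocks(false);  // don't churn the block cache

    TableMapReduceUtil.initTableMapperJob("sensor_data", scan,
        AggMapper.class, Text.class, DoubleWritable.class, job);
    // a reducer would sum per key and INSERT the daily rows via JDBC
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}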

 

While processing the entire HBase table, e.g. every night, is an option
when we go live, it probably isn't one once data volume grows over the
years. So, what options are there for some kind of incremental
aggregation of only new data?

 

- Perhaps using versioning (internal timestamps) might be an option?
(See the scan sketch after this list.)

- Perhaps some kind of (daily) HBase staging table, truncated after its
data is aggregated, is an option?

- How could coprocessors help here? (By the time we go live, they might
be available in e.g. Cloudera's distribution.)

 

etc.
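
For the versioning idea above, a sketch of what an incremental pass over
only new data could look like using Scan.setTimeRange(); the table name
and the bookkeeping helpers are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class IncrementalAggregation {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "sensor_data"); // placeholder name

    long lastRun = loadLastRunTimestamp(); // hypothetical bookkeeping
    long now = System.currentTimeMillis();

    Scan scan = new Scan();
    scan.setTimeRange(lastRun, now); // only cells written since the last run

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // feed r into the daily aggregates here
      }
    } finally {
      scanner.close();
      table.close();
    }
    storeLastRunTimestamp(now);
  }

  // In practice these would read/write a tiny status table or file.
  static long loadLastRunTimestamp() { return 0L; }
  static void storeLastRunTimestamp(long ts) { }
}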

 

Any ideas/comments are appreciated.

 

Thanks,

Thomas

 


Re: Strategies for aggregating data in an HBase table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Or you could just prefix the row keys. I'm not sure if this is needed
natively, or as a tool on top of HBase. Hive, for example, could do
exactly that for you once Hive partitions are implemented for HBase.
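
A tiny sketch of the idea, with a made-up layout (row key = yyyyMM
prefix + entity id), so one month becomes one contiguous key range:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MonthScan {
  // One bounded scan covers exactly one month of data.
  static Scan scanMonth(String yyyymm, String nextYyyymm) {
    return new Scan(Bytes.toBytes(yyyymm),      // start row, inclusive
                    Bytes.toBytes(nextYyyymm)); // stop row, exclusive
  }
}

The usual trade-off applies: a leading time prefix makes the monthly
scan contiguous but funnels all current writes into one region.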

J-D

On Wed, Nov 30, 2011 at 1:34 PM, Sam Seigal <se...@yahoo.com> wrote:
> What about "partitioning" at the table level? For example, create 12
> tables for a given year, one per month. Design the row keys however you
> like, say using SHA/MD5 hashes. Place transactions in the appropriate
> table and then run aggregations against that table alone (this assumes
> you won't get transactions with timestamps more than a month in the
> past). The idea is to archive the tables for a given year and start
> fresh the next. This is acceptable in my use case. I am in the process
> of trying this out, so I don't have any performance numbers or issues
> yet... experts can comment.
>
> On a further note, having HBase support this natively, i.e. one more
> level of partitioning above the row key but below the table, could be
> beneficial for use cases like these. Comments?
>
> On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
>> ...

Re: Strategies for aggregating data in an HBase table

Posted by Sam Seigal <se...@yahoo.com>.
What about "partitioning" at the table level? For example, create 12
tables for a given year, one per month. Design the row keys however you
like, say using SHA/MD5 hashes. Place transactions in the appropriate
table and then run aggregations against that table alone (this assumes
you won't get transactions with timestamps more than a month in the
past). The idea is to archive the tables for a given year and start
fresh the next. This is acceptable in my use case. I am in the process
of trying this out, so I don't have any performance numbers or issues
yet... experts can comment.

On a further note, having HBase support this natively, i.e. one more
level of partitioning above the row key but below the table, could be
beneficial for use cases like these. Comments?
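
A minimal sketch of the routing, assuming a made-up "txn_" + yyyyMM
naming scheme (so a monthly aggregation only has to scan one table):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MonthlyTableRouter {
  // Route each transaction to the table for its month, e.g. "txn_201111".
  static void writeTransaction(Configuration conf, byte[] rowKey,
                               long txnTimestamp, double amount)
      throws IOException {
    String month = new SimpleDateFormat("yyyyMM").format(new Date(txnTimestamp));
    HTable table = new HTable(conf, "txn_" + month); // use HTablePool in practice
    try {
      Put put = new Put(rowKey); // e.g. an MD5/SHA-1 hash, as described above
      put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"),
              Bytes.toBytes(amount));
      table.put(put);
    } finally {
      table.close();
    }
  }
}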

On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> ...

Re: Strategies for aggregating data in an HBase table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas
<Th...@scch.at> wrote:
> Hello,
> ...
>
> While processing the entire HBase table, e.g. every night, is an option
> when we go live, it probably isn't one once data volume grows over the
> years. So, what options are there for some kind of incremental
> aggregation of only new data?

Yeah, you don't want to go there.

>
> - Perhaps using versioning (internal timestamps) might be an option?

I guess you could do rollups and ditch the raw data, if you don't need it.

>
> - Perhaps some kind of (daily) HBase staging table, truncated after its
> data is aggregated, is an option?

If you do the aggregations nightly then you won't have "access to
aggregated data very quickly".

>
> - How could coprocessors help here? (By the time we go live, they might
> be available in e.g. Cloudera's distribution.)

Coprocessors are more of an internal HBase tool, so don't put all your
eggs in that basket until you've played with them. What you could do is
grab the 0.92.0 RC0 tarball and try them out :)
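
To give a flavor, here is roughly what a region observer might look like
against the 0.92-era API (the rollup table and the whole approach are
made up, the signatures should be double-checked against the RC, and a
cross-table RPC on every put is of course not free):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// After every put to the data table, bump a counter in a (made-up)
// "daily_agg" table so rollups stay current without a batch job.
public class CountingObserver extends BaseRegionObserver {
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                      Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    e.getEnvironment().getTable(Bytes.toBytes("daily_agg"))
        .incrementColumnValue(put.getRow(), Bytes.toBytes("c"),
                              Bytes.toBytes("count"), 1L);
  }
}

It would be registered via hbase.coprocessor.region.classes in
hbase-site.xml or per-table through the table descriptor.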

> Any ideas/comments are appreciated.

Normally data is stored in a way that's not easy to query in batch or
analytics mode, so an ETL step is introduced. You'll probably need to do
the same: you could asynchronously stream your data to other HBase
tables, or to Hive or Pig, via logs or replication, and then either
insert it directly in the format it needs to be in or stage it for later
aggregation. If you explore those avenues, I'm sure you'll find concepts
very similar to the RDBMS ones you listed.

You could also keep live counts using atomic increments; you'd issue
those at write time or asynchronously.
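
As a sketch of the live-counter idea (the rollup table and the key
scheme are made up): each write also bumps a per-sensor, per-day
counter, so daily totals are readable immediately.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class LiveCounters {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable agg = new HTable(conf, "daily_agg"); // made-up rollup table

    // The increment is atomic and happens server-side.
    byte[] row = Bytes.toBytes("sensor42|2011-11-28"); // made-up key scheme
    agg.incrementColumnValue(row, Bytes.toBytes("c"),
                             Bytes.toBytes("count"), 1L);
    agg.close();
  }
}

Keep in mind increments are longs, so fractional measurements would have
to be scaled (e.g. stored as thousandths) before being summed this way.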

Hope this helps,

J-D

Re: Strategies for aggregating data in an HBase table

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Also, re: the frontend: that is always a problem. So far we have a
custom data source for this in JasperReports, but JDBC is eventually
possible too. I'm looking into what it takes to mount JPivot on it, but
that is a more serious endeavor, so no big expectations there (unless I
find somebody willing to help).

On Wed, Dec 21, 2011 at 12:14 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> https://github.com/dlyubimov/HBase-Lattice
>
> On Wed, Dec 21, 2011 at 12:13 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> ...

Re: Strategies for aggregating data in an HBase table

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
https://github.com/dlyubimov/HBase-Lattice

On Wed, Dec 21, 2011 at 12:13 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> ...

Re: Strategies for aggregating data in an HBase table

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thomas,

Sorry for the shameless self-promotion, but could you take a look at our
hbase-lattice project? It does incremental OLAP-ish cube compilation,
with custom filtering to optimize composite key scans, and it has a
rudimentary query language as well.

It ships a bunch of standard (and not-so-standard) aggregates for
measure data, plus the ability to add user-defined aggregates relatively
easily through the model definition.

It is at a very early stage, but see if it could fit your purpose; maybe
you could even share some perspective, since I am honestly not an expert
on dimensional data representation.

(I guess I need to add a query shell so people can try it out more
easily...)

On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas
<Th...@scch.at> wrote:
> Hello,
> ...