Posted to user@hbase.apache.org by JC <jc...@marketo.com> on 2013/11/11 13:10:22 UTC

HBase as a transformation engine

We are looking to use HBase as a transformation engine: take data
already loaded into HBase, run some large calculation/aggregation on
that data, and then load the results back into an RDBMS for our BI
analytics tools to use. I am curious about the community's experience
with this and whether there are any best practices. One idea we are
kicking around is using MapReduce 2 and YARN, writing files to HDFS to
be loaded into the RDBMS. We are not sure what pieces are needed for
the complete application, though.
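
To make this concrete, here is roughly the shape of the job we are
imagining, as a sketch only: the class, table and path names below are
placeholders, and the mapper just counts cells per row to stand in for
whatever real aggregation we end up needing.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

    public class AggregateExportJob {

      // Placeholder mapper: emits (rowKey, cellCount); the real job
      // would emit whatever aggregation key and measure we need.
      public static class AggregateMapper
          extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
            Context ctx) throws IOException, InterruptedException {
          ctx.write(new Text(Bytes.toString(row.get())),
              new LongWritable(value.size()));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-aggregate-export");
        job.setJarByClass(AggregateExportJob.class);

        Scan scan = new Scan();  // full-table scan feeding the mappers
        TableMapReduceUtil.initTableMapperJob("metrics", scan,
            AggregateMapper.class, Text.class, LongWritable.class, job);

        job.setReducerClass(LongSumReducer.class); // sums counts per key
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("/export/aggregates"));

        // The text files under /export/aggregates would then be loaded
        // into the RDBMS as a separate step.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }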

Thanks in advance for your help,
JC




Re: HBase as a transformation engine

Posted by Asaf Mesika <as...@gmail.com>.
Are you reading using the HBase client, or do you have an InputFormat
for reading HFiles?


Re: HBase as a transformation engine

Posted by Amit Sela <am...@infolinks.com>.
Hi,

We do something like that programmatically.
We read blobbed HBase data (qualifiers represent cross-sections such as
country_product, and the blob data holds clicks, impressions, etc.).
We have several aggregation tasks (one per MySQL table) that aggregate
the data and insert it (in batches) into MySQL.
I don't know how much data you want to scan and insert, but we scan,
aggregate and insert approximately 7GB as ~12M rows from one HBase
table into 9 MySQL tables, and that takes a little less than 2 hours.
Our analysis shows that ~25% of that time is net HBase read and most of
the time is spent on MySQL inserts.
Since we are in the process of building a new system, optimizing is not
on our agenda, but I would definitely try writing to CSV and bulk
loading into the RDBMS.
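
If we were to try it, the bulk-load side would be just a CSV write plus
one statement. A minimal sketch, where the connection URL, table and
column names are made up and the MySQL JDBC driver must allow local
infile:

    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CsvBulkLoad {
      public static void main(String[] args) throws Exception {
        // 1. Write the aggregated rows to a local CSV file (made-up row).
        try (FileWriter w = new FileWriter("/tmp/aggregates.csv")) {
          w.write("US_widget,12345,678\n"); // country_product,clicks,impressions
        }

        // 2. Load the whole file in one statement instead of batched INSERTs.
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://dbhost/analytics?allowLoadLocalInfile=true",
                 "user", "password");
             Statement s = c.createStatement()) {
          s.execute("LOAD DATA LOCAL INFILE '/tmp/aggregates.csv' "
              + "INTO TABLE stats FIELDS TERMINATED BY ',' "
              + "(country_product, clicks, impressions)");
        }
      }
    }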

Hope that helps.





Re: HBase as a transformation engine

Posted by Jia Wang <ra...@appannie.com>.
Keeping a separate Hadoop cluster that focuses only on analysis is a
better way to go; the HBase cluster then only collects data. You can
use distcp to copy data between the two clusters, which is faster. Your
Hadoop tasks then have to parse the HFile format to read the data,
which can be done but needs some coding; I wonder whether there is
already some code you can reuse to parse HFiles.
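
As a rough starting point, reading an HFile directly looks something
like the sketch below. Note this follows the 0.94-era API and the exact
signatures change between HBase versions, so verify it against yours;
the path is a placeholder for a file copied over with distcp.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.hfile.CacheConfig;
    import org.apache.hadoop.hbase.io.hfile.HFile;
    import org.apache.hadoop.hbase.io.hfile.HFileScanner;

    public class HFileDump {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path to an HFile copied over with distcp.
        Path path = new Path("/data/copied-table/region/cf/somehfile");

        HFile.Reader reader =
            HFile.createReader(fs, path, new CacheConfig(conf));
        reader.loadFileInfo();
        HFileScanner scanner = reader.getScanner(false, false);
        if (scanner.seekTo()) {
          do {
            KeyValue kv = scanner.getKeyValue();
            System.out.println(kv);  // feed the cell into the analysis
          } while (scanner.next());
        }
        reader.close();
      }
    }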

Cheers
Ramon



Re: HBase as a transformation engine

Posted by Vincent Barat <vi...@gmail.com>.
Hi,

We have done this kind of thing using HBase 0.92.1 + Pig, but we
finally had to limit the size of the tables and move the biggest data
to HDFS: loading data directly from HBase is much slower than from
HDFS, and doing it with M/R overloads the HBase region servers, since
several map tasks scan table regions at the same time. So the bigger
your tables are, the higher the load (usually Pig creates one map per
region; I don't know about Hive).

This may not be an issue if your HBase cluster is dedicated to this
kind of job, but if you also have to ensure good random-read latency at
the same time, forget it.
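
That said, if you do run M/R scans against a live cluster anyway, at
least tune the scan so each map is gentler on the region servers. A
small sketch of the job setup ("mytable" and MyMapper are
placeholders):

    // Inside the M/R job setup.
    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round-trips per map task
    scan.setCacheBlocks(false);  // a full scan should not churn the block cache
    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        MyMapper.class, Text.class, LongWritable.class, job);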

Regards,
