Posted to user@hbase.apache.org by Laxman <la...@huawei.com> on 2012/01/30 14:48:22 UTC

Faster Bulkload from Oracle to HBase

Hi,

We have the following use-case.

We have data in relational database (Oracle).
We need to export this data to HBase and perform analysis on this data.
We need to perform this export-import of about 500 GB periodically, say every month.

Following are the different approaches I can see, to the best of my knowledge.
Before testing and finding out the best way myself, I wanted to hear from
the experts here.

Approach #1
===========
1) Export from Oracle to a raw text file (using the Oracle export utility -
faster, and involves no transactional overhead)

2) Upload text file to HDFS

3) Run the bulk load job (HFileOutputFormat.configureIncrementalLoad())
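
Roughly, I imagine the driver for step 3 would look like the sketch below.
This is only a sketch: the table name "mytable", column family "cf" and the
one-value-per-line input format are placeholders (API as in HBase 0.90/0.92):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OracleExportBulkLoad {

  // Parses one exported line ("rowkey,value") into a KeyValue for
  // family "cf"; a real mapper would emit all exported columns.
  static class ExportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(f[0]);
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("col"),
              Bytes.toBytes(f[1])));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "oracle-export-bulkload");
    job.setJarByClass(OracleExportBulkLoad.class);
    job.setMapperClass(ExportMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // exported text on HDFS
    Path hfiles = new Path(args[1]);                       // staging dir for HFiles
    FileOutputFormat.setOutputPath(job, hfiles);

    // Wires in the sorting reducer, TotalOrderPartitioner and
    // HFileOutputFormat so the HFiles match current region boundaries.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
    }
  }
}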

Approach #2
===========
1) Write a custom Job using DBInputFormat to read directly from the database
(sketched below, after step 2).
	- Just a thought, to avoid the multiple hops (Oracle to local FS,
local FS to HDFS, HDFS to HBase) involved in approach #1.

2) Use the HBase bulk load tool
(HFileOutputFormat.configureIncrementalLoad()) to load this data into HBase.
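
A minimal sketch of the job I have in mind is below. The JDBC URL,
credentials, and the ORDERS table with ID/PAYLOAD columns are made-up
placeholders; the mapper would emit KeyValues so that step 2 stays the same.

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleDirectRead {

  // One row of the (hypothetical) ORDERS table.
  static class OrderRecord implements DBWritable {
    long id;
    String payload;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      payload = rs.getString(2);
    }

    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, payload);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
    Job job = new Job(conf, "oracle-direct-read");
    job.setJarByClass(OracleDirectRead.class);
    // Each map task runs its own windowed SELECT against Oracle, which is
    // exactly the per-query overhead the native export utility avoids.
    DBInputFormat.setInput(job, OrderRecord.class, "ORDERS",
        null /* conditions */, "ID" /* orderBy */, "ID", "PAYLOAD");
    // ... then set a mapper that turns OrderRecord into KeyValues and call
    // HFileOutputFormat.configureIncrementalLoad(job, table) as in #1.
  }
}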

Approach #3
===========
1) Use Apache Sqoop (currently under incubation) to achieve my requirement.
	- I'm not sure about the stability of this, though.
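
For reference, I understand a plain Sqoop-to-HBase import is invoked roughly
like this (connection string, source table, row key and column family are
placeholders). As far as I understand, this path writes through the normal
HBase Put API rather than bulk-loading HFiles.

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table ORDERS \
  --hbase-table orders --column-family cf --hbase-row-key ID \
  --num-mappers 8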

Also, please suggest if there is a better approach than the above.




Re: Faster Bulkload from Oracle to HBase

Posted by Tim Robertson <ti...@gmail.com>.
Hi Laxman,

We use both #1 and #3 from MySQL, which also has high-speed exports.
For our 300G and 340M rows, #1 takes us around 3 hours; with Sqoop it
is closer to 8 hrs into our 3-node cluster.

We are having issues with delimiters though (since we have \r, \t and
\n in the database), and are now using Avro as the container format to
overcome these.  A colleague of mine is currently working on 2 patches
for Sqoop to fix some bugs in this.
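
For example, the Avro route is just a flag on the import; something like the
following, with flag names as in the Sqoop releases we are using and a
made-up table name. Since Avro is a binary container, the \r, \t and \n in
field values never meet a text delimiter:

sqoop import --connect jdbc:mysql://dbhost/db --table raw_text \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec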

For the majority of our processing we don't use HBase but go straight to
HDFS, and we use Sqoop wrapped up in Oozie workflows, which is
working very nicely.  We do a lot of ETL processing from one DB to
another through an Oozie workflow combining Sqoop and Hive.

HTH,
Tim




RE: Faster Bulkload from Oracle to HBase

Posted by Laxman <la...@huawei.com>.
Jon & Tim, thanks a lot for your responses. That was really helpful
information.

We will do a comparison and publish our findings.

>> For our 300G and 340M rows, #1 takes us around 3 hours; with Sqoop it
>> is closer to 8 hrs into our 3-node cluster.

@Tim, can you please let me know whether you use Sqoop in conjunction with
Quest's data connector? Quest claims a 5x improvement with their connector.

@Jon, do you feel Approach #3 with Quest's connector may outperform #1?



Re: Faster Bulkload from Oracle to HBase

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Hi Laxman,

I'm an Apache HBase and Sqoop committer.  I haven't run the comparison
you've suggested but my first thoughts are to consider #1 and #3.

Case 1 will natively export data, which should be the fastest way to get
data out of Oracle.  You may need to do some reformatting to use HBase's
importtsv.
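
For example, the two-step importtsv route looks roughly like this (column
mapping, table name and paths are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=/tmp/hfiles mytable /user/laxman/export
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /tmp/hfiles mytable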

Case 2, writing your own DBInputFormat job, is essentially duplicating what
Sqoop does in the generic case.  Here it is essentially executing SQL
queries against the RDBMS to get data out, which will likely be a bit slower
than a database's native bulk export feature.

Case 3, consider using Apache Sqoop in conjunction with Quest's data
connector for Oracle and Hadoop.  It is a free-as-in-beer plugin to Sqoop!
This is probably the fastest from a getting-started point of view (no dev
time) but may not be as performant as #1.  I'd give it a try.
http://www.quest.com/data-connector-for-oracle-and-hadoop/

If you have more questions about Sqoop, feel free to cross post to
sqoop-user@incubator.apache.org!

Jon.


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com