Posted to user@hive.apache.org by Ryan Greenhall <ry...@googlemail.com> on 2011/03/23 11:29:37 UTC

Recommended approaches for large data import from RDBMS into hive (Approx Terabyte of data)

My desire is to be able to run a query on the existing RDBMS (SQLServer) to
de-normalise the required data into single rows that can then be imported
into a Hive table partitioned by date.
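
For concreteness, a minimal sketch of that kind of denormalising query,
assuming purely illustrative page_views and users tables on the SQL Server
side (none of these names come from a real schema):

SELECT
    pv.viewtime,
    u.userid,
    pv.page_url,
    pv.referrer_url,
    pv.ip,
    -- yyyy-MM-dd string, later used as the Hive partition key
    CONVERT(char(10), pv.viewtime, 120) AS dt
FROM dbo.page_views pv
JOIN dbo.users u
    ON u.userid = pv.userid;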

Previously, on a much smaller scale, I have imported data into Hive by
copying a TSV of my data (already partitioned) into HDFS and then loading
it, for example:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view
PARTITION(date='2008-06-08');


I have considered the following options and am looking for feedback or more
effective solutions.

1) Create a data file for each day (we have over a year's worth of data) and
then import each as above.  On a large scale this is likely to be rather
cumbersome, although it could be scripted.

2) Use Sqoop to import all the data into a staging table and then use a
dynamic partition insert.  Unfortunately, Sqoop does not (as far as I can
see) allow importing directly into partitions.

I am not sure how this will work over such a large dataset, as I have seen
a dynamic partition insert fail on a staging table with 900,000 rows of
data.
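
A minimal sketch of option 2, assuming Sqoop has already landed the
denormalised rows in a staging table (page_view_staging and its columns
below are illustrative names only):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- The default caps (1000 partitions in total, 100 per node) may need
-- raising if the insert complains about too many dynamic partitions.
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=1000;

-- The last column of the SELECT (dt, a yyyy-MM-dd string) populates the
-- dynamic partition column of page_view by position.
INSERT OVERWRITE TABLE page_view PARTITION (date)
SELECT viewtime, userid, page_url, referrer_url, ip, dt
FROM page_view_staging;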

Thanks in advance

Ryan

Re: Recommended approaches for large data import from RDBMS into hive (Approx Terabyte of data)

Posted by Thiruvel Thirumoolan <th...@yahoo-inc.com>.
I am not sure how this is going to work over such a large dataset as I have seen examples of a staging table with 900,000 rows of data fail when a dynamic partition import is applied.

What was the problem? I have created 120,000 dynamic partitions using DISTRIBUTE BY; perhaps that will resolve your failure. See http://wiki.apache.org/hadoop/Hive/Tutorial#Dynamic-partition_Insert
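
A minimal sketch of that suggestion, reusing the illustrative staging table
from the earlier sketch. Distributing by the partition value sends all rows
for a given date to the same reducer, so each reducer writes only a few
partitions instead of every task holding a writer open for every partition:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Route each day's rows to a single reducer before the dynamic partition insert.
INSERT OVERWRITE TABLE page_view PARTITION (date)
SELECT viewtime, userid, page_url, referrer_url, ip, dt
FROM page_view_staging
DISTRIBUTE BY dt;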
