Posted to user@hive.apache.org by "Riesland, Zack" <Za...@sensus.com> on 2016/02/13 13:41:23 UTC

Optimizing external table structure

On a daily basis, we move large amounts of data from Hive to HBase, via Phoenix.

In order to do this, we create an external Hive table with the data we need to move (all a subset of one compressed ORC table), and then use the Phoenix CsvBulkUpload utility. From everything I've read, this is the best approach.
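For context, a minimal sketch of that pipeline, assuming the staging table is plain delimited text and the loader is invoked as Phoenix's org.apache.phoenix.mapreduce.CsvBulkLoadTool (table names, columns, the partition filter, paths, and the ZooKeeper quorum are placeholders):

    -- Hive: stage the subset of the ORC table as delimited text
    CREATE EXTERNAL TABLE export_stage (id BIGINT, val STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/staging/export_stage';

    INSERT OVERWRITE TABLE export_stage
    SELECT id, val FROM orc_source WHERE ds = '2016-02-13';

    # Shell: bulk-load the staged CSV into the Phoenix table
    # (client jar path varies by distribution)
    hadoop jar /path/to/phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table MY_PHOENIX_TABLE \
        --input /staging/export_stage \
        --zookeeper zk1,zk2,zk3:2181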

My question is: how can I optimize my external table to make the bulk upload as efficient as possible?

For example, today, my external table is backed by 6,020 files in HDFS, each about 300-400 MB.

This results in a MapReduce job with 12,209 mappers that takes about 3 hours (we don't have a huge cluster - 13 data nodes currently).

Would it be better to have more, smaller files? Fewer, larger files?
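For reference, 12,209 mappers over 6,020 files works out to almost exactly two input splits per file, i.e. a split roughly every 150-200 MB of each 300-400 MB file. A quick way to double-check the file count and sizes backing the table (the path below is a placeholder):

    # Count and size the files backing the external table
    hdfs dfs -count /staging/export_stage
    hdfs dfs -du -h /staging/export_stage | head -5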

RE: Optimizing external table structure

Posted by "Riesland, Zack" <Za...@sensus.com>.
Thanks.

We have 16 disks per node, to answer your question.
________________________________________
From: Jörn Franke [jornfranke@gmail.com]
Sent: Saturday, February 13, 2016 9:46 AM
To: user@hive.apache.org
Subject: Re: Optimizing external table structure

How many disk drives do you have per node?
Generally, one node should have 12 drives, not configured as RAID and not as LVM.

Files could be a little bit larger (4 GB, or better 40 GB - your NameNode will thank you), or use Hadoop Archive (HAR).

I am not sure about the latest status of Phoenix, but maybe you can make HBase tables directly available as external tables in Hive - you would save a lot of time by not converting to CSV.
You could also explore using Sqoop (import from Hive to JDBC / Phoenix, or to HBase).


Re: Optimizing external table structure

Posted by Jörn Franke <jo...@gmail.com>.
How many disk drives do you have per node?
Generally, one node should have 12 drives, not configured as RAID and not as LVM.

Files could be a little bit larger (4 GB, or better 40 GB - your NameNode will thank you), or use Hadoop Archive (HAR).
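A rough sketch of both options - the table and column names, paths, and sizes are placeholders, and the merge properties assume the staging data is written by an INSERT from Hive:

    -- Hive: merge the INSERT output into larger files (example values)
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    SET hive.merge.smallfiles.avgsize=1073741824;  -- extra merge pass if avg file < 1 GB
    SET hive.merge.size.per.task=4294967296;       -- aim for ~4 GB merged files

    INSERT OVERWRITE TABLE export_stage
    SELECT id, val FROM orc_source WHERE ds = '2016-02-13';

    # Or pack the existing small files into a Hadoop Archive
    hadoop archive -archiveName export_stage.har -p /staging export_stage /staging/har
    hdfs dfs -ls har:///staging/har/export_stage.har/export_stage

As far as I know, the HAR mainly helps the NameNode (far fewer objects to track); it does not reduce the amount of data the bulk-load job has to read.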

I am not sure about the latest status of Phoenix, but maybe you can make HBase tables directly available as external tables in Hive (a sketch follows below) - you would save a lot of time by not converting to CSV.
You could also explore using Sqoop (import from Hive to JDBC / Phoenix, or to HBase).
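For the Hive-over-HBase idea, a minimal sketch using the standard HBaseStorageHandler - the HBase table, column family, and columns are placeholders, and note that writes through this mapping go in as HBase puts rather than via HFile bulk load:

    -- Hive: map an existing HBase table as an external table
    CREATE EXTERNAL TABLE hbase_target (
      rowkey STRING,
      val    STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
    TBLPROPERTIES ("hbase.table.name" = "MY_HBASE_TABLE");

    -- Then write straight from the ORC source, skipping the CSV step
    INSERT OVERWRITE TABLE hbase_target
    SELECT CAST(id AS STRING), val FROM orc_source WHERE ds = '2016-02-13';

If the target table is actually managed by Phoenix, the byte encodings written this way may not line up with what Phoenix expects, so this fits plain HBase tables best.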
