Posted to user@hive.apache.org by Ravi Shetye <ra...@gmail.com> on 2012/08/29 12:54:16 UTC

Performance comparison: external S3 table vs managed table

I am launching a Hive cluster in interactive mode
(http://aws.amazon.com/elasticmapreduce/faqs/#hive-6).

I have data on S3 laid out like this:

*s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz*

*s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz*

*s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz*

I create two tables

CREATE EXTERNAL TABLE s3Table (...)
PARTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://ravi/logs/';

CREATE TABLE managedTable (...)   ==> same definition
PARTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

I load data into both tables
ALTER TABLE s3Table RECOVER PARTITIONS;
and
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
INSERT OVERWRITE TABLE managedTable PARTITION (adv_id,dt) SELECT * FROM
s3Table;

Intuitively, I expect managedTable to perform better.

I run a count(*) query on both tables, which contain approximately
40,000,000 rows.
The query on s3Table generates one mapper per partition and finishes in 149 sec.
The query on managedTable generates one mapper per HDFS block and finishes in
238 sec.
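
The comparison above corresponds to queries of roughly this form (a sketch,
assuming a plain unfiltered count on each table; the exact statements are not
reproduced here):

```sql
-- Count over the external table backed by s3://ravi/logs/
SELECT COUNT(*) FROM s3Table;

-- Same count over the managed table populated from s3Table
SELECT COUNT(*) FROM managedTable;
```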

Can I improve the performance of managedTable with any tuning parameters?
Should I never use managedTable?

I ran the experiment on an m1.large cluster to rule out any I/O vs network
explanation.
-- 
RAVI SHETYE