Posted to user@hive.apache.org by Ravi Shetye <ra...@vizury.com> on 2012/08/27 14:57:54 UTC

Re: Hive on EMR on S3 : Beginner

Thanks to all your help, I have moved ahead with my project.
I create the table as
CREATE TABLE test (...)
PARTITIONED BY (adid STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://logs/'

Then I run ALTER TABLE results RECOVER PARTITIONS;

and then start querying.

The issue now is that Hive fetches the data from S3 to HDFS for every single
query, so if I remove the S3 buckets the results change.

How can I remove this dependency? I would like to store the data in HDFS and
then query it repeatedly.

Am I even trying a valid use case, or am I doing something fundamentally
wrong?

RE: Hive on EMR on S3 : Beginner

Posted by ri...@nokia.com.
Hi Ravi,

The idea of using EMR is that you don't have to keep a Hadoop cluster running all the time. You put all your data in S3, spin up an EMR cluster, do the computation, and store the results back in S3.
Ideally, data in S3 should not be moved around: Hive will always read from S3 if you have defined an S3 LOCATION and the table is external.

If you have some tables that you access frequently, make them managed tables; Hive stores the data for managed tables in HDFS.
So you might create a managed table (without the EXTERNAL keyword) called result_managed, with the same fields as the result table, and do something like

INSERT OVERWRITE TABLE result_managed SELECT * FROM result;

Basically you are copying the data from external table to a managed table, nothing else.
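
Since the table in the question is partitioned by adid and dt, the copy described above might look roughly like the sketch below. It follows the names used in this reply (result, result_managed). The column names col1 and col2 are placeholders, because the original column list is elided as (...). It also assumes the Hive version on the cluster supports dynamic partition inserts.

```sql
-- Managed table: no EXTERNAL keyword and no LOCATION, so Hive keeps the data
-- in its own warehouse directory on HDFS.
-- col1 and col2 are hypothetical placeholders for the elided column list.
CREATE TABLE result_managed (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (adid STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Let Hive derive the partition values from the SELECT output
-- (assumes dynamic partitioning is available in this Hive version).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns go last in the SELECT list, in the same order as in
-- the PARTITION clause.
INSERT OVERWRITE TABLE result_managed PARTITION (adid, dt)
SELECT col1, col2, adid, dt FROM result;
```

Once the copy finishes, queries against result_managed read from HDFS. Keep in mind that HDFS on an EMR cluster goes away when the cluster is terminated, so the S3 copy remains the durable one.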
Another thing to note when using Hive with S3 is SET hive.optimize.s3.query=true; Amazon has made some optimizations of its own for Hive to work with S3.

Hope this helps.

Thanks,
Richin
