Posted to user@hive.apache.org by Ravi Shetye <ra...@vizury.com> on 2012/08/24 14:26:20 UTC

Hive on EMR on S3 : Beginner

I have the data in s3 bucket in the following manner

s3://logs/ad1date1.log.gz
s3://logs/ad1date2.log.gz
s3://logs/ad1date3.log.gz
s3://logs/ad1date4.log.gz
s3://logs/ad2date1.log.gz
s3://logs/ad2date2.log.gz
s3://logs/ad2date3.log.gz
s3://logs/ad2date4.log.gz

I have to load some of them into a single Hive table, for which I am
using the following query:

CREATE EXTERNAL TABLE analyze_files_tab (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://logs/joined_analyze_files_hive/';

How should I load the data from these files into the table?

Will ALTER TABLE analyze_files_tab RECOVER PARTITIONS; do the trick?

Don't I need to specify which file maps to which adv_id,date combination?

Also, a pointer to a good tutorial for beginners would be helpful.



RE: Hive on EMR on S3 : Beginner

Posted by ri...@nokia.com.
Hi Ravi,

The idea of using EMR is that you don't have to have a Hadoop cluster running all the time. So put all your data in S3, spin up an EMR cluster, do your computation, and store the results back in S3.
In the ideal case the data in S3 should not be moved around: Hive will always read from S3 if you have defined an S3 LOCATION and the table is external.

If you have some tables which you access frequently, make them managed tables; Hive stores the data for managed tables in HDFS.
So you might create a managed table (without the EXTERNAL keyword) result_managed, with fields similar to the result table, and do something like

INSERT OVERWRITE TABLE result_managed SELECT * FROM result;

Basically you are copying the data from the external table to a managed table, nothing else.
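For instance, a minimal sketch (the names result and result_managed follow the thread; the column list is abbreviated and illustrative):

-- managed table: no EXTERNAL keyword and no LOCATION clause,
-- so Hive keeps its data in the HDFS warehouse
CREATE TABLE result_managed (
  cookie STRING,
  url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- copy the S3-backed external table's rows into HDFS once
INSERT OVERWRITE TABLE result_managed
SELECT cookie, url FROM result;

Subsequent queries against result_managed then read from HDFS rather than S3.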
Another thing to note when using Hive with S3: SET hive.optimize.s3.query=true; (Amazon has done some optimizations of their own for Hive to work with S3).

Hope this helps.

Thanks,
Richin


Re: Hive on EMR on S3 : Beginner

Posted by Ravi Shetye <ra...@vizury.com>.
Thanks to all your help, I have moved ahead with my project.
So I create the table as
CREATE TABLE test (...)
PARTITIONED BY (adid STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://logs/'

Do a ALTER TABLE test RECOVER PARTITIONS;

and then start querying.

Now the issue is that it fetches data from S3 to HDFS for every single query,
so if I remove the S3 buckets the results change.

How can I remove this dependency, store the data in HDFS, and then query
it repeatedly?

Am I even attempting a valid use case, or am I doing something fundamentally
wrong?

Re: Hive on EMR on S3 : Beginner

Posted by Pedro Figueiredo <pf...@89clouds.com>.
Hi,

On 25 Aug 2012, at 05:58, Ravi Shetye <ra...@vizury.com> wrote:

> Thanks Richin and Pedro,
> So a final clarification
> Another way of doing it, apart from dynamic partitioning: if you can create your directories like below, either manually or via the ETL process you use to produce the table data, it is pretty easy.
> 
> 	s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz
> 	s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz
> 	s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz
> 
> 1) Since I have used PARTITIONED BY (adv_id STRING, date STRING), the Hive system will read the directory name adv_id=123 and understand that the data within this directory can be accessed via a pseudo-column adv_id?

Yes.

> 2) It would be wrong if I use PARTITIONED BY (date STRING, adv_id STRING) and keep the same directory structure?

Yes, the order of the fields in PARTITIONED BY must match the directory structure.

> 3) Also, it won't work if I store data in s3://ravi/logs/123/2012-01-01/log.gz?

Correct, that won't work; you need the xxx= naming.
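(If renaming the files is not an option, one workaround worth noting: a partition can be mapped to an arbitrary location statically, e.g.

ALTER TABLE results ADD PARTITION (adv_id='123', date='2012-01-01')
LOCATION 's3://ravi/logs/123/2012-01-01/';

RECOVER PARTITIONS, however, only discovers directories that follow the key=value naming.)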

Cheers,

Pedro

Re: Hive on EMR on S3 : Beginner

Posted by Ravi Shetye <ra...@vizury.com>.
Thanks Richin and Pedro.
So, a final clarification:

> Another way of doing it, apart from dynamic partitioning: if you can create your directories like below, either manually or via the ETL process you use to produce the table data, it is pretty easy.
>
> s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz
> s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz
> s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz


1) Since I have used PARTITIONED BY (adv_id STRING, date STRING), the Hive
system will read the directory name adv_id=123 and understand that the
data within this directory can be accessed via a pseudo-column adv_id?

2) It would be wrong if I use PARTITIONED BY (date STRING, adv_id STRING)
and keep the same directory structure?

3) Also, it won't work if I store data in s3://ravi/logs/123/2012-01-01/log.gz?


RE: Hive on EMR on S3 : Beginner

Posted by ri...@nokia.com.
Hi Ravi,

Another way of doing it, apart from dynamic partitioning: if you can create your directories like below, either manually or via the ETL process you use to produce the table data, it is pretty easy.

s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz
s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz
s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz


and so on..

and then when you define your table


CREATE EXTERNAL TABLE results
        (field1 STRING,
        field2 STRING)
        PARTITIONED BY (adv_id STRING, date STRING)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LOCATION 's3://ravi/logs/';



Once the table is created do a



ALTER TABLE results RECOVER PARTITIONS; -- this registers all the partition directories found under the table location, making their data visible in the results table
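(A side note for anyone running this outside EMR: ALTER TABLE ... RECOVER PARTITIONS is an Amazon EMR extension; on stock Apache Hive the equivalent is

MSCK REPAIR TABLE results;

which likewise scans the table location for key=value directories and registers them.)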



SELECT * FROM results; -- should return all the rows



For queries by partition you could do something like



SELECT * FROM results WHERE adv_id='123' AND date='2012-01-01'; -- this query will just read that specific directory/partition


Here is a link to an AWS forum thread describing this in more detail - https://forums.aws.amazon.com/thread.jspa?threadID=92862

Thanks,
Richin




Re: Hive on EMR on S3 : Beginner

Posted by Pedro Figueiredo <pf...@89clouds.com>.
Hi,

On 24 Aug 2012, at 14:08, Ravi Shetye wrote:

> 
> Is this all I need to do to load the data?
> how will the system know what data will go into what partition?
> As I understand it, the partition columns should be pseudo-columns and not part of the actual data.

Sorry, I just copy-and-pasted your table definition; obviously the results table will be something else. The partition values will come from the SELECT statement.
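To make that concrete, a rough sketch (assuming, for illustration, that adv_id can be taken from adv_id_dummy and the date derived from timestp): with dynamic partitioning, the partition values must be the last columns of the SELECT, in PARTITION-clause order.

INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
SELECT cookie, d2, url, d4, d5, d6, adv_id_dummy, timestp, ip,
       userAgent, stage, d12, d13,
       adv_id_dummy,       -- becomes the adv_id partition value
       to_date(timestp)    -- becomes the date partition value
FROM analyze_files_tab;    -- needs the two dynamic-partition SET statements below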

> 
> Also, if I have to load just 2 of the files, say
> s3://logs/ad1date1.log.gz and
> s3://logs/ad2date4.log.gz, how do I specify it?

You have to put them in separate directories. You could have dailies:

s3://logs/date=date1/files.for.day1.gz
s3://logs/date=date2/files.for.day2.gz
etc.

If you create this table as a partitioned table (PARTITIONED BY (date STRING) LOCATION 's3n://logs/'), you can then filter on date = 'date1'. To get the partitions into the table, do a recover or add the partitions statically.
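A sketch of the recover route (the table name logs_by_day and the one-column schema are illustrative):

CREATE EXTERNAL TABLE logs_by_day (line STRING)
PARTITIONED BY (date STRING)
LOCATION 's3n://logs/';

ALTER TABLE logs_by_day RECOVER PARTITIONS;

SELECT count(*) FROM logs_by_day WHERE date = 'date1'; -- reads only s3n://logs/date=date1/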

Cheers,

Pedro

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting





Re: Hive on EMR on S3 : Beginner

Posted by Ravi Shetye <ra...@vizury.com>.
Thanks for the reply.

Let's concentrate on the second case:

CREATE EXTERNAL TABLE results (cookie STRING,
   d2 STRING,
   url STRING,
   d4 STRING,
   d5 STRING,
   d6 STRING,
   adv_id_dummy STRING,
   timestp STRING,
   ip STRING,
   userAgent STRING,
   stage STRING,
   d12 STRING,
   d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';

You can then

INSERT OVERWRITE TABLE results PARTITION (adv_id, date)

Is this all I need to do to load the data?
How will the system know what data will go into what partition?
As I understand it, the partition columns should be pseudo-columns and not part of the actual data.

Also, if I have to load just 2 of the files, say
s3://logs/ad1date1.log.gz and
s3://logs/ad2date4.log.gz, how do I specify it?




Re: Hive on EMR on S3 : Beginner

Posted by Pedro Figueiredo <pf...@89clouds.com>.
Hi,

On 24 Aug 2012, at 13:26, Ravi Shetye wrote:

> I have the data in s3 bucket in the following manner
> 
> s3://logs/ad1date1.log.gz
> s3://logs/ad1date2.log.gz
> s3://logs/ad1date3.log.gz
> s3://logs/ad1date4.log.gz
> s3://logs/ad2date1.log.gz
> s3://logs/ad2date2.log.gz
> s3://logs/ad2date3.log.gz
> s3://logs/ad2date4.log.gz
> 
> 
If you do

> CREATE EXTERNAL TABLE analyze_files_tab (cookie STRING,
> d2 STRING, 
> url STRING, 
> d4 STRING, 
> d5 STRING, 
> d6 STRING, 
> adv_id_dummy STRING,
> timestp STRING,
> ip STRING,
> userAgent STRING,
> stage STRING,
> d12 STRING,
> d13 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://logs/';
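(A quick sanity check at this point, e.g. SELECT * FROM analyze_files_tab LIMIT 5;, will confirm the tab-separated fields line up.)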

you'll have all of it in a table. If you then want the results partitioned, you can do

CREATE EXTERNAL TABLE results (cookie STRING,
  d2 STRING, 
  url STRING, 
  d4 STRING, 
  d5 STRING, 
  d6 STRING, 
  adv_id_dummy STRING,
  timestp STRING,
  ip STRING,
  userAgent STRING,
  stage STRING,
  d12 STRING,
  d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';

You can then

INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
<your query>

Note that to use dynamic partitions you have to first run
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true; 
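(An aside on those settings: strict mode exists to keep a typo from creating a flood of partitions, and it requires at least one leading static partition key. If you would rather leave strict mode on, you can pin adv_id and keep only date dynamic:

INSERT OVERWRITE TABLE results PARTITION (adv_id='ad1', date)
<your query>

with 'ad1' standing in for a real advertiser id.)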

Cheers,

Pedro
Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting