Posted to user@hive.apache.org by Erik Thorson <et...@varickmm.com> on 2012/12/10 22:46:18 UTC

Partition by directory

Hello All,

I have been using the AWS EMR setup for some time now and am currently in the process of implementing Spark/Shark on my own cluster. I am installing from https://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz, which includes Hive 0.9.0. I am using this with S3 and am unable to recover partitions from a directory that contains a series of other directories (the partitions). I want two partitions, 2012-10-25 and 2012-10-26, each containing its respective files. For example, I have the following files located at s3://varickTest3/nn/.


drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-25

-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00000

-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00001

drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-26

-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00000

-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00001


When I run the following in Hive (not Shark):


CREATE EXTERNAL TABLE wiki(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)

PARTITIONED BY (ds STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://varickTest3/nn';

ALTER TABLE wiki RECOVER PARTITIONS;


This will result in an empty table.


I have tried many iterations of this and nothing has worked so far, including adding:

MSCK REPAIR TABLE wiki;

And using s3 rather than s3n (credentials for both schemes are set in core-site.xml).


And setting the options:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;
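(As far as I can tell, those two settings govern dynamic-partition INSERTs rather than partition discovery, so they would only come into play when writing data through Hive, along these lines; `wiki_staging` here is a hypothetical unpartitioned source table, not something from the setup above:)

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamic-partition write: Hive routes each row into the partition
-- named by the final ds column of the SELECT.
INSERT OVERWRITE TABLE wiki PARTITION (ds)
SELECT id, title, last_modified, xml, text, ds
FROM wiki_staging;
```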


Although if I use:

LOCATION 's3n://varickTest3/nn/*'


The table will have content but I am still unable to recover partitions.


Is there any way, using settings or the data layout (rather than writing a script), to partition the table by directory as I can on AWS?
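(One workaround that does not depend on EMR's extension, sketched here under the assumption that the table definition and S3 layout above are as shown, is to register each directory explicitly with ALTER TABLE ... ADD PARTITION:)

```sql
-- Register each date directory as a partition explicitly.
-- Stock Hive syntax; no RECOVER PARTITIONS needed.
ALTER TABLE wiki ADD IF NOT EXISTS
  PARTITION (ds='2012-10-25') LOCATION 's3n://varickTest3/nn/ds=2012-10-25'
  PARTITION (ds='2012-10-26') LOCATION 's3n://varickTest3/nn/ds=2012-10-26';
```

This still amounts to one statement per batch of new directories, so it does not fully answer the "without a script" part of the question, but it works on a plain Hive metastore.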


Thank you for any help anyone can give me.

Re: Partition by directory

Posted by Mark Grover <gr...@gmail.com>.
Erik,
Did you find out the answer to this? I would be curious to hear what
the problem is.

BTW, I would check the Hive logs (/var/log/apps/hive, /var/log/hive,
or similar on EMR). Try increasing the log level and see if that
helps.
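(One way to do that for a single run, assuming the standard hive CLI launcher is on the PATH, is to pass hive.root.logger on the command line; the command is printed here so it can be copy-pasted on the cluster:)

```shell
# One-shot DEBUG run of the repair. hive.root.logger is a standard
# Hive launcher property; DEBUG,console sends verbose logging to stderr.
cmd="hive -hiveconf hive.root.logger=DEBUG,console -e 'MSCK REPAIR TABLE wiki;'"
echo "$cmd"
```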

Given that EMR comes with its own distribution of Hive (which, last I
saw, was 0.8.x), it would be interesting to see how Shark's Hive 0.9
interacts with EMR's version of Hive. FWIW, commands like
"ALTER TABLE ... RECOVER PARTITIONS" are only available in EMR's Hive.

Keep us posted!
Mark

On Mon, Dec 10, 2012 at 1:46 PM, Erik Thorson <et...@varickmm.com> wrote:
> [quoted original message snipped]