You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Mike Stec <mi...@actionx.com> on 2014/02/28 20:05:46 UTC

Loading and partitioning data when root partition value is variable

Hi,

I’m using Hive 0.11 on EMR to join data from S3 and write the joined data back to it. The join working fine but I have an issue with the directory structure of the input data. The input directories are structured like:

s3://bucket/input_data/id=$ID/tblname=$TBL/dt=DT/hr=$HR

This is a pretty standard use case for partitioning but the issue is that I want to load in the data for all IDs but only some of the subdirectories for an ID, i.e.

s3://bucket/input_data/id=1/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=2/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=3/tblname=table/dt=2014-02-28/hr=12
s3://bucket/input_data/id=4/tblname=table/dt=2014-02-28/hr=12

and so on. 

I know that I can’t use dynamic partitioning in this case because according to the Hive documentation you can’t do:

INSERT OVERWRITE TABLE table PARTITION(id, tblname=“table”, dt='2014-02-28', hr=12)

And if I have PARTITIONED BY (token STRING, tbl STRING, dt STRING, hr INT) in my CREATE TABLE statement and do:

ALTER TABLE table ADD PARTITION (tblname=“table”, dt=“2014-02-28”, hr=12)

I get a SemanticException error that says, "Invalid partition spec specified table is partitioned but partition spec is not specified or does not fully match table partitioning”

Is there a way to deal with this problem in Hive itself or will I have to use a script or something else in conjunction with Hive to get around this?