Posted to user@hive.apache.org by Guy Doulberg <Gu...@conduit.com> on 2011/04/07 08:45:42 UTC

Syncing HDFS directories with Hive partitions

Hey folks,

I wanted to consult with you on something that has been bothering me for a while...


I have declared external tables, and these tables are partitioned by dates_hour. I have a batch Hadoop process that updates the files under the partitions. I want the data to be accessible via Hive as soon as it is updated.

I came up with 3 solutions, each with its own problem:
1. Creating all partitions a month in advance. This creates empty directories on HDFS for the future partitions. As a result, a query using ">" might fail the job, since it reads empty input.
2. When the batch finishes its work, it tells Hive that new partitions have been added. The batch then needs to "know" about Hive in order to update it, and I want the batch to be agnostic of Hive.
3. Having a process in the crontab that reads HDFS and finds all the partitions available there, then lists all the declared partitions, finds the delta, and adds the partitions in the delta.
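
To make option 3 concrete, here is a minimal sketch of the delta step only. It assumes the partition directories follow the dates_hour=<value> naming, that the directory listing comes from something like "hadoop fs -ls" and the declared partitions from "SHOW PARTITIONS". The table name "events" and base path "/data/events" are hypothetical, and the function names are made up for illustration.

```python
def partition_value(path, column="dates_hour"):
    """Return the value from a path whose last component is column=value, else None."""
    leaf = path.rstrip("/").rsplit("/", 1)[-1]
    prefix = column + "="
    if leaf.startswith(prefix):
        return leaf[len(prefix):]
    return None


def missing_partitions(hdfs_dirs, declared, column="dates_hour"):
    """Partition values present on HDFS but not declared in Hive, sorted."""
    on_hdfs = {v for v in (partition_value(p, column) for p in hdfs_dirs) if v}
    in_hive = {v for v in (partition_value(p, column) for p in declared) if v}
    return sorted(on_hdfs - in_hive)


def add_partition_statements(table, base_path, values, column="dates_hour"):
    """Build one ALTER TABLE ... ADD PARTITION statement per missing value."""
    template = (
        "ALTER TABLE {t} ADD IF NOT EXISTS PARTITION ({c}='{v}') "
        "LOCATION '{b}/{c}={v}';"
    )
    return [
        template.format(t=table, c=column, v=v, b=base_path.rstrip("/"))
        for v in values
    ]


if __name__ == "__main__":
    # Hypothetical inputs: a directory listing and the SHOW PARTITIONS output.
    hdfs_dirs = [
        "/data/events/dates_hour=2011040707",
        "/data/events/dates_hour=2011040708",
        "/data/events/_logs",  # not a partition directory, so it is ignored
    ]
    declared = ["dates_hour=2011040707"]
    delta = missing_partitions(hdfs_dirs, declared)
    for statement in add_partition_statements("events", "/data/events", delta):
        print(statement)
```

The cron job would feed the printed statements to the Hive CLI. Because the statements use IF NOT EXISTS (assuming the Hive version in use supports it), running the job twice over the same listing should be harmless.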


Do you have other solutions?
Or improvements?


Thanks.
Guy