Posted to user@hive.apache.org by Chris Goffinet <go...@digg.com> on 2009/08/11 20:15:45 UTC

Dynamic Partitioning?

Hi

I was wondering if anyone has thought about the possibility of having
dynamic partitioning in Hive? Right now you typically use LOAD DATA or
ALTER TABLE to add new partitions. It would be great if applications
like Scribe, which can load data into HDFS, could just place the data
into the correct folder structure for your partitions on HDFS. Has
anyone investigated this? What is everyone else doing in regard to
things like this? It seems a little error-prone to have a cron job run
every day adding new partitions. It might not even be possible to do
dynamic partitioning, since it's a metadata read. But I'd love to hear
thoughts.
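For reference, the manual flow being described can be sketched in HiveQL roughly like this (the table, column, and path names are made up for illustration):

```sql
-- Assuming a hypothetical table partitioned by a dt string column:
-- register a new day's data that already sits in the right folder.
ALTER TABLE page_views ADD PARTITION (dt='2009-08-11')
  LOCATION '/user/hive/warehouse/page_views/dt=2009-08-11';

-- Or move a file into place and register the partition in one step:
LOAD DATA INPATH '/tmp/page_views.log'
  INTO TABLE page_views PARTITION (dt='2009-08-11');
```

It is one of these statements that the cron job would be issuing per day.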

-Chris

Re: Dynamic Partitioning?

Posted by Schubert Zhang <zs...@gmail.com>.
We have the following use case:

1. We have a periodic MapReduce job that pre-processes the source data
(files) and puts the output files into an HDFS directory. That HDFS
directory corresponds to a Hive table (the table should be
partitioned). The MapReduce job writes its output into different
partitions based on data analysis.

2. We want Hive to recognize any newly appearing partitions in the HDFS
sub-directories under the table's root directory. The MapReduce job may
add new files to newly created partitions or to existing ones.

3. We also need a compaction/merging process that periodically compacts
or merges the existing partitions into bigger files.
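A minimal sketch of the partition-style directory layout such a job would write, with local paths standing in for HDFS and all names (table root, columns, file names) purely hypothetical:

```python
import os

def partition_path(root, dt, region):
    # Hive maps each key=value directory level to one partition column.
    return os.path.join(root, "dt=%s" % dt, "region=%s" % region)

def write_record(root, dt, region, line):
    # Each output record lands under its partition's directory,
    # mirroring the table's root-directory layout on HDFS.
    d = partition_path(root, dt, region)
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-00000"), "a") as f:
        f.write(line + "\n")

write_record("/tmp/events", "2009-08-18", "us", "click\t42")
```

The point of the request is that once the files exist in this layout, Hive could pick the partitions up without a separate ALTER TABLE step.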



On Tue, Aug 18, 2009 at 9:46 AM, Prasad Chakka <pc...@facebook.com> wrote:

> Well, the code to infer partitions from an HDFS directory exists in an
> old version of Hive. You would need to bring that back (and possibly
> make some modifications to reflect the latest code). But the work
> involved here is to disallow such tables being marked as EXTERNAL and
> also disallow setting partition properties. There may be a couple of
> other things that need to be taken care of that I can't think of right
> now.
>
> It doesn’t look like much.
>
> Prasad
>
> ------------------------------
> *From: *Chris Goffinet <go...@digg.com>
> *Reply-To: *<hi...@hadoop.apache.org>
> *Date: *Mon, 17 Aug 2009 18:38:40 -0700
> *To: *<hi...@hadoop.apache.org>
> *Subject: *Re: Dynamic Partitioning?
>
> How much work is involved for such a feature?
>
> -Chris
>
> On Aug 17, 2009, at 6:19 PM, Prasad Chakka wrote:
>
>  We could make this a per-table property, for tables that don't use
> the extended feature set...
>
>
> ------------------------------
> *From: *Frederick Oko <frederick.oko@gmail.com>
> *Reply-To: *<hive-user@hadoop.apache.org>
> *Date: *Thu, 13 Aug 2009 02:12:54 -0700
> *To: *<hive-user@hadoop.apache.org>
> *Subject: *Re: Dynamic Partitioning?
>
> Actually, this is what Hive originally did -- it used to trust
> partitions it discovered via HDFS. This blind trust could be leveraged
> for just what you are requesting, as partitions do follow a simple
> directory scheme (and there is precedent for such out-of-band data
> loading). However, it became incompatible with the extended feature
> set of external tables and per-partition schemas introduced earlier
> this year. Re-enabling this behavior via configuration is currently
> tracked as https://issues.apache.org/jira/browse/HIVE-493,
> 'automatically infer existing partitions of table from HDFS files'.
>
> On Tue, Aug 11, 2009 at 11:15 AM, Chris Goffinet <goffinet@digg.com> wrote:
>
> Hi
>
> I was wondering if anyone has thought about the possibility of having
> dynamic partitioning in Hive? Right now you typically use LOAD DATA or
> ALTER TABLE to add new partitions. It would be great if applications
> like Scribe, which can load data into HDFS, could just place the data
> into the correct folder structure for your partitions on HDFS. Has
> anyone investigated this? What is everyone else doing in regard to
> things like this? It seems a little error-prone to have a cron job run
> every day adding new partitions. It might not even be possible to do
> dynamic partitioning, since it's a metadata read. But I'd love to hear
> thoughts.
>
> -Chris

Re: Dynamic Partitioning?

Posted by Prasad Chakka <pc...@facebook.com>.
Well, the code to infer partitions from an HDFS directory exists in an old version of Hive. You would need to bring that back (and possibly make some modifications to reflect the latest code). But the work involved here is to disallow such tables being marked as EXTERNAL and also disallow setting partition properties. There may be a couple of other things that need to be taken care of that I can't think of right now.

It doesn't look like much.

Prasad

________________________________
From: Chris Goffinet <go...@digg.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Mon, 17 Aug 2009 18:38:40 -0700
To: <hi...@hadoop.apache.org>
Subject: Re: Dynamic Partitioning?

How much work is involved for such a feature?

-Chris

On Aug 17, 2009, at 6:19 PM, Prasad Chakka wrote:

We could make this a per-table property, for tables that don't use the extended feature set...


________________________________
From: Frederick Oko <frederick.oko@gmail.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Thu, 13 Aug 2009 02:12:54 -0700
To: <hive-user@hadoop.apache.org>
Subject: Re: Dynamic Partitioning?

Actually, this is what Hive originally did -- it used to trust partitions it discovered via HDFS. This blind trust could be leveraged for just what you are requesting, as partitions do follow a simple directory scheme (and there is precedent for such out-of-band data loading). However, it became incompatible with the extended feature set of external tables and per-partition schemas introduced earlier this year. Re-enabling this behavior via configuration is currently tracked as https://issues.apache.org/jira/browse/HIVE-493, 'automatically infer existing partitions of table from HDFS files'.

On Tue, Aug 11, 2009 at 11:15 AM, Chris Goffinet <goffinet@digg.com> wrote:
Hi

I was wondering if anyone has thought about the possibility of having dynamic partitioning in Hive? Right now you typically use LOAD DATA or ALTER TABLE to add new partitions. It would be great if applications like Scribe, which can load data into HDFS, could just place the data into the correct folder structure for your partitions on HDFS. Has anyone investigated this? What is everyone else doing in regard to things like this? It seems a little error-prone to have a cron job run every day adding new partitions. It might not even be possible to do dynamic partitioning, since it's a metadata read. But I'd love to hear thoughts.

-Chris





Re: Dynamic Partitioning?

Posted by Chris Goffinet <go...@digg.com>.
How much work is involved for such a feature?

-Chris

On Aug 17, 2009, at 6:19 PM, Prasad Chakka wrote:

> We could make this a per-table property, for tables that don't use
> the extended feature set...
>
>
> From: Frederick Oko <fr...@gmail.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Thu, 13 Aug 2009 02:12:54 -0700
> To: <hi...@hadoop.apache.org>
> Subject: Re: Dynamic Partitioning?
>
> Actually, this is what Hive originally did -- it used to trust
> partitions it discovered via HDFS. This blind trust could be
> leveraged for just what you are requesting, as partitions do follow
> a simple directory scheme (and there is precedent for such
> out-of-band data loading). However, it became incompatible with the
> extended feature set of external tables and per-partition schemas
> introduced earlier this year. Re-enabling this behavior via
> configuration is currently tracked as
> https://issues.apache.org/jira/browse/HIVE-493, 'automatically infer
> existing partitions of table from HDFS files'.
>
> On Tue, Aug 11, 2009 at 11:15 AM, Chris Goffinet <go...@digg.com>  
> wrote:
> Hi
>
> I was wondering if anyone has thought about the possibility of having
> dynamic partitioning in Hive? Right now you typically use LOAD DATA
> or ALTER TABLE to add new partitions. It would be great if
> applications like Scribe, which can load data into HDFS, could just
> place the data into the correct folder structure for your partitions
> on HDFS. Has anyone investigated this? What is everyone else doing in
> regard to things like this? It seems a little error-prone to have a
> cron job run every day adding new partitions. It might not even be
> possible to do dynamic partitioning, since it's a metadata read. But
> I'd love to hear thoughts.
>
> -Chris
>
>


Re: Dynamic Partitioning?

Posted by Prasad Chakka <pc...@facebook.com>.
We could make this a per-table property, for tables that don't use the extended feature set...


________________________________
From: Frederick Oko <fr...@gmail.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Thu, 13 Aug 2009 02:12:54 -0700
To: <hi...@hadoop.apache.org>
Subject: Re: Dynamic Partitioning?

Actually, this is what Hive originally did -- it used to trust partitions it discovered via HDFS. This blind trust could be leveraged for just what you are requesting, as partitions do follow a simple directory scheme (and there is precedent for such out-of-band data loading). However, it became incompatible with the extended feature set of external tables and per-partition schemas introduced earlier this year. Re-enabling this behavior via configuration is currently tracked as https://issues.apache.org/jira/browse/HIVE-493, 'automatically infer existing partitions of table from HDFS files'.

On Tue, Aug 11, 2009 at 11:15 AM, Chris Goffinet <go...@digg.com> wrote:
Hi

I was wondering if anyone has thought about the possibility of having dynamic partitioning in Hive? Right now you typically use LOAD DATA or ALTER TABLE to add new partitions. It would be great if applications like Scribe, which can load data into HDFS, could just place the data into the correct folder structure for your partitions on HDFS. Has anyone investigated this? What is everyone else doing in regard to things like this? It seems a little error-prone to have a cron job run every day adding new partitions. It might not even be possible to do dynamic partitioning, since it's a metadata read. But I'd love to hear thoughts.

-Chris



Re: Dynamic Partitioning?

Posted by Frederick Oko <fr...@gmail.com>.
Actually, this is what Hive originally did -- it used to trust
partitions it discovered via HDFS. This blind trust could be leveraged
for just what you are requesting, as partitions do follow a simple
directory scheme (and there is precedent for such out-of-band data
loading). However, it became incompatible with the extended feature set
of external tables and per-partition schemas introduced earlier this
year. Re-enabling this behavior via configuration is currently tracked
as https://issues.apache.org/jira/browse/HIVE-493, 'automatically infer
existing partitions of table from HDFS files'.
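The inference in question amounts to parsing the key=value directory
scheme back into partition specs. A toy sketch of the idea in Python
(not Hive's actual code; the function name and error handling are
illustrative):

```python
def infer_partition(rel_path):
    """Parse a table-relative directory path like 'dt=2009-08-11/hr=12'
    into an ordered partition spec, the way the key=value directory
    scheme encodes partition columns."""
    spec = []
    for part in rel_path.strip("/").split("/"):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("not a partition directory: %r" % part)
        spec.append((key, value))
    return spec

print(infer_partition("dt=2009-08-11/hr=12"))
# [('dt', '2009-08-11'), ('hr', '12')]
```

The incompatibility is that external tables and per-partition schemas
carry metadata that cannot be recovered from the path alone, which is
why the blind-trust behavior had to be turned off by default.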

On Tue, Aug 11, 2009 at 11:15 AM, Chris Goffinet <go...@digg.com> wrote:

> Hi
>
> I was wondering if anyone has thought about the possibility of having
> dynamic partitioning in Hive? Right now you typically use LOAD DATA
> or ALTER TABLE to add new partitions. It would be great if
> applications like Scribe, which can load data into HDFS, could just
> place the data into the correct folder structure for your partitions
> on HDFS. Has anyone investigated this? What is everyone else doing in
> regard to things like this? It seems a little error-prone to have a
> cron job run every day adding new partitions. It might not even be
> possible to do dynamic partitioning, since it's a metadata read. But
> I'd love to hear thoughts.
>
> -Chris
>