Posted to user@hive.apache.org by Johan Oskarsson <jo...@oskarsson.nu> on 2008/11/28 12:48:44 UTC

External tables and existing directory structure

Hi, I've just had some fun with Hive. Exciting stuff.

I have one question about mapping tables to our existing directory
structure. I assume "CREATE EXTERNAL TABLE" would be the way to go,
but I haven't been able to find much information about how it works.

We have datasets in the following format in HDFS:
/dataset/yyyy/MM/dd/<one or more files>

I'd love to be able to bind these to Hive tables, with the date as the
partition, without copying or moving the data. Is this currently possible?
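
Roughly what I'm hoping for, as a sketch (table and column names are
made up):

   CREATE EXTERNAL TABLE dataset (line STRING)
   PARTITIONED BY (year STRING, month STRING, day STRING)
   LOCATION '/dataset';
   -- hoped-for behaviour: the files under /dataset/2008/11/28/
   -- would map to the partition (year='2008', month='11', day='28')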

/Johan

Re: External tables and existing directory structure

Posted by Johan Oskarsson <jo...@oskarsson.nu>.
Thanks for the answer.

I like the idea of having a more flexible way of specifying how
partitions map to the directory structure.
I'll see if I have some time to look at this; in the meantime I've
filed a ticket for it: https://issues.apache.org/jira/browse/HIVE-91

I had a quick look at HIVE-86 (don't delete data) but I'm not quite
sure what each component is doing.
Is there an updated version of this wiki page anywhere? 
http://wiki.apache.org/hadoop/Hive/DeveloperGuide

If not, could someone explain what HiveMetaStore* does compared to
MetaStore*? Is one a newer version and the other older?
And what does FileStore do compared to the above? Does it store the
metadata in files instead of a SQL database?

There seem to be two different drop table methods, as far as I can see.
Are both used?
RWTable.drop()
HiveMetaStore.drop_table()
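
For reference, the behaviour I'd expect from HIVE-86, sketched with a
made-up external table:

   CREATE EXTERNAL TABLE dataset (line STRING) LOCATION '/dataset';
   DROP TABLE dataset;
   -- expectation: only the metastore entry is removed;
   -- /dataset and the files under it stay untouched in HDFS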

/Johan



RE: External tables and existing directory structure

Posted by Joydeep Sen Sarma <js...@facebook.com>.
That's one possibility.

Or we could have a 'format' spec in the create table command for how the directories are named. By default it would be '%key=%value', but in this case it would be '%value'. This might make it more flexible if we encounter other kinds of directory layouts.
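
To sketch it (the clause name is made up; nothing like this exists yet):

   CREATE EXTERNAL TABLE dataset (line STRING)
   PARTITIONED BY (year STRING, month STRING, day STRING)
   PARTITION FORMAT '%value'  -- hypothetical; default '%key=%value'
   LOCATION '/dataset';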

Thoughts?

(Just remembered that there's probably an unfiled issue that drop table should not be deleting directories for external tables - but it probably does right now.)


Re: External tables and existing directory structure

Posted by Josh Ferguson <jo...@besquared.net>.
I think this is a pretty common scenario, as this is how I was storing
my data as well. Would this affect the HiveQL create table statement at
all, or just implicitly require that it be ordered?

Josh



RE: External tables and existing directory structure

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Hi Johan,

Creating an external table with the 'location' clause set to your data would be the way to go. However, Hive has its own directory naming scheme for partitions ('<partition_key>=<partition_val>'), so just pointing at a directory with subdirectories would not work.

So right now, in this case, one would have to move or copy the data using the load command.
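
E.g. something like this per day (table name and partition columns made
up):

   LOAD DATA INPATH '/dataset/2008/11/28'
   INTO TABLE dataset
   PARTITION (year='2008', month='11', day='28');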

Going forward, one thing we could do for external tables is drop the 'key=val' directory naming for partitioned data and just assume that the directory hierarchy follows the partition list and that the directory names are the partition values. Is that what's required in this case?
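
Concretely, the difference would be (paths illustrative):

   /dataset/year=2008/month=11/day=28/<files>   (what Hive expects today)
   /dataset/2008/11/28/<files>                  (what the proposal would allow)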

Joydeep

