
Posted to user@hive.apache.org by Mayuran Yogarajah <ma...@casalemedia.com> on 2009/09/11 21:20:25 UTC

General design/schema question

We have our files in HDFS laid out by day like this:

2009-09-01/files
2009-09-02/files
2009-09-03/files

Loading this data into Hive would mean creating a new table per day!

I'm thinking this might be a common issue though, since others most likely
do batch processing on a daily/nightly basis.  Is there any way to have the
data in Hive without creating a new table per day?

thanks

Re: General design/schema question

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Sep 11, 2009 at 3:26 PM, Prasad Chakka <pc...@facebook.com> wrote:
> You should create a daily partition table. So you just need to create a new
> partition which is automatic if you use 'LOAD DATA... INTO TABLE ...
> PARTITION (ds='2009-09-01')'
>
> Prasad
>
I went with DAY/HOUR partitions; that way I can run hourly queries in a
short amount of time. We also have 5-minute logs, so each hour partition
holds 12 files per web server.
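
A minimal sketch of a DAY/HOUR layout like the one described above. The table name `hourly_logs`, the single `line` column, the partition key names `ds` and `hr`, and the HDFS path are hypothetical placeholders, not details from this thread.

```sql
-- Hypothetical table: one partition per day (ds) and hour (hr).
CREATE TABLE hourly_logs (line STRING)
PARTITIONED BY (ds STRING, hr STRING);

-- Load one hour of logs. The path is a placeholder for wherever
-- that hour's files sit in HDFS.
LOAD DATA INPATH '/logs/2009-09-01/13'
INTO TABLE hourly_logs PARTITION (ds='2009-09-01', hr='13');

-- Filtering on both partition keys lets Hive read only that hour's
-- directory instead of scanning the whole table.
SELECT COUNT(1)
FROM hourly_logs
WHERE ds = '2009-09-01' AND hr = '13';
```

The second partition key is what keeps hourly queries short: the finer the partition, the less data a filtered query has to touch.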

Re: General design/schema question

Posted by Prasad Chakka <pc...@facebook.com>.

A partitioned table has a set of partition keys. Check the wiki on how to create a partitioned table. In your case you can use one partition key named 'ds' (datestamp). You can choose any format for the values, but a commonly chosen one is 'YYYY-MM-DD'. You can specify the partition while loading data with '<TBL_NAME> PARTITION (ds="YYYY-MM-DD")', and Hive will load the data into the HDFS directory located at <table_directory>/ds=YYYY-MM-DD/.
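
A minimal sketch of the steps described above, assuming a hypothetical table `daily_logs` with a single placeholder column `line`, and assuming each day's files sit in a hypothetical HDFS directory such as /logs/2009-09-01:

```sql
-- One table for all days; 'ds' is the partition key, not a regular column.
CREATE TABLE daily_logs (line STRING)
PARTITIONED BY (ds STRING);

-- Loading into a partition creates that partition if it does not exist yet.
-- For an HDFS source path, LOAD DATA moves the files into the table's
-- directory, i.e. <table_directory>/ds=2009-09-01/.
LOAD DATA INPATH '/logs/2009-09-01'
INTO TABLE daily_logs PARTITION (ds='2009-09-01');
```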

But if you want to specify the full path, append 'LOCATION <your own location>' to the above.
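
One possible reading of the LOCATION suggestion, sketched under assumptions: the message does not say which statement the clause is appended to, so this uses ALTER TABLE ... ADD PARTITION on an external table, which registers an existing directory in place instead of moving files. The table name, column, and path are hypothetical.

```sql
-- External table: Hive tracks the data but does not own the files.
CREATE EXTERNAL TABLE daily_logs_ext (line STRING)
PARTITIONED BY (ds STRING);

-- Point the partition at the directory where that day's files already live.
-- No data is moved; the path below is a placeholder.
ALTER TABLE daily_logs_ext ADD PARTITION (ds='2009-09-01')
LOCATION '/logs/2009-09-01';
```

With this approach the existing per-day directory layout can stay as it is, at the cost of one ADD PARTITION statement per day.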

Thanks,
Prasad

________________________________
From: Mayuran Yogarajah <ma...@casalemedia.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Fri, 11 Sep 2009 13:16:27 -0700
To: <hi...@hadoop.apache.org>
Subject: Re: General design/schema question

Prasad Chakka wrote:
> You should create a daily partition table. So you just need to create
> a new partition which is automatic if you use 'LOAD DATA... INTO TABLE
> ... PARTITION (ds='2009-09-01')'
>
> Prasad
>
Just wanted to clarify: I still need to do LOAD DATA .. INTO TABLE ..
PARTITION (day='hdfs/path/to/day')
every night, correct? I was confused since you said it's automatic.  This
is actually great if it can work like this!

thanks again

Re: General design/schema question

Posted by Prasad Chakka <pc...@facebook.com>.

You should create a daily partition table. So you just need to create a new partition which is automatic if you use 'LOAD DATA... INTO TABLE ... PARTITION (ds='2009-09-01')'

Prasad
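
As a sketch of how this plays out night after night, assuming a hypothetical table `daily_logs` partitioned by `ds` and placeholder HDFS paths: each nightly LOAD DATA creates that day's partition, and a single query can then span several days.

```sql
-- Created once; every later day reuses the same table.
CREATE TABLE IF NOT EXISTS daily_logs (line STRING)
PARTITIONED BY (ds STRING);

-- One statement per night; the partition is created by the load itself.
LOAD DATA INPATH '/logs/2009-09-02'
INTO TABLE daily_logs PARTITION (ds='2009-09-02');
LOAD DATA INPATH '/logs/2009-09-03'
INTO TABLE daily_logs PARTITION (ds='2009-09-03');

-- List the partitions Hive knows about.
SHOW PARTITIONS daily_logs;

-- Query across days; the filter on ds limits which partitions are read.
SELECT ds, COUNT(1)
FROM daily_logs
WHERE ds >= '2009-09-02' AND ds <= '2009-09-03'
GROUP BY ds;
```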
