You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Raj Hadoop <ha...@yahoo.com> on 2013/10/31 23:34:45 UTC

External Partition Table

Hi,

I am planning for a Hive External Partition Table based on a date.

Which one of the below yields a better performance or both have the same performance?

1) Partition based on one folder per day
LIKE date INT
2) Partition based on one folder per year / month / day ( So it has three folders) 
LIKE year INT, month INT, day INT

Thanks,
Raj

Re: External Partition Table

Posted by Raj Hadoop <ha...@yahoo.com>.

Thanks Tim. I am using a String column for the partition column. 

On Thursday, October 31, 2013 6:49 PM, Timothy Potter <th...@gmail.com> wrote:

Hi Raj,
This seems like a matter of style vs. any performance benefit / cost ... if you're going to do a lot of queries just based on month or year, then #2 might be easier, e.g.

select * from foo where year = 2013 seems a little cleaner than select * from foo where date >= 20130101 and date <= 20131231 (not sure how you're encoding dates into a INT but I think you get the idea)

I do something similar but my partition fields are strings, like 2013-10-31_0000 (which has the nice property of lexically sorting the same as numeric sort).

I'm assuming they will both have the same performance because Hive is still selecting the same number of input paths in both scenarios, one just happens to be a little deeper.

Cheers,
Tim

On Thu, Oct 31, 2013 at 4:34 PM, Raj Hadoop <ha...@yahoo.com> wrote:

Hi,
>
>
>I am planning for a Hive External Partition Table based on a date.
>
>
>Which one of the below yields a better performance or both have the same performance?
>
>
>1) Partition based on one folder per day
>LIKE date INT
>2) Partition based on one folder per year / month / day ( So it has three folders) 
>LIKE year INT, month INT, day INT
>
>
>Thanks,
>Raj
>
>

Re: External Partition Table

Posted by Timothy Potter <th...@gmail.com>.

Hi Raj,

This seems like a matter of style vs. any performance benefit / cost ... if
you're going to do a lot of queries just based on month or year, then #2
might be easier, e.g.

select * from foo where year = 2013 seems a little cleaner than select *
from foo where date >= 20130101 and date <= 20131231 (not sure how you're
encoding dates into a INT but I think you get the idea)

I do something similar but my partition fields are strings, like
2013-10-31_0000 (which has the nice property of lexically sorting the same
as numeric sort).

I'm assuming they will both have the same performance because Hive is still
selecting the same number of input paths in both scenarios, one just
happens to be a little deeper.

Cheers,
Tim

On Thu, Oct 31, 2013 at 4:34 PM, Raj Hadoop <ha...@yahoo.com> wrote:

> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
> Thanks,
> Raj
>
>

Re: External Partition Table

Posted by Brad Ruderman <br...@radiumone.com>.

Personally from my limited understanding of your requirements, I would
think partitioned by day would be fine. Perhaps use the "YYYYMMDD" method
so partition for today would be 20131031 and tomorrow would be 20131101

Thanks,
Brad


On Thu, Oct 31, 2013 at 3:42 PM, Raj Hadoop <ha...@yahoo.com> wrote:

> Hi Brad,
>
> Thanks for the quick response.
>
> I have about 10 GB file per day (web logs). And I am creating a
> folder(partition) per each day. Is it something uncommon ?
>
> I do not know at this juncture what kind of queries I would be executing
> upon on this table. But just wanted to know whether this is something
> normal or not at all a normal thing.
>
> Thanks,
> Raj
>
>
>   On Thursday, October 31, 2013 6:39 PM, Brad Ruderman <
> bruderman@radiumone.com> wrote:
>  Wow that question won't be answerable. It all depends on the amount of
> data per partition and the queries you are going to be executing on it, as
> well as the structure of the data. In general in hive (depending on your
> cluster size) you need to balance the number of files with the size,
> smaller number of files is typically preferred but partitions will help
> when date restricting.
>
> Thx,
> Brad
>
>
> On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop <ha...@yahoo.com> wrote:
>
> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
>  Thanks,
> Raj
>
>
>
>
>

Re: External Partition Table

Posted by Raj Hadoop <ha...@yahoo.com>.

Hi Brad,

Thanks for the quick response.

I have about 10 GB file per day (web logs). And I am creating a folder(partition) per each day. Is it something uncommon ?

I do not know at this juncture what kind of queries I would be executing upon on this table. But just wanted to know whether this is something normal or not at all a normal thing.

Thanks,
Raj

On Thursday, October 31, 2013 6:39 PM, Brad Ruderman <br...@radiumone.com> wrote:

Wow that question won't be answerable. It all depends on the amount of data per partition and the queries you are going to be executing on it, as well as the structure of the data. In general in hive (depending on your cluster size) you need to balance the number of files with the size, smaller number of files is typically preferred but partitions will help when date restricting.

Thx,
Brad

On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop <ha...@yahoo.com> wrote:

Hi,
>
>
>I am planning for a Hive External Partition Table based on a date.
>
>
>Which one of the below yields a better performance or both have the same performance?
>
>
>1) Partition based on one folder per day
>LIKE date INT
>2) Partition based on one folder per year / month / day ( So it has three folders) 
>LIKE year INT, month INT, day INT
>
>
>Thanks,
>Raj
>
>

Re: External Partition Table

Posted by Brad Ruderman <br...@radiumone.com>.

Wow that question won't be answerable. It all depends on the amount of data
per partition and the queries you are going to be executing on it, as well
as the structure of the data. In general in hive (depending on your cluster
size) you need to balance the number of files with the size, smaller number
of files is typically preferred but partitions will help when date
restricting.

Thx,
Brad

On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop <ha...@yahoo.com> wrote:

> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
> Thanks,
> Raj
>
>