Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2014/12/02 18:25:36 UTC

Files Per Partition Causing Slowness

I am running Hive 0.12 in production. I have a table that has 1100
partitions (flat, no multi-level partitions). Some of those partitions
have a small number of files (5-10) and others have quite a few (up to
120). The total table size is not "huge", around 285 GB.

While this is not terrible to my eyes, when I run a query over many
partitions, say all 1100, the time from query start to the time the query
is submitted to the jobtracker is horribly slow. For example, it can take
up to 3.5 minutes just to get to the point where the job is seen in the
jobtracker.
Is the number of files here what's hurting me? Is there some sort of
per-file enumeration going on under the hood in Hive? I ran Hive with
debug mode on and saw lots of file calls for each individual file. I guess
I am curious whether others out there with similar tables see a query like
that take a horribly long time as well. Is this "normal", or am I seeing
issues here?
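
To check how unevenly files are spread across partitions, a recursive
directory listing can be tallied per partition. The following is a minimal
sketch in Python, assuming the table uses the flat key=value directory
layout (for example .../mytable/day=2014-12-01/000000_0). The function
name, the warehouse paths, and the partition key "day" are hypothetical
and are not taken from the thread.

```python
from collections import Counter
import posixpath


def files_per_partition(paths):
    """Count data files under each partition directory.

    Assumes a flat key=value layout, so the partition is the name of the
    directory that directly contains each file.
    """
    counts = Counter()
    for path in paths:
        parent = posixpath.basename(posixpath.dirname(path))
        if "=" in parent:  # ignore files that are not inside a partition directory
            counts[parent] += 1
    return counts


# Hypothetical file paths, standing in for a recursive listing of the table.
sample_paths = [
    "/warehouse/mytable/day=2014-12-01/000000_0",
    "/warehouse/mytable/day=2014-12-01/000001_0",
    "/warehouse/mytable/day=2014-12-02/000000_0",
]

counts = files_per_partition(sample_paths)
total_files = sum(counts.values())

# Show the partitions holding the most files first.
for partition, n in counts.most_common():
    print(partition, n)
print("total files:", total_files)
```

Sorting by count makes it easy to spot the partitions that hold far more
files than the rest.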

RE: Files Per Partition Causing Slowness

Posted by Mike Roberts <mi...@spyfu.com>.

unsubscribe


Fwd: Files Per Partition Causing Slowness

Posted by John Omernik <jo...@omernik.com>.

---------- Forwarded message ----------
From: John Omernik <jo...@omernik.com>
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org


Thank you Edward. I knew the number of partitions mattered, but I
didn't think 1000 would be too much. However, I didn't realize the
number of files per partition was also a factor prior to job submission.
I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of
data, how would I set up bucketing to keep my partition count lower? I am
struggling to find a way to approach this problem.
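
One possible direction, sketched here in Python and not something the
thread itself prescribes, is to partition on a coarser key such as the
month and keep the day as an ordinary column. The sketch only compares
the number of distinct partition values for three years of data. The
start date and the helper name month_key are hypothetical.

```python
from datetime import date, timedelta


def month_key(day):
    """Map a calendar day to a coarser month partition value, e.g. '2014-12'."""
    return day.strftime("%Y-%m")


# Hypothetical three-year range of daily data (3 * 365 days).
start = date(2012, 1, 1)
days = [start + timedelta(days=i) for i in range(3 * 365)]

daily_partitions = {d.isoformat() for d in days}
monthly_partitions = {month_key(d) for d in days}

print("daily partitions:", len(daily_partitions))
print("monthly partitions:", len(monthly_partitions))
```

The trade-off is that a query for a single day then has to read a whole
month's partition and filter on the day column.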


Thanks!

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik <jo...@omernik.com> wrote:
>
> Thank you Edward. I knew the number of partitions mattered, and knew I was getting high; however, I didn't realize the number of files per partition was also a factor prior to job submission.
>
> Thanks!
>
> John

Re: Files Per Partition Causing Slowness

Posted by Edward Capriolo <ed...@gmail.com>.

This is discussed in the Programming Hive book. The more files, the longer
it takes the job tracker to plan the job. The more tasks, the more things
the job tracker has to track. The more partitions, the more metastore
lookups are required. All of these things limit throughput. I do not like
tables with more than 100 partitions; above that I would switch to
bucketing or some other mechanism (application-level partitioning).
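
For a rough sense of scale, the figures quoted in the original question
can be turned into bounds. This is only back-of-the-envelope arithmetic in
Python, assuming every partition holds between 5 and 120 files and taking
1 GB as 1024 MB. The real distribution of files is not given in the
thread.

```python
# Figures quoted in the original question.
PARTITIONS = 1100
MIN_FILES_PER_PARTITION = 5
MAX_FILES_PER_PARTITION = 120
TABLE_SIZE_GB = 285
PRE_SUBMIT_SECONDS = 3.5 * 60

# Bounds on the total file count (assumption: 5 to 120 files in every partition).
min_files = PARTITIONS * MIN_FILES_PER_PARTITION
max_files = PARTITIONS * MAX_FILES_PER_PARTITION

# Average file size at each bound, in MB (assumption: 1 GB = 1024 MB).
table_size_mb = TABLE_SIZE_GB * 1024
avg_mb_if_few_files = table_size_mb / min_files
avg_mb_if_many_files = table_size_mb / max_files

# Time spent before job submission, spread evenly over the partitions.
seconds_per_partition = PRE_SUBMIT_SECONDS / PARTITIONS

print("total files: between", min_files, "and", max_files)
print("average file size (MB): between",
      round(avg_mb_if_many_files, 1), "and", round(avg_mb_if_few_files, 1))
print("pre-submission seconds per partition:", round(seconds_per_partition, 2))
```

Even at the low end the table has thousands of files, and at the high end
the average file is only a few MB.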
