Posted to user@hive.apache.org by Pradeep Gollakota <pr...@gmail.com> on 2015/06/11 20:56:16 UTC

Very slow dynamic partition load

Hi All,

I have a table that is partitioned on two columns (customer, date). I'm
loading some data into the table using a Hive query. The MapReduce job
completed within a few minutes and now needs to "commit" the data to the
appropriate partitions. About 32000 partitions were generated. The
commit phase has been running for almost 16 hours and has not finished yet.
I've been monitoring jmap and don't believe it's a memory or GC issue.
I've also been looking at jstack and am not sure why it's so slow. I'm not
sure what the problem is, but it seems to be a Hive performance issue when it
comes to "highly partitioned" tables.

Any thoughts on this issue would be greatly appreciated.

Thanks in advance,
Pradeep

Re: Very slow dynamic partition load

Posted by Pradeep Gollakota <pr...@gmail.com>.
I actually decided to remove one of my two partition columns and make it a
bucketing column instead. The same query completed fully in under 10 minutes
with 92 partitions added. This will suffice for me for now.
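
To make the effect of that change concrete, here is a rough Python sketch that
counts how many partitions each layout has to register. Everything in it is an
assumption for illustration: the post does not say which column was kept as the
partition column or how many buckets were used, so the sketch keeps the date
column, picks 32 buckets, and invents 350 customers over 92 days. The crc32
call is only a stand-in for "hash of the column modulo the bucket count"; it is
not Hive's hash function.

```python
import zlib
from itertools import product

# Hypothetical data: 350 customers, each with rows on 92 days.
customers = [f"cust{i:04d}" for i in range(350)]
days = [f"day{i:02d}" for i in range(92)]
rows = list(product(customers, days))

NUM_BUCKETS = 32  # assumed; the post does not give a bucket count


def bucket_of(customer: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in for bucket assignment: hash of the column modulo bucket count.

    crc32 is used only because it is stable across runs; it is not Hive's hash.
    """
    return zlib.crc32(customer.encode("utf-8")) % num_buckets


# Layout A: partitioned on (customer, day), one partition per distinct pair.
partitions_two_cols = {(c, d) for c, d in rows}

# Layout B: partitioned on day only, customer bucketed inside each partition.
partitions_one_col = {d for _, d in rows}
buckets_used = {bucket_of(c) for c, _ in rows}

print(len(partitions_two_cols))  # partitions to register in layout A: 32200
print(len(partitions_one_col))   # partitions to register in layout B: 92
print(len(buckets_used))         # never more than NUM_BUCKETS
```

The number of partitions to register drops from one per (customer, day) pair to
one per day, while the customer values are spread over a fixed number of
buckets inside each partition.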


Re: Very slow dynamic partition load

Posted by Pradeep Gollakota <pr...@gmail.com>.
Hmm... did your performance increase with the patch you supplied? I do need
the partitions in Hive, but I have a separate tool that can add partitions
to the metastore and is definitely much faster than this. I just checked my
job again: the actual Hive job completed 24 hours ago and has been adding
the dynamic partitions to the metastore since then, and it is still not
done. According to the metastore, there are only 10830 partitions added so
far. At this pace, it will take approximately 2 more days for it to
complete.
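
The "2 more days" figure can be checked with a quick linear extrapolation from
the numbers quoted above. This is only a back-of-the-envelope sketch: it
assumes the add rate stays constant, which the thread does not confirm, and the
function name is made up for the example.

```python
# Figures quoted in the thread.
total_partitions = 32000  # "about 32000 partitions generated"
added_so_far = 10830      # partitions reported by the metastore
elapsed_hours = 24.0      # time since the Hive job itself completed


def remaining_hours(total: int, done: int, elapsed: float) -> float:
    """Linear extrapolation; assumes the add rate stays constant."""
    rate_per_hour = done / elapsed
    return (total - done) / rate_per_hour


rate = added_so_far / elapsed_hours
seconds_per_partition = elapsed_hours * 3600 / added_so_far
eta = remaining_hours(total_partitions, added_so_far, elapsed_hours)

print(f"rate: {rate:.0f} partitions/hour")
print(f"cost: {seconds_per_partition:.1f} s per partition")
print(f"remaining: {eta:.1f} hours ({eta / 24:.1f} days)")
```

That works out to roughly 8 seconds per partition added and a little under 47
hours still to go, which matches the estimate of about 2 more days.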


Re: Very slow dynamic partition load

Posted by Slava Markeyev <sl...@upsight.com>.
This is something that a few of us have run into. I think the bottleneck is
in the partition creation calls to the metastore. My workaround was
HIVE-10385, which optionally removed partition creation in the metastore,
but this isn't a solution for everyone. If you don't require actual
partitions in the table but simply partitioned data in HDFS, give it a shot.
It may be worthwhile looking into optimizations for this use case.

-Slava
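
For readers unsure what "partitioned data in HDFS" without metastore partitions
looks like, here is a small Python sketch of the directory layout Hive uses for
partitions: one key=value directory level per partition column. The table
location and the values are hypothetical, and the escaping Hive applies to
special characters in partition values is left out.

```python
from posixpath import join

TABLE_LOCATION = "/warehouse/events"  # hypothetical table directory


def partition_dir(table_location: str, spec: list) -> str:
    """Build a Hive-style partition directory path.

    spec is a list of (column, value) pairs in partition-column order.
    Escaping of special characters in values is omitted in this sketch.
    """
    levels = [f"{key}={value}" for key, value in spec]
    return join(table_location, *levels)


path = partition_dir(
    TABLE_LOCATION,
    [("customer", "acme"), ("date", "2015-06-11")],
)
print(path)  # /warehouse/events/customer=acme/date=2015-06-11
```

The files can sit under such directories whether or not the matching partition
has been registered in the metastore; the slow step discussed in this thread is
the registration, not the creation of the directories.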




-- 

Slava Markeyev | Engineering | Upsight

Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>