Posted to user@hive.apache.org by Tianqi Tong <tt...@brightedge.com> on 2015/04/15 23:55:44 UTC

Extremely Slow Data Loading with 40k+ Partitions

Hi,
I'm loading data to a Parquet table with dynamic partitions. I have 40k+ partitions, and I have skipped the partition stats computation step.
Somehow it's still extremely slow loading data into the partitions (about 800 MB/h).
Do you have any hints on the possible cause and solution?

Thank you
Tianqi Tong
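
For reference, a dynamic-partition load of this shape would typically be set up along the following lines; hive.stats.autogather=false corresponds to the "skipped partition stats computation" above, the table name and partition columns match the log line quoted later in the thread, and staging_table with its columns is an illustrative placeholder, not from the original post:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    -- raise the defaults so a 40k-partition load is not rejected
    SET hive.exec.max.dynamic.partitions=50000;
    SET hive.exec.max.dynamic.partitions.pernode=50000;
    -- skip the partition stats computation step
    SET hive.stats.autogather=false;

    -- staging_table and the url/score columns are hypothetical
    INSERT OVERWRITE TABLE parquet_table_with_40k_partitions
    PARTITION (yearmonth, prefix)
    SELECT url, score, yearmonth, prefix
    FROM staging_table;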


Re: Extremely Slow Data Loading with 40k+ Partitions

Posted by Daniel Haviv <da...@veracity-group.com>.
Is this a test environment?
If so, can you try disabling concurrency?


Daniel
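
Disabling concurrency in Hive means turning off the lock manager; a minimal sketch, assuming it is safe to run without locking in this environment. One reason this can matter here: with a ZooKeeper-based lock manager enabled, Hive may acquire a lock per touched partition, which gets expensive at 40k+ partitions.

    SET hive.support.concurrency=false;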

> On 16 Apr 2015, at 19:44, Tianqi Tong <tt...@brightedge.com> wrote:
> 
> Hi Daniel,
> Actually the MapReduce job ran just fine, but the process got stuck on the data loading step after that.
> The output stopped at:
> Loading data to table default.parquet_table_with_40k_partitions partition (yearmonth=null, prefix=null)
>  
> When I look at the size of the table's HDFS files, I can see it is growing, but only slowly.
> For the MapReduce job, I had 400+ mappers and 100+ reducers.
>  
> Thanks
> Tianqi
>  
> From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com] 
> Sent: Wednesday, April 15, 2015 9:23 PM
> To: user@hive.apache.org
> Subject: Re: Extremely Slow Data Loading with 40k+ Partitions
>  
> How many reducers are you using?
> 
> Daniel
> 
> On 16 Apr 2015, at 00:55, Tianqi Tong <tt...@brightedge.com> wrote:
> 
> Hi,
> I'm loading data to a Parquet table with dynamic partitions. I have 40k+ partitions, and I have skipped the partition stats computation step.
> Somehow it's still extremely slow loading data into the partitions (about 800 MB/h).
> Do you have any hints on the possible cause and solution?
>  
> Thank you
> Tianqi Tong
>  

RE: Extremely Slow Data Loading with 40k+ Partitions

Posted by Tianqi Tong <tt...@brightedge.com>.
Hi Daniel,
Actually the MapReduce job ran just fine, but the process got stuck on the data loading step after that.
The output stopped at:
Loading data to table default.parquet_table_with_40k_partitions partition (yearmonth=null, prefix=null)

When I look at the size of the table's HDFS files, I can see it is growing, but only slowly.
For the MapReduce job, I had 400+ mappers and 100+ reducers.

Thanks
Tianqi
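
The post-job phase described here is typically Hive moving each partition's files into place and registering all 40k+ partitions in the metastore one at a time, which is why HDFS usage grows only slowly. Progress can be watched from inside the Hive CLI; a sketch, assuming the table sits under the default warehouse directory:

    -- total bytes written so far (path assumes the default warehouse location)
    dfs -du -s -h /user/hive/warehouse/parquet_table_with_40k_partitions;
    -- partitions registered in the metastore so far
    SHOW PARTITIONS parquet_table_with_40k_partitions;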

From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com]
Sent: Wednesday, April 15, 2015 9:23 PM
To: user@hive.apache.org
Subject: Re: Extremely Slow Data Loading with 40k+ Partitions

How many reducers are you using?

Daniel

On 16 Apr 2015, at 00:55, Tianqi Tong <tt...@brightedge.com> wrote:
Hi,
I'm loading data to a Parquet table with dynamic partitions. I have 40k+ partitions, and I have skipped the partition stats computation step.
Somehow it's still extremely slow loading data into the partitions (about 800 MB/h).
Do you have any hints on the possible cause and solution?

Thank you
Tianqi Tong


Re: Extremely Slow Data Loading with 40k+ Partitions

Posted by Daniel Haviv <da...@veracity-group.com>.
How many reducers are you using?

Daniel
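
For context, the reducer count Hive picks is derived from input size unless it is set explicitly; a sketch of the knobs involved, with illustrative values and Hive-on-MapReduce era property names:

    SET hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes per reducer
    SET hive.exec.reducers.max=200;                      -- cap on the auto-computed count
    SET mapred.reduce.tasks=100;                         -- or force an exact count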

> On 16 Apr 2015, at 00:55, Tianqi Tong <tt...@brightedge.com> wrote:
> 
> Hi,
> I'm loading data to a Parquet table with dynamic partitions. I have 40k+ partitions, and I have skipped the partition stats computation step.
> Somehow it's still extremely slow loading data into the partitions (about 800 MB/h).
> Do you have any hints on the possible cause and solution?
>  
> Thank you
> Tianqi Tong
>