Posted to user@tez.apache.org by David Ginzburg <da...@gmail.com> on 2015/05/24 22:34:41 UTC

Only one reducer when inserting into an ORC dynamically partitioned table

Hi,

I am running on a 10-node HDP 2.2 cluster,
using Tez and YARN.
The Hive version is 0.14.

I have a 90 million row table stored as a single 10 GB plain-text CSV file.

I am trying to insert it into an ORC partitioned table using the statement:

"insert overwrite table 2h2 partition (dt) select *,TIME_STAMP  from
2h_tmp;"

dt is the dynamic partition key.

Tez allocates only one reducer to the job, which results in a 6-hour run.

I expect about 120 partitions to be created.

How can I increase the number of reducers to speed up this job?

Is this related to https://issues.apache.org/jira/browse/HIVE-7158 ? It is
marked as resolved for Hive 0.14.

I am running with the default values:

hive.tez.auto.reducer.parallelism

    Default Value: false
    Added In: Hive 0.14.0 with HIVE-7158

hive.tez.max.partition.factor

    Default Value: 2
    Added In: Hive 0.14.0 with HIVE-7158

hive.tez.min.partition.factor

    Default Value: 0.25
    Added In: Hive 0.14.0 with HIVE-7158

and with:

hive.exec.dynamic.partition=true;
hive.exec.dynamic.partition.mode=nonstrict;

Re: Only one reducer when inserting into an ORC dynamically partitioned table

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

This is really a Hive question, and hopefully you can follow up on it on
the Hive user@ mailing list.

But since you're looking at Hive-on-Tez, this issue seems familiar to me.

> "insert overwrite table 2h2 partition (dt) select *,TIME_STAMP  from
>2h_tmp;"
> 
> Tez allocates only one reducer to the job which results in a 6 hour run.

That doesn't look like it needs a reducer in normal cases.

Is the destination table bucketed into 1 bucket?
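
One way to check this, as a rough sketch: DESCRIBE FORMATTED prints the table's storage details, including its bucket count and bucket columns. The table name is the one from the statement quoted above.

```sql
-- Look for the "Num Buckets" and "Bucket Columns" rows in the output.
describe formatted 2h2;
```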

> Is this related to https://issues.apache.org/jira/browse/HIVE-7158 , it
>is marked as resolved for hive 0.14

No, it is not.

This might be related to a feature that is turned off by default in HDP-2.2.

If you have >1 partition in the dynamic partitioned insert, the feature
you need is in HIVE-6455 + HIVE-6761.


set hive.optimize.sort.dynamic.partition=true;


This is off by default, since it slows down ETL where the destination is
exactly 1 partition.
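
For reference, a minimal sketch of how the session might look with this setting applied, reusing the statement and the two dynamic-partition settings quoted earlier in this thread. Whether it actually yields more than one reducer depends on the table definition and the Hive build, so treat it as something to try rather than a confirmed fix:

```sql
-- Dynamic partitioning settings already in use by the original poster.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The setting suggested above (HIVE-6455 + HIVE-6761); off by default.
set hive.optimize.sort.dynamic.partition=true;

-- The original statement, unchanged.
insert overwrite table 2h2 partition (dt) select *,TIME_STAMP from 2h_tmp;
```

HIVE-6455 sorts rows on the dynamic partition columns before they reach the file writers, so each writer task keeps only one ORC writer open at a time instead of one per partition.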

I keep updating the hive-testbench to do the right thing (because it does
both TPC-DS and TPC-H), so those settings might be of help:

https://github.com/hortonworks/hive-testbench/blob/hive14/settings/load-partitioned.sql#L10


Cheers,
Gopal