Posted to user@hive.apache.org by abhiTowson cal <ab...@gmail.com> on 2012/07/24 03:54:36 UTC

Hive query optimization

Hi all,

Some queries in Hive are taking too long to execute, so I have overridden
some Hive parameters. For some queries, performance improved sharply
after I overrode these properties; for others, there was no change in
performance. Can anyone suggest other optimizations in Hive, apart from
partitions and buckets?

set io.sort.mb=512;
set io.sort.factor=100;
set mapred.reduce.parallel.copies=40;
set hive.map.aggr=true;
set hive.exec.parallel=true;
set hive.groupby.skewindata=true;
set mapred.job.reuse.jvm.num.tasks=-1;

The default values were:

io.sort.mb=256;
io.sort.factor=10;
mapred.reduce.parallel.copies=10;
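
One way to confirm what a session is actually using, offered here as a sketch rather than something stated in the thread: in the Hive CLI, `set` followed by a property name and no value prints that property's current value. Running it before and after an override shows whether the change took effect.

```sql
-- Print the current value of each property (no "=value" part).
set io.sort.mb;
set io.sort.factor;
set mapred.reduce.parallel.copies;

-- Override one property for this session, then print it again to confirm.
set io.sort.mb=512;
set io.sort.mb;
```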

Thanks
Abhishek

Re: Hive query optimization

Posted by Abhishek <ab...@gmail.com>.

Hi Tatarinov,

Thanks for the reply. If I understand correctly, do you mean I should set the number of reduce tasks equal to the number of reduce slots in the cluster?

Regards
Abhi



On Jul 24, 2012, at 12:51 AM, Igor Tatarinov <ig...@decide.com> wrote:

> Here is my 2 cents.
> The parameters you are looking at are quite specific. Unless you know what you are doing, it might be hard to set them exactly right, and they shouldn't make that much of a difference (again, unless you know the specifics).
> 
> What worked for me is using a single "wave" of reducers. Basically, you want to set the number of reduce tasks to be equal to the number of reduce slots (assuming your job will run by itself).
> 
> It might also help to re-arrange your joins so that the larger table is streamed (https://cwiki.apache.org/Hive/languagemanual-joins.html).
> That seems especially important with map joins since those fail if there is not enough memory and have to be rerun as regular joins.
> 
> Hope this helps.

Re: Hive query optimization

Posted by Igor Tatarinov <ig...@decide.com>.

Here is my 2 cents.
The parameters you are looking at are quite specific. Unless you know what
you are doing, it might be hard to set them exactly right, and they shouldn't
make that much of a difference (again, unless you know the specifics).

What worked for me is using a single "wave" of reducers. Basically, you
want to set the number of reduce tasks to be equal to the number of reduce
slots (assuming your job will run by itself).
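
As a rough sketch of the single-wave idea: the reducer count can be pinned per session with `mapred.reduce.tasks`. The cluster size below (10 worker nodes with 4 reduce slots each) is a made-up example, not a figure from this thread; substitute your own cluster's total.

```sql
-- Hypothetical cluster: 10 worker nodes x 4 reduce slots each = 40 reduce slots.
-- Setting the reducer count to 40 lets every reducer start in the same wave,
-- assuming no other job is competing for the slots.
set mapred.reduce.tasks=40;
```

If the property is left at Hive's default of -1, Hive estimates the reducer count from the input size instead, which may or may not fit in one wave.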

It might also help to re-arrange your joins so that the larger table is
streamed (https://cwiki.apache.org/Hive/languagemanual-joins.html).
That seems especially important with map joins since those fail if there is
not enough memory and have to be rerun as regular joins.
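
A sketch of what re-arranging for streaming can look like, based on the join page linked above: Hive buffers the earlier tables of a join in the reducers and streams the last one, so the largest table is usually placed last or named in a `STREAMTABLE` hint. The table and column names (`orders`, `customers`, `customer_id`, `name`, `total`) are hypothetical, with `orders` assumed to be the large table.

```sql
-- Largest table last: customers is buffered, orders is streamed.
SELECT c.name, o.total
FROM customers c
JOIN orders o ON (c.customer_id = o.customer_id);

-- Same intent with an explicit hint, regardless of join order.
SELECT /*+ STREAMTABLE(o) */ c.name, o.total
FROM orders o
JOIN customers c ON (o.customer_id = c.customer_id);

-- Map join hint: the small table (customers) has to fit in memory.
SELECT /*+ MAPJOIN(c) */ c.name, o.total
FROM orders o
JOIN customers c ON (o.customer_id = c.customer_id);
```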

Hope this helps.
