Posted to user@hive.apache.org by Saurabh Mishra <sa...@outlook.com> on 2012/10/15 14:09:54 UTC

Hive Query Unable to distribute load evenly in reducers

Hi,
I am running some Hive queries that join tables containing up to 30 million records each. Since the load on the reducers is very significant in these cases, I specifically set the following parameters before executing the queries:

set mapred.reduce.tasks=100;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.cp=true;

The job now spawns 160 reducers, but despite the high number most of the load remains on 1 or 2 reducers. In the final statistics, 158 reducers completed within 2-3 minutes of starting, while 2 reducers took 2 hours to run.
Is there any way to overcome this load distribution disparity?
Any help in this regard will be highly appreciated.

Sincerely
Saurabh Mishra
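
A note on the settings in this message, offered as a hedged sketch rather than a fix: in Hive, a positive mapred.reduce.tasks fixes the reducer count, while hive.exec.reducers.bytes.per.reducer is only consulted when mapred.reduce.tasks is -1. Neither setting changes which reducer a given key is sent to. All rows that share a join or group-by key go to the same reducer, so adding reducers cannot split one very heavy key. The hive.exec.reducers.max value below is illustrative and does not come from the thread.

```sql
-- Let Hive estimate the reducer count from the input size
-- instead of forcing a fixed number.
set mapred.reduce.tasks=-1;

-- Target amount of input per reducer (value taken from the post above).
set hive.exec.reducers.bytes.per.reducer=500000000;

-- Upper bound on the estimated reducer count (illustrative value).
set hive.exec.reducers.max=999;
```

If two reducers still run for hours with these settings, the cause is more likely the key distribution than the reducer count.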
 		 	   		  

Re: Hive Query Unable to distribute load evenly in reducers

Posted by Philip Tromans <ph...@gmail.com>.
I'm really not convinced that there's no skew in your data. Look at
the counters from the Hadoop TaskTracker pages, and thoroughly check
that the numbers of reducer input records / groups and output records
are all similar.

Phil.
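
The counters mentioned above (input records, input groups, output records) can also be approximated from the data itself. The following is a minimal sketch, assuming the join columns are tableA.x and tableB.y as in the query quoted elsewhere in this thread; the aliases and the LIMIT are illustrative. It estimates how many joined rows each key would produce, which is roughly what a single reducer has to emit for that key.

```sql
-- Estimate the join fan-out per key: rows in tableA times rows in tableB.
-- A few keys with a huge joined_rows value would explain a few slow reducers.
SELECT a.x            AS join_key,
       a.cnt          AS rows_a,
       b.cnt          AS rows_b,
       a.cnt * b.cnt  AS joined_rows
FROM (SELECT x, COUNT(*) AS cnt FROM tableA GROUP BY x) a
JOIN (SELECT y, COUNT(*) AS cnt FROM tableB GROUP BY y) b
  ON a.x = b.y
ORDER BY joined_rows DESC
LIMIT 20;
```

If the top few keys account for most of the joined rows, the reducer counters should show the same imbalance.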

On 18 October 2012 09:56, Saurabh Mishra <sa...@outlook.com> wrote:
> any views on the problem

RE: Hive Query Unable to distribute load evenly in reducers

Posted by Saurabh Mishra <sa...@outlook.com>.
Any views on the problem?

From: saurabhmishra.iitg@outlook.com
To: user@hive.apache.org; navis.ryu@nexr.com
Subject: RE: Hive Query Unable to distribute load evenly in reducers
Date: Tue, 16 Oct 2012 11:23:29 +0530




If by using MapJoin you mean setting
set hive.auto.convert.join=true;
then I am already using this configuration, but to no avail... :(


RE: Hive Query Unable to distribute load evenly in reducers

Posted by Saurabh Mishra <sa...@outlook.com>.
If by using MapJoin you mean setting
set hive.auto.convert.join=true;
then I am already using this configuration, but to no avail... :(

Date: Tue, 16 Oct 2012 14:17:47 +0900
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: navis.ryu@nexr.com
To: user@hive.apache.org

How about using MapJoin?


Re: Hive Query Unable to distribute load evenly in reducers

Posted by Navis류승우 <na...@nexr.com>.
How about using MapJoin?
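
A minimal sketch of what an explicit map join could look like here, assuming the tableA/tableB names and the a.x = b.y condition from the query quoted in this thread; the selected columns are illustrative. The hint asks Hive to load the small table into memory on every mapper, so the join itself needs no shuffle and a heavy key cannot pile up on one reducer. Only the GROUP BY still uses reducers.

```sql
-- Ask Hive to broadcast the small table (alias b) to the mappers.
SELECT /*+ MAPJOIN(b) */
       a.x,
       b.y,
       COUNT(*) AS row_count
FROM tableA a
JOIN tableB b
  ON a.x = b.y
GROUP BY a.x, b.y;
```

With hive.auto.convert.join=true the conversion depends on a small-table size threshold (hive.mapjoin.smalltable.filesize in many releases, though the name has varied), so it may be worth confirming with EXPLAIN that a map join is actually planned. On some later Hive releases the hint is ignored unless hive.ignore.mapjoin.hint is set to false.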

2012/10/16 Saurabh Mishra <sa...@outlook.com>

> No, there is apparently no heavy skew. Another statistic I wanted to
> point out: the following are the approximate table sizes in this
> 4-table join query:
> tableA: 170 million (actual number; I am also exploding these records,
> so the number could be much higher)
> tableB: 15
> tableC: 45
> tableD: 45
> tableE: 45
> tableF: 14000
>
> Also, I cannot put any filter condition on tableA; the situation does
> not permit it. :(
> Kindly suggest some alternative solution or some Hive configuration to
> distribute the load better across the reducers.

RE: Hive Query Unable to distribute load evenly in reducers

Posted by Saurabh Mishra <sa...@outlook.com>.
No, there is apparently no heavy skew. Another statistic I wanted to point out: the following are the approximate table sizes in this 4-table join query:
tableA: 170 million (actual number; I am also exploding these records, so the number could be much higher)
tableB: 15
tableC: 45
tableD: 45
tableE: 45
tableF: 14000

Also, I cannot put any filter condition on tableA; the situation does not permit it. :(
Kindly suggest some alternative solution or some Hive configuration to distribute the load better across the reducers.
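
On the request for a Hive configuration: the settings below are the ones usually tried when a few keys dominate a join or a GROUP BY. This is a sketch under the assumption that skew is in fact the cause, which the thread has not established, and the threshold value is only the commonly documented default. Whether the settings help depends on the Hive version and on the query plan.

```sql
-- Process heavily repeated join keys in a separate follow-up map-join job.
set hive.optimize.skewjoin=true;

-- Number of rows per join key above which the key is treated as skewed.
set hive.skewjoin.key=100000;

-- Run GROUP BY in two stages: first spread rows randomly across reducers
-- for partial aggregation, then aggregate by the real key.
set hive.groupby.skewindata=true;

-- Partially aggregate on the map side before the shuffle.
set hive.map.aggr=true;
```

Given that tableB through tableF hold between 15 and 14000 rows, a map join of the small tables may remove the skewed shuffle for the join entirely, leaving only the GROUP BY for the reducers.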

> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.tromans@gmail.com
> To: user@hive.apache.org
> 
> Is your data heavily skewed towards certain values of a.x etc?

Re: Hive Query Unable to distribute load evenly in reducers

Posted by Philip Tromans <ph...@gmail.com>.
Is your data heavily skewed towards certain values of a.x etc?
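
One way to answer this question directly is to count rows per join key on the large table. A minimal sketch, assuming the join column is tableA.x as in the query quoted below; the LIMIT is arbitrary.

```sql
-- Show the most frequent join keys in tableA.
-- NULL or placeholder values near the top are a common source of skew.
SELECT x,
       COUNT(*) AS row_count
FROM tableA
GROUP BY x
ORDER BY row_count DESC
LIMIT 20;
```

The same check can be repeated for each column used in a join condition or in the GROUP BY.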

On 15 October 2012 15:23, Saurabh Mishra <sa...@outlook.com> wrote:
> The queries are simple joins, something along the lines of
> select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... group
> by a, b, c;

RE: Hive Query Unable to distribute load evenly in reducers

Posted by Saurabh Mishra <sa...@outlook.com>.
The queries are simple joins, something along the lines of
select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... group by a, b, c;


> From: liy099@gmail.com
> Date: Mon, 15 Oct 2012 21:10:39 +0800
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> To: user@hive.apache.org
> 
> And your queries were?

Re: Hive Query Unable to distribute load evenly in reducers

Posted by MiaoMiao <li...@gmail.com>.
And your queries were?

On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
<sa...@outlook.com> wrote:
> Hi,
> I am running some Hive queries that join tables containing up to 30 million
> records each. Since the load on the reducers is very significant in these
> cases, I specifically set the following parameters before executing the
> queries:
>
> set mapred.reduce.tasks=100;
> set hive.exec.reducers.bytes.per.reducer=500000000;
> set hive.optimize.cp=true;
>
> The job now spawns 160 reducers, but despite the high number most of the
> load remains on 1 or 2 reducers. In the final statistics, 158 reducers
> completed within 2-3 minutes of starting, while 2 reducers took 2 hours
> to run.
> Is there any way to overcome this load distribution disparity?
> Any help in this regard will be highly appreciated.
>
> Sincerely
> Saurabh Mishra