You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Abhishek <ab...@gmail.com> on 2012/09/29 05:56:16 UTC

Cartesian Product in HIVE

Hi all,

I have use case where we are doing Cartesian product of two tables with
One table with 
990k rows
Second table 
20k rows

Query is Cartesian product of just two columns. 

So it comes around 20 billion rows

For one hour it is processing like around 5 billion rows.

So the process takes around 4 hrs.

I have over riden some of the properties in hive

>> Set io.sort.mb=512
    Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false

How can optimize it to get better results.

Even though I have set reduce tasks to 17, only one reduce is spawned for the query . Did I do some thing wrong ??

My cluster configuration is having
20 slave nodes running cdh3u5.
With 240 map slots
        120 reduce slots 
 Block size is 128 mb
 Memory on the slave node is 96GB

How can the query perform better??

How can I increase number of rows processed by reducer at a single moment, or per second

Can help is greatly appreciated.

Regards
Abhi

Re: Cartesian Product in HIVE

Posted by Abhishek <ab...@gmail.com>.

Thanks for the reply Bejoy.

I did not any order by in the query.

Here are the properities I have used and query, table sizes

-----  set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;

CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
t2.col1,
t3.col1
FROM table2 t2
JOIN table3 t3

table2 : 997406 rows
total bytes: 20848934 -- 19.88 mb

table3 : 20773 rows
total bytes: 353127 -- 0.33 mb

#of Mappers: 4
#of reducers: 1

Regards
Abhi

On Sep 30, 2012, at 9:35 AM, Bejoy KS <be...@outlook.com> wrote:

> Hi Abshiek
> 
> 
> No need of any similar columns for map join to work. It is just taking the join process to mapper rather then  doing the same in a reducer.
> 
> The actual bottle neck is the single reducer. Need to figure out why only one reducer is fired rather than the set value of 17. Are you using ORDER BY in your query? If so, it sets the number of reducers to 1.
> 
> Can you provide the full console stack here so that we'll be able to understand your issue and help you better? (starting from the properties you set, your query and the error ). Also can you get the exact data sizes for two tables.
> 
> Regards
> Bejoy KS
> 
> > From: abhishek.dodda1@gmail.com
> > Date: Sat, 29 Sep 2012 07:44:06 -0700
> > Subject: Re: Cartesian Product in HIVE
> > To: user@hive.apache.org; bejoy_ks@yahoo.com
> > 
> > Thanks for the reply Bejoy.
> > 
> > I tried to map join, by setting the property mentioned by you and Even
> > increased the small table file size
> > 20k table size would be not more than 200 mb but it doesnot work.
> > 
> > Cartesian product of tables, they dont have any similar columns does
> > map join work here??
> > 
> > By applying below setting with STREAM TABLE HINT it was processing
> > around 5 Billion rows per hour,so process took around 4 hrs.
> > 
> > Set io.sort.mb=512
> > Set mapred.reduce.tasks=17
> > Set io.sort.factor=256
> > Set mapred.child.jvm.opts=-Xmx2048mb
> > Set hive.map.aggr=true
> > Set hive.exec.parallel=true
> > Set mapred.tasks.reuse.num.tasks=-1
> > Set hive.mapred.map.speculative.execution=false
> > Set hive.mapred.reduce.speculative.execution=false
> > 
> > By using this map join hint set hive.auto.convert.join = true; and
> > increasing the small table file size the job initiated but it was
> > 
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > 
> > Till 30 min it was like this, so i killed the task.
> > 
> > My doubts are:
> > 
> > -- I have increased the reducer number mapred.reduce.tasks to 17, but
> > the hive query engine fired only one reducer for the job.
> > -- I have slave node memory around 96 GB can i over ride some
> > parameters, other than the above mentioned and make efficient use of
> > it.
> > -- How can I increase number of rows processed by reducer at a single
> > moment or per second
> > -- Any other techniques to optimize the query
> > 
> > Thanks for response and your time Bejoy.
> > 
> > Regards
> > abhi
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> > > Hi Abshiek
> > >
> > > What is the data size of the 20k rows? If it is lesser then you can go in
> > > for map join, which will give you a performance boost.
> > >
> > > set hive.auto.convert.join = true;
> > > Regards
> > > Bejoy KS
> > >
> > > Sent from handheld, please excuse typos.
> > > ________________________________
> > > From: Abhishek <ab...@gmail.com>
> > > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > > To: Hive<us...@hive.apache.org>
> > > ReplyTo: user@hive.apache.org
> > > Cc: Bejoy Ks<be...@yahoo.com>
> > > Subject: Cartesian Product in HIVE
> > >
> > > Hi all,
> > >
> > > I have use case where we are doing Cartesian product of two tables with
> > > One table with
> > > 990k rows
> > > Second table
> > > 20k rows
> > >
> > > Query is Cartesian product of just two columns.
> > >
> > > So it comes around 20 billion rows
> > >
> > > For one hour it is processing like around 5 billion rows.
> > >
> > > So the process takes around 4 hrs.
> > >
> > > I have over riden some of the properties in hive
> > >
> > > Set io.sort.mb=512
> > >
> > > Set mapred.reduce.tasks=17
> > >
> > > Set io.sort.factor=256
> > > Set mapred.child.jvm.opts=-Xmx2048mb
> > > Set hive.map.aggr=true
> > > Set hive.exec.parallel=true
> > > Set mapred.tasks.reuse.num.tasks=-1
> > > Set hive.mapred.map.speculative.execution=false
> > > Set hive.mapred.reduce.speculative.execution=false
> > >
> > >
> > > How can optimize it to get better results.
> > >
> > > Even though I have set reduce tasks to 17, only one reduce is spawned for
> > > the query . Did I do some thing wrong ??
> > >
> > > My cluster configuration is having
> > > 20 slave nodes running cdh3u5.
> > > With 240 map slots
> > > 120 reduce slots
> > > Block size is 128 mb
> > > Memory on the slave node is 96GB
> > >
> > > How can the query perform better??
> > >
> > > How can I increase number of rows processed by reducer at a single moment,
> > > or per second
> > >
> > > Can help is greatly appreciated.
> > >
> > > Regards
> > > Abhi

Re: Cartesian Product in HIVE

Posted by abhishek dodda <ab...@gmail.com>.

Thanks for the reply Bejoy.

On Sun, Sep 30, 2012 at 6:35 AM, Bejoy KS <be...@outlook.com> wrote:
> Hi Abshiek
>
>
> No need of any similar columns for map join to work. It is just taking the
> join process to mapper rather then  doing the same in a reducer.
>
> The actual bottle neck is the single reducer. Need to figure out why only
> one reducer is fired rather than the set value of 17. Are you using ORDER BY
> in your query? If so, it sets the number of reducers to 1.

-----  I didnot use order by in the query.

>
> Can you provide the full console stack here so that we'll be able to
> understand your issue and help you better? (starting from the properties you
> set, your query and the error ). Also can you get the exact data sizes for
> two tables.

-----  set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;

CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
t2.col1,
t3.col1
FROM table2 t2
JOIN table3 t3

table2 : 997406 rows
total bytes: 20848934 -- 19.88 mb

table3 : 20773 rows
total bytes: 353127 -- 0.33 mb

#of Mappers: 4
#of reducers: 1



>
> Regards
> Bejoy KS
>
>> From: abhishek.dodda1@gmail.com
>> Date: Sat, 29 Sep 2012 07:44:06 -0700
>> Subject: Re: Cartesian Product in HIVE
>> To: user@hive.apache.org; bejoy_ks@yahoo.com
>
>>
>> Thanks for the reply Bejoy.
>>
>> I tried to map join, by setting the property mentioned by you and Even
>> increased the small table file size
>> 20k table size would be not more than 200 mb but it doesnot work.
>>
>> Cartesian product of tables, they dont have any similar columns does
>> map join work here??
>>
>> By applying below setting with STREAM TABLE HINT it was processing
>> around 5 Billion rows per hour,so process took around 4 hrs.
>>
>> Set io.sort.mb=512
>> Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
>>
>> By using this map join hint set hive.auto.convert.join = true; and
>> increasing the small table file size the job initiated but it was
>>
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>>
>> Till 30 min it was like this, so i killed the task.
>>
>> My doubts are:
>>
>> -- I have increased the reducer number mapred.reduce.tasks to 17, but
>> the hive query engine fired only one reducer for the job.
>> -- I have slave node memory around 96 GB can i over ride some
>> parameters, other than the above mentioned and make efficient use of
>> it.
>> -- How can I increase number of rows processed by reducer at a single
>> moment or per second
>> -- Any other techniques to optimize the query
>>
>> Thanks for response and your time Bejoy.
>>
>> Regards
>> abhi
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
>> > Hi Abshiek
>> >
>> > What is the data size of the 20k rows? If it is lesser then you can go
>> > in
>> > for map join, which will give you a performance boost.
>> >
>> > set hive.auto.convert.join = true;
>> > Regards
>> > Bejoy KS
>> >
>> > Sent from handheld, please excuse typos.
>> > ________________________________
>> > From: Abhishek <ab...@gmail.com>
>> > Date: Fri, 28 Sep 2012 23:56:16 -0400
>> > To: Hive<us...@hive.apache.org>
>> > ReplyTo: user@hive.apache.org
>> > Cc: Bejoy Ks<be...@yahoo.com>
>> > Subject: Cartesian Product in HIVE
>> >
>> > Hi all,
>> >
>> > I have use case where we are doing Cartesian product of two tables with
>> > One table with
>> > 990k rows
>> > Second table
>> > 20k rows
>> >
>> > Query is Cartesian product of just two columns.
>> >
>> > So it comes around 20 billion rows
>> >
>> > For one hour it is processing like around 5 billion rows.
>> >
>> > So the process takes around 4 hrs.
>> >
>> > I have over riden some of the properties in hive
>> >
>> > Set io.sort.mb=512
>> >
>> > Set mapred.reduce.tasks=17
>> >
>> > Set io.sort.factor=256
>> > Set mapred.child.jvm.opts=-Xmx2048mb
>> > Set hive.map.aggr=true
>> > Set hive.exec.parallel=true
>> > Set mapred.tasks.reuse.num.tasks=-1
>> > Set hive.mapred.map.speculative.execution=false
>> > Set hive.mapred.reduce.speculative.execution=false
>> >
>> >
>> > How can optimize it to get better results.
>> >
>> > Even though I have set reduce tasks to 17, only one reduce is spawned
>> > for
>> > the query . Did I do some thing wrong ??
>> >
>> > My cluster configuration is having
>> > 20 slave nodes running cdh3u5.
>> > With 240 map slots
>> > 120 reduce slots
>> > Block size is 128 mb
>> > Memory on the slave node is 96GB
>> >
>> > How can the query perform better??
>> >
>> > How can I increase number of rows processed by reducer at a single
>> > moment,
>> > or per second
>> >
>> > Can help is greatly appreciated.
>> >
>> > Regards
>> > Abhi

RE: Cartesian Product in HIVE

Posted by Bejoy KS <be...@outlook.com>.

Hi Abshiek


No need of any similar columns for map join to work. It is just taking the join process to mapper rather then  doing the same in a reducer.

The actual bottle neck is the single reducer. Need to figure out why only one reducer is fired rather than the set value of 17. Are you using ORDER BY in your query? If so, it sets the number of reducers to 1.

Can you provide the full console stack here so that we'll be able to understand your issue and help you better? (starting from the properties you set, your query and the error ). Also can you get the exact data sizes for two tables.

Regards
Bejoy KS

> From: abhishek.dodda1@gmail.com
> Date: Sat, 29 Sep 2012 07:44:06 -0700
> Subject: Re: Cartesian Product in HIVE
> To: user@hive.apache.org; bejoy_ks@yahoo.com
> 
> Thanks for the reply Bejoy.
> 
> I tried to map join, by setting the property mentioned by you and Even
> increased the small table file size
> 20k table size would be not more than 200 mb but it doesnot work.
> 
> Cartesian product of tables, they dont have any similar columns does
> map join work here??
> 
> By applying below setting with STREAM TABLE HINT it was processing
> around 5 Billion rows per hour,so process took around 4 hrs.
> 
>  Set io.sort.mb=512
>  Set mapred.reduce.tasks=17
>  Set io.sort.factor=256
>  Set mapred.child.jvm.opts=-Xmx2048mb
>  Set hive.map.aggr=true
>  Set hive.exec.parallel=true
>  Set mapred.tasks.reuse.num.tasks=-1
>  Set hive.mapred.map.speculative.execution=false
>  Set hive.mapred.reduce.speculative.execution=false
> 
> By using this map join hint set hive.auto.convert.join = true; and
> increasing the small table file size the job initiated but it was
> 
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> 
> Till 30 min it was like this, so i killed the task.
> 
> My doubts are:
> 
> -- I have increased the reducer number mapred.reduce.tasks to 17, but
> the hive query engine fired only one reducer for the job.
> -- I have slave node memory around 96 GB can i over ride some
> parameters, other than the above mentioned and make efficient use of
> it.
> -- How can I increase number of rows processed by reducer at a single
> moment or per second
> -- Any other techniques to optimize the query
> 
> Thanks for response and your time Bejoy.
> 
> Regards
> abhi
> 
> 
> 
> 
> 
> 
> 
> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> > Hi Abshiek
> >
> > What is the data size of the 20k rows? If it is lesser then you can go in
> > for map join, which will give you a performance boost.
> >
> > set hive.auto.convert.join = true;
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: Abhishek <ab...@gmail.com>
> > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > To: Hive<us...@hive.apache.org>
> > ReplyTo: user@hive.apache.org
> > Cc: Bejoy Ks<be...@yahoo.com>
> > Subject: Cartesian Product in HIVE
> >
> > Hi all,
> >
> > I have use case where we are doing Cartesian product of two tables with
> > One table with
> > 990k rows
> > Second table
> > 20k rows
> >
> > Query is Cartesian product of just two columns.
> >
> > So it comes around 20 billion rows
> >
> > For one hour it is processing like around 5 billion rows.
> >
> > So the process takes around 4 hrs.
> >
> > I have over riden some of the properties in hive
> >
> > Set io.sort.mb=512
> >
> >     Set mapred.reduce.tasks=17
> >
> > Set io.sort.factor=256
> > Set mapred.child.jvm.opts=-Xmx2048mb
> > Set hive.map.aggr=true
> > Set hive.exec.parallel=true
> > Set mapred.tasks.reuse.num.tasks=-1
> > Set hive.mapred.map.speculative.execution=false
> > Set hive.mapred.reduce.speculative.execution=false
> >
> >
> > How can optimize it to get better results.
> >
> > Even though I have set reduce tasks to 17, only one reduce is spawned for
> > the query . Did I do some thing wrong ??
> >
> > My cluster configuration is having
> > 20 slave nodes running cdh3u5.
> > With 240 map slots
> >         120 reduce slots
> >  Block size is 128 mb
> >  Memory on the slave node is 96GB
> >
> > How can the query perform better??
> >
> > How can I increase number of rows processed by reducer at a single moment,
> > or per second
> >
> > Can help is greatly appreciated.
> >
> > Regards
> > Abhi

Re: Cartesian Product in HIVE

Posted by abhishek dodda <ab...@gmail.com>.

Thanks for the reply Bejoy.

I tried to map join, by setting the property mentioned by you and Even
increased the small table file size
20k table size would be not more than 200 mb but it doesnot work.

Cartesian product of tables, they dont have any similar columns does
map join work here??

By applying below setting with STREAM TABLE HINT it was processing
around 5 Billion rows per hour,so process took around 4 hrs.

 Set io.sort.mb=512
 Set mapred.reduce.tasks=17
 Set io.sort.factor=256
 Set mapred.child.jvm.opts=-Xmx2048mb
 Set hive.map.aggr=true
 Set hive.exec.parallel=true
 Set mapred.tasks.reuse.num.tasks=-1
 Set hive.mapred.map.speculative.execution=false
 Set hive.mapred.reduce.speculative.execution=false

By using this map join hint set hive.auto.convert.join = true; and
increasing the small table file size the job initiated but it was

map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%

Till 30 min it was like this, so i killed the task.

My doubts are:

-- I have increased the reducer number mapred.reduce.tasks to 17, but
the hive query engine fired only one reducer for the job.
-- I have slave node memory around 96 GB can i over ride some
parameters, other than the above mentioned and make efficient use of
it.
-- How can I increase number of rows processed by reducer at a single
moment or per second
-- Any other techniques to optimize the query

Thanks for response and your time Bejoy.

Regards
abhi







On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Abshiek
>
> What is the data size of the 20k rows? If it is lesser then you can go in
> for map join, which will give you a performance boost.
>
> set hive.auto.convert.join = true;
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: Abhishek <ab...@gmail.com>
> Date: Fri, 28 Sep 2012 23:56:16 -0400
> To: Hive<us...@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Cc: Bejoy Ks<be...@yahoo.com>
> Subject: Cartesian Product in HIVE
>
> Hi all,
>
> I have use case where we are doing Cartesian product of two tables with
> One table with
> 990k rows
> Second table
> 20k rows
>
> Query is Cartesian product of just two columns.
>
> So it comes around 20 billion rows
>
> For one hour it is processing like around 5 billion rows.
>
> So the process takes around 4 hrs.
>
> I have over riden some of the properties in hive
>
> Set io.sort.mb=512
>
>     Set mapred.reduce.tasks=17
>
> Set io.sort.factor=256
> Set mapred.child.jvm.opts=-Xmx2048mb
> Set hive.map.aggr=true
> Set hive.exec.parallel=true
> Set mapred.tasks.reuse.num.tasks=-1
> Set hive.mapred.map.speculative.execution=false
> Set hive.mapred.reduce.speculative.execution=false
>
>
> How can optimize it to get better results.
>
> Even though I have set reduce tasks to 17, only one reduce is spawned for
> the query . Did I do some thing wrong ??
>
> My cluster configuration is having
> 20 slave nodes running cdh3u5.
> With 240 map slots
>         120 reduce slots
>  Block size is 128 mb
>  Memory on the slave node is 96GB
>
> How can the query perform better??
>
> How can I increase number of rows processed by reducer at a single moment,
> or per second
>
> Can help is greatly appreciated.
>
> Regards
> Abhi

Re: Cartesian Product in HIVE

Posted by Bejoy KS <be...@yahoo.com>.

Hi Abshiek

What is the data size of the 20k rows? If it is lesser then you can go in for map join, which will give you a performance boost.

set hive.auto.convert.join = true;

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Abhishek <ab...@gmail.com>
Date: Fri, 28 Sep 2012 23:56:16 
To: Hive<us...@hive.apache.org>
Reply-To: user@hive.apache.org
Cc: Bejoy Ks<be...@yahoo.com>
Subject: Cartesian Product in HIVE

Hi all,

I have use case where we are doing Cartesian product of two tables with
One table with 
990k rows
Second table 
20k rows

Query is Cartesian product of just two columns. 

So it comes around 20 billion rows

For one hour it is processing like around 5 billion rows.

So the process takes around 4 hrs.

I have over riden some of the properties in hive

>> Set io.sort.mb=512
    Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false

How can optimize it to get better results.

Even though I have set reduce tasks to 17, only one reduce is spawned for the query . Did I do some thing wrong ??

My cluster configuration is having
20 slave nodes running cdh3u5.
With 240 map slots
        120 reduce slots 
 Block size is 128 mb
 Memory on the slave node is 96GB

How can the query perform better??

How can I increase number of rows processed by reducer at a single moment, or per second

Can help is greatly appreciated.

Regards
Abhi