You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Abhishek <ab...@gmail.com> on 2012/09/29 05:56:16 UTC
Cartesian Product in HIVE
Hi all,
I have use case where we are doing Cartesian product of two tables with
One table with
990k rows
Second table
20k rows
Query is Cartesian product of just two columns.
So it comes around 20 billion rows
For one hour it is processing like around 5 billion rows.
So the process takes around 4 hrs.
I have over riden some of the properties in hive
>> Set io.sort.mb=512
Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
How can optimize it to get better results.
Even though I have set reduce tasks to 17, only one reduce is spawned for the query . Did I do some thing wrong ??
My cluster configuration is having
20 slave nodes running cdh3u5.
With 240 map slots
120 reduce slots
Block size is 128 mb
Memory on the slave node is 96GB
How can the query perform better??
How can I increase number of rows processed by reducer at a single moment, or per second
Can help is greatly appreciated.
Regards
Abhi
Re: Cartesian Product in HIVE
Posted by Abhishek <ab...@gmail.com>.
Thanks for the reply Bejoy.
I did not any order by in the query.
Here are the properities I have used and query, table sizes
----- set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;
CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
t2.col1,
t3.col1
FROM table2 t2
JOIN table3 t3
table2 : 997406 rows
total bytes: 20848934 -- 19.88 mb
table3 : 20773 rows
total bytes: 353127 -- 0.33 mb
#of Mappers: 4
#of reducers: 1
Regards
Abhi
On Sep 30, 2012, at 9:35 AM, Bejoy KS <be...@outlook.com> wrote:
> Hi Abshiek
>
>
> No need of any similar columns for map join to work. It is just taking the join process to mapper rather then doing the same in a reducer.
>
> The actual bottle neck is the single reducer. Need to figure out why only one reducer is fired rather than the set value of 17. Are you using ORDER BY in your query? If so, it sets the number of reducers to 1.
>
> Can you provide the full console stack here so that we'll be able to understand your issue and help you better? (starting from the properties you set, your query and the error ). Also can you get the exact data sizes for two tables.
>
> Regards
> Bejoy KS
>
> > From: abhishek.dodda1@gmail.com
> > Date: Sat, 29 Sep 2012 07:44:06 -0700
> > Subject: Re: Cartesian Product in HIVE
> > To: user@hive.apache.org; bejoy_ks@yahoo.com
> >
> > Thanks for the reply Bejoy.
> >
> > I tried to map join, by setting the property mentioned by you and Even
> > increased the small table file size
> > 20k table size would be not more than 200 mb but it doesnot work.
> >
> > Cartesian product of tables, they dont have any similar columns does
> > map join work here??
> >
> > By applying below setting with STREAM TABLE HINT it was processing
> > around 5 Billion rows per hour,so process took around 4 hrs.
> >
> > Set io.sort.mb=512
> > Set mapred.reduce.tasks=17
> > Set io.sort.factor=256
> > Set mapred.child.jvm.opts=-Xmx2048mb
> > Set hive.map.aggr=true
> > Set hive.exec.parallel=true
> > Set mapred.tasks.reuse.num.tasks=-1
> > Set hive.mapred.map.speculative.execution=false
> > Set hive.mapred.reduce.speculative.execution=false
> >
> > By using this map join hint set hive.auto.convert.join = true; and
> > increasing the small table file size the job initiated but it was
> >
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> >
> > Till 30 min it was like this, so i killed the task.
> >
> > My doubts are:
> >
> > -- I have increased the reducer number mapred.reduce.tasks to 17, but
> > the hive query engine fired only one reducer for the job.
> > -- I have slave node memory around 96 GB can i over ride some
> > parameters, other than the above mentioned and make efficient use of
> > it.
> > -- How can I increase number of rows processed by reducer at a single
> > moment or per second
> > -- Any other techniques to optimize the query
> >
> > Thanks for response and your time Bejoy.
> >
> > Regards
> > abhi
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> > > Hi Abshiek
> > >
> > > What is the data size of the 20k rows? If it is lesser then you can go in
> > > for map join, which will give you a performance boost.
> > >
> > > set hive.auto.convert.join = true;
> > > Regards
> > > Bejoy KS
> > >
> > > Sent from handheld, please excuse typos.
> > > ________________________________
> > > From: Abhishek <ab...@gmail.com>
> > > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > > To: Hive<us...@hive.apache.org>
> > > ReplyTo: user@hive.apache.org
> > > Cc: Bejoy Ks<be...@yahoo.com>
> > > Subject: Cartesian Product in HIVE
> > >
> > > Hi all,
> > >
> > > I have use case where we are doing Cartesian product of two tables with
> > > One table with
> > > 990k rows
> > > Second table
> > > 20k rows
> > >
> > > Query is Cartesian product of just two columns.
> > >
> > > So it comes around 20 billion rows
> > >
> > > For one hour it is processing like around 5 billion rows.
> > >
> > > So the process takes around 4 hrs.
> > >
> > > I have over riden some of the properties in hive
> > >
> > > Set io.sort.mb=512
> > >
> > > Set mapred.reduce.tasks=17
> > >
> > > Set io.sort.factor=256
> > > Set mapred.child.jvm.opts=-Xmx2048mb
> > > Set hive.map.aggr=true
> > > Set hive.exec.parallel=true
> > > Set mapred.tasks.reuse.num.tasks=-1
> > > Set hive.mapred.map.speculative.execution=false
> > > Set hive.mapred.reduce.speculative.execution=false
> > >
> > >
> > > How can optimize it to get better results.
> > >
> > > Even though I have set reduce tasks to 17, only one reduce is spawned for
> > > the query . Did I do some thing wrong ??
> > >
> > > My cluster configuration is having
> > > 20 slave nodes running cdh3u5.
> > > With 240 map slots
> > > 120 reduce slots
> > > Block size is 128 mb
> > > Memory on the slave node is 96GB
> > >
> > > How can the query perform better??
> > >
> > > How can I increase number of rows processed by reducer at a single moment,
> > > or per second
> > >
> > > Can help is greatly appreciated.
> > >
> > > Regards
> > > Abhi
Re: Cartesian Product in HIVE
Posted by abhishek dodda <ab...@gmail.com>.
Thanks for the reply Bejoy.
On Sun, Sep 30, 2012 at 6:35 AM, Bejoy KS <be...@outlook.com> wrote:
> Hi Abshiek
>
>
> No need of any similar columns for map join to work. It is just taking the
> join process to mapper rather then doing the same in a reducer.
>
> The actual bottle neck is the single reducer. Need to figure out why only
> one reducer is fired rather than the set value of 17. Are you using ORDER BY
> in your query? If so, it sets the number of reducers to 1.
----- I didnot use order by in the query.
>
> Can you provide the full console stack here so that we'll be able to
> understand your issue and help you better? (starting from the properties you
> set, your query and the error ). Also can you get the exact data sizes for
> two tables.
----- set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;
CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
t2.col1,
t3.col1
FROM table2 t2
JOIN table3 t3
table2 : 997406 rows
total bytes: 20848934 -- 19.88 mb
table3 : 20773 rows
total bytes: 353127 -- 0.33 mb
#of Mappers: 4
#of reducers: 1
>
> Regards
> Bejoy KS
>
>> From: abhishek.dodda1@gmail.com
>> Date: Sat, 29 Sep 2012 07:44:06 -0700
>> Subject: Re: Cartesian Product in HIVE
>> To: user@hive.apache.org; bejoy_ks@yahoo.com
>
>>
>> Thanks for the reply Bejoy.
>>
>> I tried to map join, by setting the property mentioned by you and Even
>> increased the small table file size
>> 20k table size would be not more than 200 mb but it doesnot work.
>>
>> Cartesian product of tables, they dont have any similar columns does
>> map join work here??
>>
>> By applying below setting with STREAM TABLE HINT it was processing
>> around 5 Billion rows per hour,so process took around 4 hrs.
>>
>> Set io.sort.mb=512
>> Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
>>
>> By using this map join hint set hive.auto.convert.join = true; and
>> increasing the small table file size the job initiated but it was
>>
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>>
>> Till 30 min it was like this, so i killed the task.
>>
>> My doubts are:
>>
>> -- I have increased the reducer number mapred.reduce.tasks to 17, but
>> the hive query engine fired only one reducer for the job.
>> -- I have slave node memory around 96 GB can i over ride some
>> parameters, other than the above mentioned and make efficient use of
>> it.
>> -- How can I increase number of rows processed by reducer at a single
>> moment or per second
>> -- Any other techniques to optimize the query
>>
>> Thanks for response and your time Bejoy.
>>
>> Regards
>> abhi
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
>> > Hi Abshiek
>> >
>> > What is the data size of the 20k rows? If it is lesser then you can go
>> > in
>> > for map join, which will give you a performance boost.
>> >
>> > set hive.auto.convert.join = true;
>> > Regards
>> > Bejoy KS
>> >
>> > Sent from handheld, please excuse typos.
>> > ________________________________
>> > From: Abhishek <ab...@gmail.com>
>> > Date: Fri, 28 Sep 2012 23:56:16 -0400
>> > To: Hive<us...@hive.apache.org>
>> > ReplyTo: user@hive.apache.org
>> > Cc: Bejoy Ks<be...@yahoo.com>
>> > Subject: Cartesian Product in HIVE
>> >
>> > Hi all,
>> >
>> > I have use case where we are doing Cartesian product of two tables with
>> > One table with
>> > 990k rows
>> > Second table
>> > 20k rows
>> >
>> > Query is Cartesian product of just two columns.
>> >
>> > So it comes around 20 billion rows
>> >
>> > For one hour it is processing like around 5 billion rows.
>> >
>> > So the process takes around 4 hrs.
>> >
>> > I have over riden some of the properties in hive
>> >
>> > Set io.sort.mb=512
>> >
>> > Set mapred.reduce.tasks=17
>> >
>> > Set io.sort.factor=256
>> > Set mapred.child.jvm.opts=-Xmx2048mb
>> > Set hive.map.aggr=true
>> > Set hive.exec.parallel=true
>> > Set mapred.tasks.reuse.num.tasks=-1
>> > Set hive.mapred.map.speculative.execution=false
>> > Set hive.mapred.reduce.speculative.execution=false
>> >
>> >
>> > How can optimize it to get better results.
>> >
>> > Even though I have set reduce tasks to 17, only one reduce is spawned
>> > for
>> > the query . Did I do some thing wrong ??
>> >
>> > My cluster configuration is having
>> > 20 slave nodes running cdh3u5.
>> > With 240 map slots
>> > 120 reduce slots
>> > Block size is 128 mb
>> > Memory on the slave node is 96GB
>> >
>> > How can the query perform better??
>> >
>> > How can I increase number of rows processed by reducer at a single
>> > moment,
>> > or per second
>> >
>> > Can help is greatly appreciated.
>> >
>> > Regards
>> > Abhi
RE: Cartesian Product in HIVE
Posted by Bejoy KS <be...@outlook.com>.
Hi Abshiek
No need of any similar columns for map join to work. It is just taking the join process to mapper rather then doing the same in a reducer.
The actual bottle neck is the single reducer. Need to figure out why only one reducer is fired rather than the set value of 17. Are you using ORDER BY in your query? If so, it sets the number of reducers to 1.
Can you provide the full console stack here so that we'll be able to understand your issue and help you better? (starting from the properties you set, your query and the error ). Also can you get the exact data sizes for two tables.
Regards
Bejoy KS
> From: abhishek.dodda1@gmail.com
> Date: Sat, 29 Sep 2012 07:44:06 -0700
> Subject: Re: Cartesian Product in HIVE
> To: user@hive.apache.org; bejoy_ks@yahoo.com
>
> Thanks for the reply Bejoy.
>
> I tried to map join, by setting the property mentioned by you and Even
> increased the small table file size
> 20k table size would be not more than 200 mb but it doesnot work.
>
> Cartesian product of tables, they dont have any similar columns does
> map join work here??
>
> By applying below setting with STREAM TABLE HINT it was processing
> around 5 Billion rows per hour,so process took around 4 hrs.
>
> Set io.sort.mb=512
> Set mapred.reduce.tasks=17
> Set io.sort.factor=256
> Set mapred.child.jvm.opts=-Xmx2048mb
> Set hive.map.aggr=true
> Set hive.exec.parallel=true
> Set mapred.tasks.reuse.num.tasks=-1
> Set hive.mapred.map.speculative.execution=false
> Set hive.mapred.reduce.speculative.execution=false
>
> By using this map join hint set hive.auto.convert.join = true; and
> increasing the small table file size the job initiated but it was
>
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
>
> Till 30 min it was like this, so i killed the task.
>
> My doubts are:
>
> -- I have increased the reducer number mapred.reduce.tasks to 17, but
> the hive query engine fired only one reducer for the job.
> -- I have slave node memory around 96 GB can i over ride some
> parameters, other than the above mentioned and make efficient use of
> it.
> -- How can I increase number of rows processed by reducer at a single
> moment or per second
> -- Any other techniques to optimize the query
>
> Thanks for response and your time Bejoy.
>
> Regards
> abhi
>
>
>
>
>
>
>
> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> > Hi Abshiek
> >
> > What is the data size of the 20k rows? If it is lesser then you can go in
> > for map join, which will give you a performance boost.
> >
> > set hive.auto.convert.join = true;
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: Abhishek <ab...@gmail.com>
> > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > To: Hive<us...@hive.apache.org>
> > ReplyTo: user@hive.apache.org
> > Cc: Bejoy Ks<be...@yahoo.com>
> > Subject: Cartesian Product in HIVE
> >
> > Hi all,
> >
> > I have use case where we are doing Cartesian product of two tables with
> > One table with
> > 990k rows
> > Second table
> > 20k rows
> >
> > Query is Cartesian product of just two columns.
> >
> > So it comes around 20 billion rows
> >
> > For one hour it is processing like around 5 billion rows.
> >
> > So the process takes around 4 hrs.
> >
> > I have over riden some of the properties in hive
> >
> > Set io.sort.mb=512
> >
> > Set mapred.reduce.tasks=17
> >
> > Set io.sort.factor=256
> > Set mapred.child.jvm.opts=-Xmx2048mb
> > Set hive.map.aggr=true
> > Set hive.exec.parallel=true
> > Set mapred.tasks.reuse.num.tasks=-1
> > Set hive.mapred.map.speculative.execution=false
> > Set hive.mapred.reduce.speculative.execution=false
> >
> >
> > How can optimize it to get better results.
> >
> > Even though I have set reduce tasks to 17, only one reduce is spawned for
> > the query . Did I do some thing wrong ??
> >
> > My cluster configuration is having
> > 20 slave nodes running cdh3u5.
> > With 240 map slots
> > 120 reduce slots
> > Block size is 128 mb
> > Memory on the slave node is 96GB
> >
> > How can the query perform better??
> >
> > How can I increase number of rows processed by reducer at a single moment,
> > or per second
> >
> > Can help is greatly appreciated.
> >
> > Regards
> > Abhi
Re: Cartesian Product in HIVE
Posted by abhishek dodda <ab...@gmail.com>.
Thanks for the reply Bejoy.
I tried to map join, by setting the property mentioned by you and Even
increased the small table file size
20k table size would be not more than 200 mb but it doesnot work.
Cartesian product of tables, they dont have any similar columns does
map join work here??
By applying below setting with STREAM TABLE HINT it was processing
around 5 Billion rows per hour,so process took around 4 hrs.
Set io.sort.mb=512
Set mapred.reduce.tasks=17
Set io.sort.factor=256
Set mapred.child.jvm.opts=-Xmx2048mb
Set hive.map.aggr=true
Set hive.exec.parallel=true
Set mapred.tasks.reuse.num.tasks=-1
Set hive.mapred.map.speculative.execution=false
Set hive.mapred.reduce.speculative.execution=false
By using this map join hint set hive.auto.convert.join = true; and
increasing the small table file size the job initiated but it was
map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%
map 0 % -- reduce 0%
Till 30 min it was like this, so i killed the task.
My doubts are:
-- I have increased the reducer number mapred.reduce.tasks to 17, but
the hive query engine fired only one reducer for the job.
-- I have slave node memory around 96 GB can i over ride some
parameters, other than the above mentioned and make efficient use of
it.
-- How can I increase number of rows processed by reducer at a single
moment or per second
-- Any other techniques to optimize the query
Thanks for response and your time Bejoy.
Regards
abhi
On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Abshiek
>
> What is the data size of the 20k rows? If it is lesser then you can go in
> for map join, which will give you a performance boost.
>
> set hive.auto.convert.join = true;
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: Abhishek <ab...@gmail.com>
> Date: Fri, 28 Sep 2012 23:56:16 -0400
> To: Hive<us...@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Cc: Bejoy Ks<be...@yahoo.com>
> Subject: Cartesian Product in HIVE
>
> Hi all,
>
> I have use case where we are doing Cartesian product of two tables with
> One table with
> 990k rows
> Second table
> 20k rows
>
> Query is Cartesian product of just two columns.
>
> So it comes around 20 billion rows
>
> For one hour it is processing like around 5 billion rows.
>
> So the process takes around 4 hrs.
>
> I have over riden some of the properties in hive
>
> Set io.sort.mb=512
>
> Set mapred.reduce.tasks=17
>
> Set io.sort.factor=256
> Set mapred.child.jvm.opts=-Xmx2048mb
> Set hive.map.aggr=true
> Set hive.exec.parallel=true
> Set mapred.tasks.reuse.num.tasks=-1
> Set hive.mapred.map.speculative.execution=false
> Set hive.mapred.reduce.speculative.execution=false
>
>
> How can optimize it to get better results.
>
> Even though I have set reduce tasks to 17, only one reduce is spawned for
> the query . Did I do some thing wrong ??
>
> My cluster configuration is having
> 20 slave nodes running cdh3u5.
> With 240 map slots
> 120 reduce slots
> Block size is 128 mb
> Memory on the slave node is 96GB
>
> How can the query perform better??
>
> How can I increase number of rows processed by reducer at a single moment,
> or per second
>
> Can help is greatly appreciated.
>
> Regards
> Abhi
Re: Cartesian Product in HIVE
Posted by Bejoy KS <be...@yahoo.com>.
Hi Abshiek
What is the data size of the 20k rows? If it is lesser then you can go in for map join, which will give you a performance boost.
set hive.auto.convert.join = true;
Regards
Bejoy KS
Sent from handheld, please excuse typos.
-----Original Message-----
From: Abhishek <ab...@gmail.com>
Date: Fri, 28 Sep 2012 23:56:16
To: Hive<us...@hive.apache.org>
Reply-To: user@hive.apache.org
Cc: Bejoy Ks<be...@yahoo.com>
Subject: Cartesian Product in HIVE
Hi all,
I have use case where we are doing Cartesian product of two tables with
One table with
990k rows
Second table
20k rows
Query is Cartesian product of just two columns.
So it comes around 20 billion rows
For one hour it is processing like around 5 billion rows.
So the process takes around 4 hrs.
I have over riden some of the properties in hive
>> Set io.sort.mb=512
Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
How can optimize it to get better results.
Even though I have set reduce tasks to 17, only one reduce is spawned for the query . Did I do some thing wrong ??
My cluster configuration is having
20 slave nodes running cdh3u5.
With 240 map slots
120 reduce slots
Block size is 128 mb
Memory on the slave node is 96GB
How can the query perform better??
How can I increase number of rows processed by reducer at a single moment, or per second
Can help is greatly appreciated.
Regards
Abhi