You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by as...@gmail.com on 2010/12/24 03:50:35 UTC

(HIVE-1642) Convert join queries to map-join based on size of table/row

Hi,

I set hive.auto.convert.join=true and run the following query:

select t1.foo, count(t2.bar) from invites t1 join invites t2 on
(t1.foo=t2.foo) group by t1.foo;

I did not see it ran as map side join. Did I miss something? Is there any
precondition for this feature to work?

Thanks.

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by as...@gmail.com.
Thanks!

On Mon, Dec 27, 2010 at 11:56 PM, Liyin Tang <li...@gmail.com> wrote:

> Yes. Only execute one of the them.
>
> On 27 December 2010 23:43, <as...@gmail.com> wrote:
>
> > A question about the design doc:
> >
> > "If one of the tables is large and others are small enough to run Map
> Join,
> > then the Conditional Task will pick the corresponding Map Join Local Task
> > to
> > run."
> > Here you pick one table as big, hash all other tables into memory by join
> > key individually. If it works, it will be chosen. Can it be that if one
> of
> > the rest candidate tasks (chosing a different table as big table)
> performs
> > better (assuming), it will not be considered? Did I understand it right?
> >
>
> Thanks!
> > Alicia
> >
> > On Mon, Dec 27, 2010 at 12:05 AM, Liyin Tang <li...@gmail.com>
> wrote:
> >
> > > Hi,
> > > If multiple tables join on different join keys, it will be separated
> into
> > > multiple MapRed Tasks.
> > > Also the threshold of the small table file size means the sum of all
> the
> > > small table.
> > >
> > > There is a documentation and a  slide about this feature:
> > > http://www.slideshare.net/aiolos127/join-optimization-in-hive
> > > <http://www.slideshare.net/aiolos127/join-optimization-in-hive>
> > > http://wiki.apache.org/hadoop/Hive/JoinOptimization
> > >
> > > <http://wiki.apache.org/hadoop/Hive/JoinOptimization>Thanks
> > >
> > > On 26 December 2010 23:42, <as...@gmail.com> wrote:
> > >
> > > > Thanks for the reply. I want to get clarification on this feature.
> > > >
> > > > If one of the two joining tables table t1 is smaller than 25M and is
> > > > sharded, how does this feature work?
> > >
> > >
> > > > Suppose there are joins on multiple tables such as t1, t2 and t3. If
> t1
> > > and
> > > > t2 are smaller than 25M  and co-located with joining key, you can
> > stream
> > > t3
> > > > and map join them. What if t1 and t2 are not co-located with join
> keys?
> > > > What
> > > > if t1 and t2 together are bigger than 25M?
> > > >
> > > > Thanks in advance!
> > > > On Thu, Dec 23, 2010 at 7:16 PM, Liyin Tang <li...@gmail.com>
> > wrote:
> > > >
> > > > > Hi,
> > > > > How large is t1 and t2 ?
> > > > > if both of t1 and t2 is larger than 25M (a default threshold), the
> > > query
> > > > > processor will do the common join.
> > > > >
> > > > > Thanks
> > > > > Liyin
> > > > >
> > > > > On 23 December 2010 18:50, <as...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I set hive.auto.convert.join=true and run the following query:
> > > > > >
> > > > > > select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> > > > > > (t1.foo=t2.foo) group by t1.foo;
> > > > > >
> > > > > > I did not see it ran as map side join. Did I miss something? Is
> > there
> > > > any
> > > > > > precondition for this feature to work?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards
> > > > > -Liyin
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > -Liyin
> > >
> >
>
>
>
> --
> Best Regards
> -Liyin
>

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by Liyin Tang <li...@gmail.com>.
Yes. Only execute one of the them.

On 27 December 2010 23:43, <as...@gmail.com> wrote:

> A question about the design doc:
>
> "If one of the tables is large and others are small enough to run Map Join,
> then the Conditional Task will pick the corresponding Map Join Local Task
> to
> run."
> Here you pick one table as big, hash all other tables into memory by join
> key individually. If it works, it will be chosen. Can it be that if one of
> the rest candidate tasks (chosing a different table as big table) performs
> better (assuming), it will not be considered? Did I understand it right?
>

Thanks!
> Alicia
>
> On Mon, Dec 27, 2010 at 12:05 AM, Liyin Tang <li...@gmail.com> wrote:
>
> > Hi,
> > If multiple tables join on different join keys, it will be separated into
> > multiple MapRed Tasks.
> > Also the threshold of the small table file size means the sum of all the
> > small table.
> >
> > There is a documentation and a  slide about this feature:
> > http://www.slideshare.net/aiolos127/join-optimization-in-hive
> > <http://www.slideshare.net/aiolos127/join-optimization-in-hive>
> > http://wiki.apache.org/hadoop/Hive/JoinOptimization
> >
> > <http://wiki.apache.org/hadoop/Hive/JoinOptimization>Thanks
> >
> > On 26 December 2010 23:42, <as...@gmail.com> wrote:
> >
> > > Thanks for the reply. I want to get clarification on this feature.
> > >
> > > If one of the two joining tables table t1 is smaller than 25M and is
> > > sharded, how does this feature work?
> >
> >
> > > Suppose there are joins on multiple tables such as t1, t2 and t3. If t1
> > and
> > > t2 are smaller than 25M  and co-located with joining key, you can
> stream
> > t3
> > > and map join them. What if t1 and t2 are not co-located with join keys?
> > > What
> > > if t1 and t2 together are bigger than 25M?
> > >
> > > Thanks in advance!
> > > On Thu, Dec 23, 2010 at 7:16 PM, Liyin Tang <li...@gmail.com>
> wrote:
> > >
> > > > Hi,
> > > > How large is t1 and t2 ?
> > > > if both of t1 and t2 is larger than 25M (a default threshold), the
> > query
> > > > processor will do the common join.
> > > >
> > > > Thanks
> > > > Liyin
> > > >
> > > > On 23 December 2010 18:50, <as...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I set hive.auto.convert.join=true and run the following query:
> > > > >
> > > > > select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> > > > > (t1.foo=t2.foo) group by t1.foo;
> > > > >
> > > > > I did not see it ran as map side join. Did I miss something? Is
> there
> > > any
> > > > > precondition for this feature to work?
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards
> > > > -Liyin
> > > >
> > >
> >
> >
> >
> > --
> > Best Regards
> > -Liyin
> >
>



-- 
Best Regards
-Liyin

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by as...@gmail.com.
A question about the design doc:

"If one of the tables is large and others are small enough to run Map Join,
then the Conditional Task will pick the corresponding Map Join Local Task to
run."
Here you pick one table as big, hash all other tables into memory by join
key individually. If it works, it will be chosen. Can it be that if one of
the rest candidate tasks (chosing a different table as big table) performs
better (assuming), it will not be considered? Did I understand it right?

Thanks!
Alicia

On Mon, Dec 27, 2010 at 12:05 AM, Liyin Tang <li...@gmail.com> wrote:

> Hi,
> If multiple tables join on different join keys, it will be separated into
> multiple MapRed Tasks.
> Also the threshold of the small table file size means the sum of all the
> small table.
>
> There is a documentation and a  slide about this feature:
> http://www.slideshare.net/aiolos127/join-optimization-in-hive
> <http://www.slideshare.net/aiolos127/join-optimization-in-hive>
> http://wiki.apache.org/hadoop/Hive/JoinOptimization
>
> <http://wiki.apache.org/hadoop/Hive/JoinOptimization>Thanks
>
> On 26 December 2010 23:42, <as...@gmail.com> wrote:
>
> > Thanks for the reply. I want to get clarification on this feature.
> >
> > If one of the two joining tables table t1 is smaller than 25M and is
> > sharded, how does this feature work?
>
>
> > Suppose there are joins on multiple tables such as t1, t2 and t3. If t1
> and
> > t2 are smaller than 25M  and co-located with joining key, you can stream
> t3
> > and map join them. What if t1 and t2 are not co-located with join keys?
> > What
> > if t1 and t2 together are bigger than 25M?
> >
> > Thanks in advance!
> > On Thu, Dec 23, 2010 at 7:16 PM, Liyin Tang <li...@gmail.com> wrote:
> >
> > > Hi,
> > > How large is t1 and t2 ?
> > > if both of t1 and t2 is larger than 25M (a default threshold), the
> query
> > > processor will do the common join.
> > >
> > > Thanks
> > > Liyin
> > >
> > > On 23 December 2010 18:50, <as...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I set hive.auto.convert.join=true and run the following query:
> > > >
> > > > select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> > > > (t1.foo=t2.foo) group by t1.foo;
> > > >
> > > > I did not see it ran as map side join. Did I miss something? Is there
> > any
> > > > precondition for this feature to work?
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > -Liyin
> > >
> >
>
>
>
> --
> Best Regards
> -Liyin
>

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by Liyin Tang <li...@gmail.com>.
Hi,
If multiple tables join on different join keys, it will be separated into
multiple MapRed Tasks.
Also the threshold of the small table file size means the sum of all the
small table.

There is a documentation and a  slide about this feature:
http://www.slideshare.net/aiolos127/join-optimization-in-hive
<http://www.slideshare.net/aiolos127/join-optimization-in-hive>
http://wiki.apache.org/hadoop/Hive/JoinOptimization

<http://wiki.apache.org/hadoop/Hive/JoinOptimization>Thanks

On 26 December 2010 23:42, <as...@gmail.com> wrote:

> Thanks for the reply. I want to get clarification on this feature.
>
> If one of the two joining tables table t1 is smaller than 25M and is
> sharded, how does this feature work?


> Suppose there are joins on multiple tables such as t1, t2 and t3. If t1 and
> t2 are smaller than 25M  and co-located with joining key, you can stream t3
> and map join them. What if t1 and t2 are not co-located with join keys?
> What
> if t1 and t2 together are bigger than 25M?
>
> Thanks in advance!
> On Thu, Dec 23, 2010 at 7:16 PM, Liyin Tang <li...@gmail.com> wrote:
>
> > Hi,
> > How large is t1 and t2 ?
> > if both of t1 and t2 is larger than 25M (a default threshold), the query
> > processor will do the common join.
> >
> > Thanks
> > Liyin
> >
> > On 23 December 2010 18:50, <as...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I set hive.auto.convert.join=true and run the following query:
> > >
> > > select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> > > (t1.foo=t2.foo) group by t1.foo;
> > >
> > > I did not see it ran as map side join. Did I miss something? Is there
> any
> > > precondition for this feature to work?
> > >
> > > Thanks.
> > >
> >
> >
> >
> > --
> > Best Regards
> > -Liyin
> >
>



-- 
Best Regards
-Liyin

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by as...@gmail.com.
Thanks for the reply. I want to get clarification on this feature.

If one of the two joining tables table t1 is smaller than 25M and is
sharded, how does this feature work?

Suppose there are joins on multiple tables such as t1, t2 and t3. If t1 and
t2 are smaller than 25M  and co-located with joining key, you can stream t3
and map join them. What if t1 and t2 are not co-located with join keys? What
if t1 and t2 together are bigger than 25M?

Thanks in advance!
On Thu, Dec 23, 2010 at 7:16 PM, Liyin Tang <li...@gmail.com> wrote:

> Hi,
> How large is t1 and t2 ?
> if both of t1 and t2 is larger than 25M (a default threshold), the query
> processor will do the common join.
>
> Thanks
> Liyin
>
> On 23 December 2010 18:50, <as...@gmail.com> wrote:
>
> > Hi,
> >
> > I set hive.auto.convert.join=true and run the following query:
> >
> > select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> > (t1.foo=t2.foo) group by t1.foo;
> >
> > I did not see it ran as map side join. Did I miss something? Is there any
> > precondition for this feature to work?
> >
> > Thanks.
> >
>
>
>
> --
> Best Regards
> -Liyin
>

Re: (HIVE-1642) Convert join queries to map-join based on size of table/row

Posted by Liyin Tang <li...@gmail.com>.
Hi,
How large is t1 and t2 ?
if both of t1 and t2 is larger than 25M (a default threshold), the query
processor will do the common join.

Thanks
Liyin

On 23 December 2010 18:50, <as...@gmail.com> wrote:

> Hi,
>
> I set hive.auto.convert.join=true and run the following query:
>
> select t1.foo, count(t2.bar) from invites t1 join invites t2 on
> (t1.foo=t2.foo) group by t1.foo;
>
> I did not see it ran as map side join. Did I miss something? Is there any
> precondition for this feature to work?
>
> Thanks.
>



-- 
Best Regards
-Liyin