You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Igor Tatarinov <ig...@decide.com> on 2011/03/20 07:54:49 UTC

skew join optimization

I have the following join that takes 4.5 hours (with 12 nodes) mostly
because of a single reduce task that gets the bulk of the work:

SELECT ...
FROM T
LEFT OUTER JOIN S
ON T.timestamp = S.timestamp and T.id = S.id

This is a 1:0/1 join so the size of the output is exactly the same as the
size of T (500M records). S is actually very small (5K).

I've tried:
- switching the order of the join conditions
- using a different hash function setting (jenkins instead of murmur)
- using SET set hive.auto.convert.join = true;
- using SET hive.optimize.skewjoin = true;

but nothing helped :(

Anything else I can try?

Thanks!

Re: skew join optimization

Posted by yongqiang he <he...@gmail.com>.

skew join does not work together with map join.

Map join does not require any reducer.
Please double check the hive that you use has the auto map join feature.
If there is auto covert join is your hive, only SET set
hive.auto.convert.join = true; should do the work.

thanks
yongqiang
On Sun, Mar 20, 2011 at 8:22 AM, Edward Capriolo <ed...@gmail.com> wrote:
> On Sun, Mar 20, 2011 at 11:20 AM, Ted Yu <yu...@gmail.com> wrote:
>> How about link to http://imageshack.us/ or TinyPic ?
>>
>> Thanks
>>
>> On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo <ed...@gmail.com>
>> wrote:
>>>
>>> On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>>> > Can someone re-attach the missing figures for that wiki ?
>>> >
>>> > Thanks
>>> >
>>> > On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
>>> > <bh...@gmail.com> wrote:
>>> >>
>>> >> Hi Igor,
>>> >>
>>> >> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
>>> >> jira 1642 which automatically converts a normal join into map-join
>>> >> (Otherwise you can specify the mapjoin hints in the query itself.).
>>> >> Because your 'S' table is very small , it can be replicated across all
>>> >> the mappers and the reduce phase can be avoided. This can greatly
>>> >> reduce the runtime .. (See the results section in the page for
>>> >> details.).
>>> >>
>>> >> Hope this helps.
>>> >>
>>> >> Thanks
>>> >>
>>> >>
>>> >> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
>>> >> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>>> >> >> I have the following join that takes 4.5 hours (with 12 nodes)
>>> >> >> mostly
>>> >> >> because of a single reduce task that gets the bulk of the work:
>>> >> >> SELECT ...
>>> >> >> FROM T
>>> >> >> LEFT OUTER JOIN S
>>> >> >> ON T.timestamp = S.timestamp and T.id = S.id
>>> >> >> This is a 1:0/1 join so the size of the output is exactly the same
>>> >> >> as
>>> >> >> the
>>> >> >> size of T (500M records). S is actually very small (5K).
>>> >> >> I've tried:
>>> >> >> - switching the order of the join conditions
>>> >> >> - using a different hash function setting (jenkins instead of
>>> >> >> murmur)
>>> >> >> - using SET set hive.auto.convert.join = true;
>>> >> >
>>> >> > are you sure your query convert to mapjoin? if not,try use explicit
>>> >> > mapjoin hint.
>>> >> >
>>> >> >
>>> >> >> - using SET hive.optimize.skewjoin = true;
>>> >> >> but nothing helped :(
>>> >> >> Anything else I can try?
>>> >> >> Thanks!
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Bharath .V
>>> >> w:http://research.iiit.ac.in/~bharath.v
>>> >
>>> >
>>>
>>> The wiki does not allow images, confluence does but we have not moved
>>> their yet.
>>
>>
>
> Not a bad option, but lame that the foundation with the most popular
> web server in the world has to host our images elsewhere :(
>

Re: skew join optimization

Posted by Edward Capriolo <ed...@gmail.com>.

On Sun, Mar 20, 2011 at 11:20 AM, Ted Yu <yu...@gmail.com> wrote:
> How about link to http://imageshack.us/ or TinyPic ?
>
> Thanks
>
> On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>>
>> On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>> > Can someone re-attach the missing figures for that wiki ?
>> >
>> > Thanks
>> >
>> > On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
>> > <bh...@gmail.com> wrote:
>> >>
>> >> Hi Igor,
>> >>
>> >> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
>> >> jira 1642 which automatically converts a normal join into map-join
>> >> (Otherwise you can specify the mapjoin hints in the query itself.).
>> >> Because your 'S' table is very small , it can be replicated across all
>> >> the mappers and the reduce phase can be avoided. This can greatly
>> >> reduce the runtime .. (See the results section in the page for
>> >> details.).
>> >>
>> >> Hope this helps.
>> >>
>> >> Thanks
>> >>
>> >>
>> >> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
>> >> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>> >> >> I have the following join that takes 4.5 hours (with 12 nodes)
>> >> >> mostly
>> >> >> because of a single reduce task that gets the bulk of the work:
>> >> >> SELECT ...
>> >> >> FROM T
>> >> >> LEFT OUTER JOIN S
>> >> >> ON T.timestamp = S.timestamp and T.id = S.id
>> >> >> This is a 1:0/1 join so the size of the output is exactly the same
>> >> >> as
>> >> >> the
>> >> >> size of T (500M records). S is actually very small (5K).
>> >> >> I've tried:
>> >> >> - switching the order of the join conditions
>> >> >> - using a different hash function setting (jenkins instead of
>> >> >> murmur)
>> >> >> - using SET set hive.auto.convert.join = true;
>> >> >
>> >> > are you sure your query convert to mapjoin? if not,try use explicit
>> >> > mapjoin hint.
>> >> >
>> >> >
>> >> >> - using SET hive.optimize.skewjoin = true;
>> >> >> but nothing helped :(
>> >> >> Anything else I can try?
>> >> >> Thanks!
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Bharath .V
>> >> w:http://research.iiit.ac.in/~bharath.v
>> >
>> >
>>
>> The wiki does not allow images, confluence does but we have not moved
>> their yet.
>
>

Not a bad option, but lame that the foundation with the most popular
web server in the world has to host our images elsewhere :(

Re: skew join optimization

Posted by Edward Capriolo <ed...@gmail.com>.

On Sun, Mar 20, 2011 at 11:20 AM, Ted Yu <yu...@gmail.com> wrote:
> How about link to http://imageshack.us/ or TinyPic ?
>
> Thanks
>
> On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>>
>> On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>> > Can someone re-attach the missing figures for that wiki ?
>> >
>> > Thanks
>> >
>> > On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
>> > <bh...@gmail.com> wrote:
>> >>
>> >> Hi Igor,
>> >>
>> >> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
>> >> jira 1642 which automatically converts a normal join into map-join
>> >> (Otherwise you can specify the mapjoin hints in the query itself.).
>> >> Because your 'S' table is very small , it can be replicated across all
>> >> the mappers and the reduce phase can be avoided. This can greatly
>> >> reduce the runtime .. (See the results section in the page for
>> >> details.).
>> >>
>> >> Hope this helps.
>> >>
>> >> Thanks
>> >>
>> >>
>> >> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
>> >> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>> >> >> I have the following join that takes 4.5 hours (with 12 nodes)
>> >> >> mostly
>> >> >> because of a single reduce task that gets the bulk of the work:
>> >> >> SELECT ...
>> >> >> FROM T
>> >> >> LEFT OUTER JOIN S
>> >> >> ON T.timestamp = S.timestamp and T.id = S.id
>> >> >> This is a 1:0/1 join so the size of the output is exactly the same
>> >> >> as
>> >> >> the
>> >> >> size of T (500M records). S is actually very small (5K).
>> >> >> I've tried:
>> >> >> - switching the order of the join conditions
>> >> >> - using a different hash function setting (jenkins instead of
>> >> >> murmur)
>> >> >> - using SET set hive.auto.convert.join = true;
>> >> >
>> >> > are you sure your query convert to mapjoin? if not,try use explicit
>> >> > mapjoin hint.
>> >> >
>> >> >
>> >> >> - using SET hive.optimize.skewjoin = true;
>> >> >> but nothing helped :(
>> >> >> Anything else I can try?
>> >> >> Thanks!
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Bharath .V
>> >> w:http://research.iiit.ac.in/~bharath.v
>> >
>> >
>>
>> The wiki does not allow images, confluence does but we have not moved
>> their yet.
>
>

Not a bad option, but lame that the foundation with the most popular
web server in the world has to host our images elsewhere :(

Re: skew join optimization

Posted by Ted Yu <yu...@gmail.com>.

How about link to http://imageshack.us/ or TinyPic ?

Thanks

On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo <ed...@gmail.com>wrote:

> On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
> > Can someone re-attach the missing figures for that wiki ?
> >
> > Thanks
> >
> > On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
> > <bh...@gmail.com> wrote:
> >>
> >> Hi Igor,
> >>
> >> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
> >> jira 1642 which automatically converts a normal join into map-join
> >> (Otherwise you can specify the mapjoin hints in the query itself.).
> >> Because your 'S' table is very small , it can be replicated across all
> >> the mappers and the reduce phase can be avoided. This can greatly
> >> reduce the runtime .. (See the results section in the page for
> >> details.).
> >>
> >> Hope this helps.
> >>
> >> Thanks
> >>
> >>
> >> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
> >> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
> >> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
> >> >> because of a single reduce task that gets the bulk of the work:
> >> >> SELECT ...
> >> >> FROM T
> >> >> LEFT OUTER JOIN S
> >> >> ON T.timestamp = S.timestamp and T.id = S.id
> >> >> This is a 1:0/1 join so the size of the output is exactly the same as
> >> >> the
> >> >> size of T (500M records). S is actually very small (5K).
> >> >> I've tried:
> >> >> - switching the order of the join conditions
> >> >> - using a different hash function setting (jenkins instead of murmur)
> >> >> - using SET set hive.auto.convert.join = true;
> >> >
> >> > are you sure your query convert to mapjoin? if not,try use explicit
> >> > mapjoin hint.
> >> >
> >> >
> >> >> - using SET hive.optimize.skewjoin = true;
> >> >> but nothing helped :(
> >> >> Anything else I can try?
> >> >> Thanks!
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Bharath .V
> >> w:http://research.iiit.ac.in/~bharath.v
> >
> >
>
> The wiki does not allow images, confluence does but we have not moved their
> yet.
>

Re: skew join optimization

Posted by Ted Yu <yu...@gmail.com>.

How about link to http://imageshack.us/ or TinyPic ?

Thanks

On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo <ed...@gmail.com>wrote:

> On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
> > Can someone re-attach the missing figures for that wiki ?
> >
> > Thanks
> >
> > On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
> > <bh...@gmail.com> wrote:
> >>
> >> Hi Igor,
> >>
> >> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
> >> jira 1642 which automatically converts a normal join into map-join
> >> (Otherwise you can specify the mapjoin hints in the query itself.).
> >> Because your 'S' table is very small , it can be replicated across all
> >> the mappers and the reduce phase can be avoided. This can greatly
> >> reduce the runtime .. (See the results section in the page for
> >> details.).
> >>
> >> Hope this helps.
> >>
> >> Thanks
> >>
> >>
> >> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
> >> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
> >> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
> >> >> because of a single reduce task that gets the bulk of the work:
> >> >> SELECT ...
> >> >> FROM T
> >> >> LEFT OUTER JOIN S
> >> >> ON T.timestamp = S.timestamp and T.id = S.id
> >> >> This is a 1:0/1 join so the size of the output is exactly the same as
> >> >> the
> >> >> size of T (500M records). S is actually very small (5K).
> >> >> I've tried:
> >> >> - switching the order of the join conditions
> >> >> - using a different hash function setting (jenkins instead of murmur)
> >> >> - using SET set hive.auto.convert.join = true;
> >> >
> >> > are you sure your query convert to mapjoin? if not,try use explicit
> >> > mapjoin hint.
> >> >
> >> >
> >> >> - using SET hive.optimize.skewjoin = true;
> >> >> but nothing helped :(
> >> >> Anything else I can try?
> >> >> Thanks!
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Bharath .V
> >> w:http://research.iiit.ac.in/~bharath.v
> >
> >
>
> The wiki does not allow images, confluence does but we have not moved their
> yet.
>

Re: skew join optimization

Posted by Edward Capriolo <ed...@gmail.com>.

On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
> Can someone re-attach the missing figures for that wiki ?
>
> Thanks
>
> On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
> <bh...@gmail.com> wrote:
>>
>> Hi Igor,
>>
>> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
>> jira 1642 which automatically converts a normal join into map-join
>> (Otherwise you can specify the mapjoin hints in the query itself.).
>> Because your 'S' table is very small , it can be replicated across all
>> the mappers and the reduce phase can be avoided. This can greatly
>> reduce the runtime .. (See the results section in the page for
>> details.).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
>> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
>> >> because of a single reduce task that gets the bulk of the work:
>> >> SELECT ...
>> >> FROM T
>> >> LEFT OUTER JOIN S
>> >> ON T.timestamp = S.timestamp and T.id = S.id
>> >> This is a 1:0/1 join so the size of the output is exactly the same as
>> >> the
>> >> size of T (500M records). S is actually very small (5K).
>> >> I've tried:
>> >> - switching the order of the join conditions
>> >> - using a different hash function setting (jenkins instead of murmur)
>> >> - using SET set hive.auto.convert.join = true;
>> >
>> > are you sure your query convert to mapjoin? if not,try use explicit
>> > mapjoin hint.
>> >
>> >
>> >> - using SET hive.optimize.skewjoin = true;
>> >> but nothing helped :(
>> >> Anything else I can try?
>> >> Thanks!
>> >
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w:http://research.iiit.ac.in/~bharath.v
>
>

The wiki does not allow images, confluence does but we have not moved their yet.

Re: skew join optimization

Posted by Edward Capriolo <ed...@gmail.com>.

On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
> Can someone re-attach the missing figures for that wiki ?
>
> Thanks
>
> On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
> <bh...@gmail.com> wrote:
>>
>> Hi Igor,
>>
>> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
>> jira 1642 which automatically converts a normal join into map-join
>> (Otherwise you can specify the mapjoin hints in the query itself.).
>> Because your 'S' table is very small , it can be replicated across all
>> the mappers and the reduce phase can be avoided. This can greatly
>> reduce the runtime .. (See the results section in the page for
>> details.).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
>> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
>> >> because of a single reduce task that gets the bulk of the work:
>> >> SELECT ...
>> >> FROM T
>> >> LEFT OUTER JOIN S
>> >> ON T.timestamp = S.timestamp and T.id = S.id
>> >> This is a 1:0/1 join so the size of the output is exactly the same as
>> >> the
>> >> size of T (500M records). S is actually very small (5K).
>> >> I've tried:
>> >> - switching the order of the join conditions
>> >> - using a different hash function setting (jenkins instead of murmur)
>> >> - using SET set hive.auto.convert.join = true;
>> >
>> > are you sure your query convert to mapjoin? if not,try use explicit
>> > mapjoin hint.
>> >
>> >
>> >> - using SET hive.optimize.skewjoin = true;
>> >> but nothing helped :(
>> >> Anything else I can try?
>> >> Thanks!
>> >
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w:http://research.iiit.ac.in/~bharath.v
>
>

The wiki does not allow images, confluence does but we have not moved their yet.

Re: skew join optimization

Posted by Ted Yu <yu...@gmail.com>.

Can someone re-attach the missing figures for that wiki ?

Thanks

On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Hi Igor,
>
> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
> jira 1642 which automatically converts a normal join into map-join
> (Otherwise you can specify the mapjoin hints in the query itself.).
> Because your 'S' table is very small , it can be replicated across all
> the mappers and the reduce phase can be avoided. This can greatly
> reduce the runtime .. (See the results section in the page for
> details.).
>
> Hope this helps.
>
> Thanks
>
>
> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
> >> because of a single reduce task that gets the bulk of the work:
> >> SELECT ...
> >> FROM T
> >> LEFT OUTER JOIN S
> >> ON T.timestamp = S.timestamp and T.id = S.id
> >> This is a 1:0/1 join so the size of the output is exactly the same as
> the
> >> size of T (500M records). S is actually very small (5K).
> >> I've tried:
> >> - switching the order of the join conditions
> >> - using a different hash function setting (jenkins instead of murmur)
> >> - using SET set hive.auto.convert.join = true;
> >
> > are you sure your query convert to mapjoin? if not,try use explicit
> > mapjoin hint.
> >
> >
> >> - using SET hive.optimize.skewjoin = true;
> >> but nothing helped :(
> >> Anything else I can try?
> >> Thanks!
> >
>
>
>
> --
> Regards,
> Bharath .V
> w:http://research.iiit.ac.in/~bharath.v
>

Re: skew join optimization

Posted by Ted Yu <yu...@gmail.com>.

Can someone re-attach the missing figures for that wiki ?

Thanks

On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Hi Igor,
>
> See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
> jira 1642 which automatically converts a normal join into map-join
> (Otherwise you can specify the mapjoin hints in the query itself.).
> Because your 'S' table is very small , it can be replicated across all
> the mappers and the reduce phase can be avoided. This can greatly
> reduce the runtime .. (See the results section in the page for
> details.).
>
> Hope this helps.
>
> Thanks
>
>
> On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
> > 2011/3/20 Igor Tatarinov <ig...@decide.com>:
> >> I have the following join that takes 4.5 hours (with 12 nodes) mostly
> >> because of a single reduce task that gets the bulk of the work:
> >> SELECT ...
> >> FROM T
> >> LEFT OUTER JOIN S
> >> ON T.timestamp = S.timestamp and T.id = S.id
> >> This is a 1:0/1 join so the size of the output is exactly the same as
> the
> >> size of T (500M records). S is actually very small (5K).
> >> I've tried:
> >> - switching the order of the join conditions
> >> - using a different hash function setting (jenkins instead of murmur)
> >> - using SET set hive.auto.convert.join = true;
> >
> > are you sure your query convert to mapjoin? if not,try use explicit
> > mapjoin hint.
> >
> >
> >> - using SET hive.optimize.skewjoin = true;
> >> but nothing helped :(
> >> Anything else I can try?
> >> Thanks!
> >
>
>
>
> --
> Regards,
> Bharath .V
> w:http://research.iiit.ac.in/~bharath.v
>

Re: skew join optimization

Posted by bharath vissapragada <bh...@gmail.com>.

Hi Igor,

See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
jira 1642 which automatically converts a normal join into map-join
(Otherwise you can specify the mapjoin hints in the query itself.).
Because your 'S' table is very small , it can be replicated across all
the mappers and the reduce phase can be avoided. This can greatly
reduce the runtime .. (See the results section in the page for
details.).

Hope this helps.

Thanks

On Sun, Mar 20, 2011 at 6:37 PM, Jov <zh...@gmail.com> wrote:
> 2011/3/20 Igor Tatarinov <ig...@decide.com>:
>> I have the following join that takes 4.5 hours (with 12 nodes) mostly
>> because of a single reduce task that gets the bulk of the work:
>> SELECT ...
>> FROM T
>> LEFT OUTER JOIN S
>> ON T.timestamp = S.timestamp and T.id = S.id
>> This is a 1:0/1 join so the size of the output is exactly the same as the
>> size of T (500M records). S is actually very small (5K).
>> I've tried:
>> - switching the order of the join conditions
>> - using a different hash function setting (jenkins instead of murmur)
>> - using SET set hive.auto.convert.join = true;
>
> are you sure your query convert to mapjoin? if not,try use explicit
> mapjoin hint.
>
>
>> - using SET hive.optimize.skewjoin = true;
>> but nothing helped :(
>> Anything else I can try?
>> Thanks!
>

-- 
Regards,
Bharath .V
w:http://research.iiit.ac.in/~bharath.v

Re: skew join optimization

Posted by Igor Tatarinov <ig...@decide.com>.

Thanks everyone!

I had a typo when setting auto convert to true. You can actually see it in
my first email ('set' was repeated twice but there was no syntax error).
With map joins enabled, my join finished in 30 minutes. Sweet!

Looks like 'true' should be the default option for auto.convert

Another article explaining hive joins with pictures can be found at
http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919


On Sun, Mar 20, 2011 at 6:07 AM, Jov <zh...@gmail.com> wrote:

> 2011/3/20 Igor Tatarinov <ig...@decide.com>:
> > I have the following join that takes 4.5 hours (with 12 nodes) mostly
> > because of a single reduce task that gets the bulk of the work:
> > SELECT ...
> > FROM T
> > LEFT OUTER JOIN S
> > ON T.timestamp = S.timestamp and T.id = S.id
> > This is a 1:0/1 join so the size of the output is exactly the same as the
> > size of T (500M records). S is actually very small (5K).
> > I've tried:
> > - switching the order of the join conditions
> > - using a different hash function setting (jenkins instead of murmur)
> > - using SET set hive.auto.convert.join = true;
>
> are you sure your query convert to mapjoin? if not,try use explicit
> mapjoin hint.
>
>
> > - using SET hive.optimize.skewjoin = true;
> > but nothing helped :(
> > Anything else I can try?
> > Thanks!
>

Re: skew join optimization

Posted by Jov <zh...@gmail.com>.

2011/3/20 Igor Tatarinov <ig...@decide.com>:
> I have the following join that takes 4.5 hours (with 12 nodes) mostly
> because of a single reduce task that gets the bulk of the work:
> SELECT ...
> FROM T
> LEFT OUTER JOIN S
> ON T.timestamp = S.timestamp and T.id = S.id
> This is a 1:0/1 join so the size of the output is exactly the same as the
> size of T (500M records). S is actually very small (5K).
> I've tried:
> - switching the order of the join conditions
> - using a different hash function setting (jenkins instead of murmur)
> - using SET set hive.auto.convert.join = true;

are you sure your query convert to mapjoin? if not,try use explicit
mapjoin hint.


> - using SET hive.optimize.skewjoin = true;
> but nothing helped :(
> Anything else I can try?
> Thanks!