Posted to dev@hive.apache.org by Navis류승우 <na...@nexr.com> on 2014/08/04 04:20:52 UTC

Re: Why does SMB join generate hash table locally, even if input tables are large?

I don't think hash table generation is needed for SMB joins. Could you
check the result of explain extended?
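
A minimal sketch of that check, assuming two hypothetical tables, big_a and
big_b, both bucketed and sorted on a hypothetical join column id (none of
these names come from the original query):

-- Hypothetical tables and columns, for illustration only.
EXPLAIN EXTENDED
SELECT a.id, b.val
FROM big_a a JOIN big_b b ON a.id = b.id;

An SMB plan would be expected to show a "Sorted Merge Bucket Map Join
Operator" in the map operator tree, while a plain map join typically shows
local work with a "HashTable Sink Operator". Exact operator labels may
differ between Hive versions.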

Thanks,
Navis


2014-07-31 4:08 GMT+09:00 Pala M Muthaia <mc...@rocketfuelinc.com>:

> +hive-users
>
>
> On Tue, Jul 29, 2014 at 1:56 PM, Pala M Muthaia <mchettiar@rocketfuelinc.com> wrote:
>
> > Hi,
> >
> > I am testing SMB join for 2 large tables. The tables are bucketed and
> > sorted on the join column. I notice that even though both tables are
> > large, Hive attempts to generate a hash table for the 'small' table
> > locally, similar to a map join. Since that table is large in my case,
> > the client runs out of memory and the query fails.
> >
> > I am using Hive 0.12 with the following settings:
> >
> > set hive.optimize.bucketmapjoin=true;
> > set hive.optimize.bucketmapjoin.sortedmerge=true;
> > set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> >
> > My test query does a simple join and a select, no subqueries/nested
> > queries etc.
> >
> > I understand why a (bucket) map join requires hash table generation, but
> > why is that included for an SMB join? Shouldn't an SMB join just spin up
> > one mapper for each bucket and perform a sort-merge join directly in the
> > mapper?
> >
> >
> > Thanks,
> > pala

Re: Why does SMB join generate hash table locally, even if input tables are large?

Posted by Pala M Muthaia <mc...@rocketfuelinc.com>.

If anybody is interested:

To enable SMB join, in addition to the config values listed above, I had
to set the following as well:

set hive.auto.convert.sortmerge.join = true;

By default, the value was false. After setting it, I saw a map-only job as
expected.
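
Putting the thread together, a sketch of the full set of session settings
that gave the map-only SMB plan in this case on Hive 0.12. The first three
come from the original post and the last is the one added here; defaults and
requirements may differ in other Hive versions:

-- From the original post
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
-- Added in this follow-up (was false by default in this setup)
set hive.auto.convert.sortmerge.join=true;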


Thanks.



Re: Why does SMB join generate hash table locally, even if input tables are large?

Posted by Pala M Muthaia <mc...@rocketfuelinc.com>.

Thanks for the response, Navis.

I tried the repro again from the beginning, and it doesn't result in hash
table generation. I may have had some setting that forced a map join. The
generated plan shows a conditional stage pointing to a simple map and
reduce stage.

At runtime, however, the query results in an MR job with a reduce stage that
performs the join.

Shouldn't an SMB join result in a map-only job for tables bucketed and sorted
on the join column? Is there a size restriction on SMB join (i.e., does SMB
join kick in only if bucket sizes are below some limit)?

Thanks.
