Posted to user@hive.apache.org by Baahu <ba...@gmail.com> on 2013/12/03 10:17:09 UTC

STREAMTABLE And MAPJOIN

Hi,
What is the difference between the STREAMTABLE and MAPJOIN hints?

Thanks,
Baahu

Re: STREAMTABLE And MAPJOIN

Posted by Lefty Leverenz <le...@gmail.com>.
This seems useful, so I added a sentence to the explanation of STREAMTABLE
in the JOINS wikidoc
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-Examples>:


>    - In every map/reduce stage of the join, the table to be streamed can be
>    specified via a hint, e.g. in
>
>        SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
>        FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
>
>    all the three tables are joined in a single map/reduce job and the
>    values for a particular value of the key for tables b and c are buffered in
>    the memory in the reducers. Then for each row retrieved from a, the join is
>    computed with the buffered rows. If the STREAMTABLE hint is omitted,
>    Hive streams the rightmost table in the join.
>
>
But I didn't specify inner joins.  Should that be made clear?

Thanks.  -- Lefty



Re: STREAMTABLE And MAPJOIN

Posted by Nitin Pawar <ni...@gmail.com>.
This is my understanding of both. Wait for the Hive gurus to correct me if
I made any mistake.


In Hive, when an inner join query runs, the table at the last position on
the right streams its records to the reducers. This is the default
behavior.

So say you have a query: select blah blah from t1 join t2 join t3 join t4
on (blah blah).
The maps emit key/value pairs for tables t1, t2 and t3, which are sent to
the reducers and buffered in memory, but the records of table t4 are
streamed to the reducers for better memory management. That's why it's
advised that you put the largest table on the right.

This default behavior is changed by STREAMTABLE(t1), where you tell Hive
which table's data you want to be streamed.
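
As a rough sketch of the two cases: the tables t1 to t4 and the columns key
and val below are hypothetical, and the joins are assumed to share the same
key so that Hive can run them in a single map/reduce job.

```sql
-- Default: no hint, so the rightmost table (t4) is streamed through the
-- reducers while t1, t2 and t3 are buffered. Put the largest table last.
SELECT t1.val, t4.val
FROM t1
JOIN t2 ON (t1.key = t2.key)
JOIN t3 ON (t2.key = t3.key)
JOIN t4 ON (t3.key = t4.key);

-- With the hint: t1 is streamed instead, and t2, t3 and t4 are buffered.
-- This helps when t1 is the largest table and reordering the query is awkward.
SELECT /*+ STREAMTABLE(t1) */ t1.val, t4.val
FROM t1
JOIN t2 ON (t1.key = t2.key)
JOIN t3 ON (t2.key = t3.key)
JOIN t4 ON (t3.key = t4.key);
```

Both queries should return the same rows; only the table that is streamed
rather than buffered differs.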

On the other hand, mapjoin is a concept where no reducers are
involved. It's a join where the smaller table is buffered into the memory
of each map and the join is then performed by the maps themselves. As the
smaller table's data is available in memory, these jobs are very fast
because the reduce step is completely removed.
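
A minimal sketch of the hint form, assuming a hypothetical large table named
big and a hypothetical table named small that fits in memory on each mapper.
Depending on the Hive version, the hint may be ignored unless
hive.ignore.mapjoin.hint is set to false, and Hive may convert joins to map
joins on its own when hive.auto.convert.join is enabled, so check the
settings for your release.

```sql
-- small is loaded into memory in each mapper; big is read by the mappers
-- and joined there, so the join itself needs no reduce stage.
SELECT /*+ MAPJOIN(small) */ big.key, big.val, small.val
FROM big
JOIN small ON (big.key = small.key);
```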




-- 
Nitin Pawar