Posted to user@hive.apache.org by Jason Michael <jm...@videoegg.com> on 2009/07/14 06:44:31 UTC

Join optimization for star schemas

In our hive instance, we have one large fact-type table that joins to several dimension tables on integer keys.  I know from reading the Language Manual that in ordering joins it is best to join the largest table last in the sequence in order to minimize memory usage.  This won't work in the situation where you want to join the large fact table to more than one dimension.  Something like:

select ... from small_table1 join big_table on ... join small_table2 on ...

I have to imagine this is a pretty common pattern, is there any guidance for doing this sort of star schema join?


Re: Join optimization for star schemas

Posted by Namit Jain <nj...@facebook.com>.
You are doing two joins, and they are not merged into a single join because the join keys are different.


 1.  Map join is the best way to get around this; both dimension tables will be held in memory. This should work for you.
 2.  To make sure the fact table is the outermost table, break up the query (note that tmp1 must be created first; see the sketch after the example below):

select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Change it to:

insert overwrite table tmp1
select d1.d1_value, f.d2_key, f.measure
from d1 join f on d1.d1_key = f.d1_key;

select tmp1.d1_value, d2.d2_value, sum(measure)
from d2 join tmp1 on d2.d2_key = tmp1.d2_key
group by tmp1.d1_value, d2.d2_value;
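tmp1 must exist before the insert overwrite runs. A minimal sketch of its DDL, assuming it only needs the columns the second query reads:

create table tmp1 (d1_value string, d2_key int, measure int);

With this split, the big table (f in the first job, then tmp1 in the second) is always the last table in its join, which keeps the streamed side out of reducer memory.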


-namit



Re: Join optimization for star schemas

Posted by He Yongqiang <he...@software.ict.ac.cn>.
If the dimension tables only have a few thousand rows, then a map-side join
will be best.
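
As a concrete sketch (untested; MAPJOIN hint syntax as in Namit's earlier
message), your query would become:

select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from d1
join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Both joins then happen map-side, so no reduce records are partitioned by the
join keys, which should also sidestep the skew you describe in your second
question.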

What I described is not a real index push-down join; you can refer to
http://portal.acm.org/citation.cfm?id=564691.564754 for more details about
index push-down joins.
My previous description of the index push-down join is not suitable for your
situation because:
1) the dimension tables only have a few thousand rows;
2) your fact and dimension tables are not wide at all;
3) there are no predicates on the dimension tables.

An example to explain my previous description. Given:
create table f(f_key int, d1_key int, d2_key int, measure int);
create table d1(d1_key int, d1_value string);
create table d2(d2_key int, d2_value string);

create several pre-joined helper tables (these tables can be shared among
many related queries):
create table f_d1(f_key int, d1_key int, d1_attr ...);
create table f_d2(f_key int, d2_key int, d2_attr ...);

First join f_d1 with f_d2, then join the result with the big, wide fact table
(the big table comes last in the join order and is scanned only once).
This assumes f_d1 and f_d2 are much smaller than the fact and dimension
tables, which does not hold in your situation; here they would become much
bigger :( .
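
Spelled out end to end, the approach might look like the following sketch
(untested; d1_attr, d2_attr, and the predicate are hypothetical placeholders):

-- Helper tables: fact keys plus each dimension's most-used attributes,
-- populated once and reused across queries.
create table f_d1 (f_key int, d1_key int, d1_attr string);
create table f_d2 (f_key int, d2_key int, d2_attr string);

insert overwrite table f_d1
select f.f_key, d1.d1_key, d1.d1_attr
from d1 join f on d1.d1_key = f.d1_key;

insert overwrite table f_d2
select f.f_key, d2.d2_key, d2.d2_attr
from d2 join f on d2.d2_key = f.d2_key;

-- Join the helper tables first; the fact table comes last in the join
-- order, so it is streamed rather than buffered, and scanned only once.
select ...
from (select fd1.f_key, fd1.d1_attr, fd2.d2_attr
      from f_d1 fd1 join f_d2 fd2 on fd1.f_key = fd2.f_key
      where ...) joined_dims
join f on f.f_key = joined_dims.f_key;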

On 09-7-14 下午11:11, "David Lerman" <dl...@videoegg.com> wrote:

> Thanks.  Namit, map-join looks promising.  He, I'm not quite sure I'm
> following yet.
> 
> In our (simplified) case, we have a fact table with many millions of rows
> (call it f), which joins to two dimension tables (call them d1 and d2), each
> of which has at most a few thousand rows.
> 
> So:
> create table f(f_key int, d1_key int, d2_key int, measure int);
> create table d1(d1_key int, d1_value string);
> create table d2(d2_key int, d2_value string);
> 
> The facts are split among the dimensions extremely disproportionately, so
> although d2 has 1000 rows, 90% of the facts are linked to the top 10.
> 
> The query is:
> select d1.d1_value, d2.d2_value, sum(measure) from d1 join f on d1.d1_key =
> f.d1_key join d2 on d2.d2_key = f.d2_key group by d1.d1_value, d2.d2_value;
> 
> We're finding that though we have 100 reducers, 98 of them are executing
> very quickly while 2 get stuck with the vast majority of the reduce records
> and crash with out-of-memory exceptions even though they have multiple gigs
> of memory allocated.
> 
> This raises two questions:
> 
> 1.  With this kind of join (small x big x small), what's the best way to
> keep the memory usage down?  Map-join definitely looks promising, and He,
> I'd like to understand the index push down join approach so we can try it
> out as well.
> 
> 2.  I'm assuming we're getting those 2 reducers stuck with the whole load
> because the reduce records get partitioned by the join key and the vast
> majority of records use just a few keys.  Is there anything we can do about
> this?
> 
> Thanks so much!
> Dave
> 
> On 7/14/09 8:51 AM, "He Yongqiang" <he...@software.ict.ac.cn> wrote:
> 
>> mapside join is the most efficient, but as Namit mentioned, it also has some
>> limitation.
>> 
>>>> If you dimension tables are small, you can use map-join.
>> Can use temporary table for map-join if a dimension table  is not small but
>> predicates on that table can dramatically reduce number of rows.
>> 
>> Another optimization you can experiment is so called “index push down join”.
>> Since hive does not support index ( and of course no bitmap/bitvector) right
>> now, you can mimic the techniques by yourself.  The key is to pre-execute
>> several join tables,  and to find some ways first join small tables and then
>> join with the fact table.
>> 
>> create table dimension1_fact(d1_key, fact_key, attributes of dimension table
>> d1 which are mostly used in queries );
>> create table dimension2_fact(d2_key, fact_key, attributes of dimension table
>> d2 which are mostly used in queries );
>> etc ...
>> And you can translate you queries to these manually created tables.
>> The query looks like:
>> Select .... from  big_table  join (select factkey,... From dimension1_fact
>> join dimension2_fact on d1_key=d2.key where .....) join_dimension_table on
>> big_table.key = join_dimension_table .factkey
>> Can not sure this can reduce the execution time, and may increase the
>> execution time (you can do some experiments :) ).
>> 
>> 
>> On 09-7-14 下午1:45, "Namit Jain" <nj...@facebook.com> wrote:
>> 
>>> The large table is only a problem if the number of values for a given key
>>> are
>>> very large – they are stored in memory.
>>> 
>>> If you dimension tables are small, you can use map-join. That way, no
>>> reduction is needed. You need to specify the hint in the select clause
>>> and the list of tables which are small.
>>> 
>>> 
>>> For eg:
>>> 
>>> 
>>> select /*+ MAPJOIN(smalll_table1, small_table2) */  ... from small_table1
>>> join 
>>> big_table on ... join small_table2 on ...
>>> 
>>> 
>>> Thanks,
>>> -namit
>>> 
>>> 
>>> On 7/13/09 9:44 PM, "Jason Michael" <jm...@videoegg.com> wrote:
>>> 
>>>> In our hive instance, we have one large fact-type table that joins to
>>>> several 
>>>> dimension tables on integer keys.  I know from reading the Language Manual
>>>> that in ordering joins it is best to join the largest table last in the
>>>> sequence in order to minimize memory usage.  This won’t work in the
>>>> situation 
>>>> where you want to join the large fact table to more than one dimension.
>>>> Something like:
>>>> 
>>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>>> 
>>>> I have to imagine this is a pretty common pattern, is there any guidance
>>>> for
>>>> doing this sort of star schema join?
>>>> 
>>>> 
>>> 
> 



Re: Join optimization for star schemas

Posted by David Lerman <dl...@videoegg.com>.
Thanks.  Namit, map-join looks promising.  He, I'm not quite sure I'm
following yet.

In our (simplified) case, we have a fact table with many millions of rows
(call it f), which joins to two dimension tables (call them d1 and d2), each
of which has at most a few thousand rows.

So:
create table f(f_key int, d1_key int, d2_key int, measure int);
create table d1(d1_key int, d1_value string);
create table d2(d2_key int, d2_value string);

The facts are split among the dimension keys extremely disproportionately:
although d2 has 1000 rows, 90% of the facts link to its top 10 keys.

The query is:
select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

We're finding that although we have 100 reducers, 98 of them are executing
very quickly while 2 get stuck with the vast majority of the reduce records
and crash with out-of-memory exceptions even though they have multiple gigs
of memory allocated.

This raises two questions:

1.  With this kind of join (small x big x small), what's the best way to
keep the memory usage down?  Map-join definitely looks promising, and He,
I'd like to understand the index push down join approach so we can try it
out as well.

2.  I'm assuming we're getting those 2 reducers stuck with the whole load
because the reduce records get partitioned by the join key and the vast
majority of records use just a few keys.  Is there anything we can do about
this?

Thanks so much!
Dave



Re: Join optimization for star schemas

Posted by He Yongqiang <he...@software.ict.ac.cn>.
A map-side join is the most efficient approach but, as Namit mentioned, it
also has some limitations.

>>If your dimension tables are small, you can use map-join.
You can use a temporary table for the map-join if a dimension table is not
small but predicates on it can dramatically reduce the number of rows; a
sketch follows.
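
A minimal sketch of that idea (untested; dim, dim_filtered, and the predicate
are hypothetical placeholders):

-- Pre-create a table to hold the filtered copy of the large dimension.
create table dim_filtered (dim_key int, dim_value string);

-- Apply the selective predicates once, up front.
insert overwrite table dim_filtered
select dim_key, dim_value from dim where ...;

-- The filtered copy is now small enough to hold in memory for a map-join.
select /*+ MAPJOIN(dim_filtered) */ ...
from dim_filtered join big_table on dim_filtered.dim_key = big_table.dim_key;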

Another optimization you can experiment with is the so-called "index push-down
join". Since Hive does not support indexes (and of course no bitmap/bit-vector
indexes) right now, you can mimic the technique yourself. The key is to
pre-compute several join tables, and to find a way to join the small tables
first and then join the result with the fact table.

create table dimension1_fact(d1_key, fact_key, the attributes of dimension
table d1 most used in queries);
create table dimension2_fact(d2_key, fact_key, the attributes of dimension
table d2 most used in queries);
etc ...

Then you can translate your queries to use these manually created tables.
The query looks like:

select .... from big_table join
  (select fact_key, ... from dimension1_fact
   join dimension2_fact on dimension1_fact.fact_key = dimension2_fact.fact_key
   where .....) joined_dimensions
on big_table.key = joined_dimensions.fact_key;

I am not sure this will reduce the execution time; it may even increase it
(you can do some experiments :) ).



Re: Join optimization for star schemas

Posted by Namit Jain <nj...@facebook.com>.
The large table is only a problem if the number of values for a given key is very large, since all the values for a key are held in reducer memory during the join.

If your dimension tables are small, you can use a map-join. That way, no reduce phase is needed. You need to specify the hint in the select clause,
along with the list of tables which are small.


For example:


select /*+ MAPJOIN(small_table1, small_table2) */  ... from small_table1 join big_table on ... join small_table2 on ...


Thanks,
-namit

