You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Avrilia Floratou <av...@gmail.com> on 2013/12/11 02:05:38 UTC

Question on correlation optimizer

Hi,

I'm running TPCH query 21 on Hive. 0.12 and have enabled
hive.optimize.correlation.
I could see the effect of the correlation optimizer on query 17 but when
running query 21 I don't actually see the optimizer being used. I used the
publicly available tpc-h queries for hive and merged all the intermediate
subqueries into one for Q21. In this query there is a correlation between
multiple subqueries since they all get lineitem as input. But what I
observe from the query plan and the execution of the query is that the
subqueries are executed one by one and their results are materialized
before the joins among them are executed. Is there any other parameter that
I need to set to make this work?

Thanks,
Avrilia

Re: Question on correlation optimizer

Posted by Avrilia Floratou <av...@gmail.com>.
Hi Yin,

Thank you very much for the detailed response.

Avrilia


On Wed, Dec 11, 2013 at 6:34 AM, Yin Huai <hu...@gmail.com> wrote:

> Hi Avrilia,
>
> I am attaching the operator tree in the reduce side.
> [image: Inline image 1]
>
> When printing out the plan, MuxOperator was called twice from its both
> parents. So, the plan from the MuxOperator was printed out twice. In the
> reduce phase, those two JoinOperators appearing immediately after the
> MuxOperator are the same JoinOperator. Another JoinOperator showing the
> following info is the one joining lineitem and part.
> Join Operator
>             condition map:
>                  Inner Join 0 to 1
>             condition expressions:
>               0
>               1 {VALUE._col1} {VALUE._col4} {VALUE._col5}
>             handleSkewJoin: false
>             outputColumnNames: _col12, _col15, _col16
>             Select Operator
>               expressions:
>                     expr: _col15
>                     type: double
>                     expr: _col12
>                     type: int
>                     expr: _col16
>                     type: double
>               outputColumnNames: _col0, _col1, _col2
>
> I created https://issues.apache.org/jira/browse/HIVE-6007 to track the
> cleanup of the EXPLAIN output. I will find time to work on it.
>
> Thanks,
>
> Yin
>
>
> On Tue, Dec 10, 2013 at 11:19 PM, Avrilia Floratou <
> avrilia.floratou@gmail.com> wrote:
>
>> Hi Yin,
>>
>> Thanks for the detailed explanation. I have one more question for the
>> correlation optimizer. When I ran explain in query 17 I get the plan for
>> stage 1 where the bulk of the time goes. I can understand what is happening
>> in the map phase but the reduce phase confuses me when the optimizer kicks
>> in, mainly because I see many join operators and I can't find any join
>> between the part and lineitem table.
>>
>> I see that in the reduce phase there is a global aggregation on
>> avg(l_quantity) and then I can see 3 join operators. Why are there 3 joins?
>> Also what exactly are the demux and mux operators? Are they related to the
>> correlation optimizer or not?
>>
>> The explain output is attached and the query is the following:
>>
>> select
>>   sum(l_extendedprice) / 7.0 as avg_yearly
>> from
>>   (select l_quantity, l_extendedprice, t_avg_quantity from
>>    (select
>>        l_partkey as t_partkey, 0.2 * avg(l_quantity) as t_avg_quantity
>>      from lineitem
>>      group by l_partkey
>>      ) t join
>>      (select
>>         l_quantity, l_partkey, l_extendedprice
>>       from
>>         part p join lineitem l
>>         on
>>           p.p_partkey = l.l_partkey
>>           and p.p_brand = 'Brand#23'
>>           and p.p_container = 'MED BOX'
>>       ) l1 on l1.l_partkey = t.t_partkey
>>    ) a
>> where l_quantity < t_avg_quantity;
>>
>>
>> Thanks,
>> Avrilia
>>
>>
>>
>> On Tue, Dec 10, 2013 at 7:38 PM, Yin Huai <hu...@gmail.com> wrote:
>>
>>> Hi Avrilia,
>>>
>>> It is caused by distinct aggregations in TPC-H Q21. Because Hive adds
>>> those distinct columns in the key columns of ReduceSinkOperators and
>>> correlation optimizer only check exact same key columns right now, this
>>> query will not be optimized. The jira of this issue is
>>> https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct
>>> from those aggregation functions, you will see the optimized plan. Also,
>>> another kind of cases that the correlation optimizer does not optimize
>>> right now is that a table is used in multiple MR jobs but rows in this
>>> table are shuffled in different ways.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>>
>>> On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <
>>> avrilia.floratou@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running TPCH query 21 on Hive. 0.12 and have enabled hive.optimize.correlation.
>>>> I could see the effect of the correlation optimizer on query 17 but when
>>>> running query 21 I don't actually see the optimizer being used. I used the
>>>> publicly available tpc-h queries for hive and merged all the intermediate
>>>> subqueries into one for Q21. In this query there is a correlation
>>>> between multiple subqueries since they all get lineitem as input. But what
>>>> I observe from the query plan and the execution of the query is that the
>>>> subqueries are executed one by one and their results are materialized
>>>> before the joins among them are executed. Is there any other parameter that
>>>> I need to set to make this work?
>>>>
>>>> Thanks,
>>>> Avrilia
>>>>
>>>
>>>
>>
>

Re: Question on correlation optimizer

Posted by Yin Huai <hu...@gmail.com>.
Hi Avrilia,

I am attaching the operator tree in the reduce side.
[image: Inline image 1]

When printing out the plan, MuxOperator was called twice from its both
parents. So, the plan from the MuxOperator was printed out twice. In the
reduce phase, those two JoinOperators appearing immediately after the
MuxOperator are the same JoinOperator. Another JoinOperator showing the
following info is the one joining lineitem and part.
Join Operator
            condition map:
                 Inner Join 0 to 1
            condition expressions:
              0
              1 {VALUE._col1} {VALUE._col4} {VALUE._col5}
            handleSkewJoin: false
            outputColumnNames: _col12, _col15, _col16
            Select Operator
              expressions:
                    expr: _col15
                    type: double
                    expr: _col12
                    type: int
                    expr: _col16
                    type: double
              outputColumnNames: _col0, _col1, _col2

I created https://issues.apache.org/jira/browse/HIVE-6007 to track the
cleanup of the EXPLAIN output. I will find time to work on it.

Thanks,

Yin


On Tue, Dec 10, 2013 at 11:19 PM, Avrilia Floratou <
avrilia.floratou@gmail.com> wrote:

> Hi Yin,
>
> Thanks for the detailed explanation. I have one more question for the
> correlation optimizer. When I ran explain in query 17 I get the plan for
> stage 1 where the bulk of the time goes. I can understand what is happening
> in the map phase but the reduce phase confuses me when the optimizer kicks
> in, mainly because I see many join operators and I can't find any join
> between the part and lineitem table.
>
> I see that in the reduce phase there is a global aggregation on
> avg(l_quantity) and then I can see 3 join operators. Why are there 3 joins?
> Also what exactly are the demux and mux operators? Are they related to the
> correlation optimizer or not?
>
> The explain output is attached and the query is the following:
>
> select
>   sum(l_extendedprice) / 7.0 as avg_yearly
> from
>   (select l_quantity, l_extendedprice, t_avg_quantity from
>    (select
>        l_partkey as t_partkey, 0.2 * avg(l_quantity) as t_avg_quantity
>      from lineitem
>      group by l_partkey
>      ) t join
>      (select
>         l_quantity, l_partkey, l_extendedprice
>       from
>         part p join lineitem l
>         on
>           p.p_partkey = l.l_partkey
>           and p.p_brand = 'Brand#23'
>           and p.p_container = 'MED BOX'
>       ) l1 on l1.l_partkey = t.t_partkey
>    ) a
> where l_quantity < t_avg_quantity;
>
>
> Thanks,
> Avrilia
>
>
>
> On Tue, Dec 10, 2013 at 7:38 PM, Yin Huai <hu...@gmail.com> wrote:
>
>> Hi Avrilia,
>>
>> It is caused by distinct aggregations in TPC-H Q21. Because Hive adds
>> those distinct columns in the key columns of ReduceSinkOperators and
>> correlation optimizer only check exact same key columns right now, this
>> query will not be optimized. The jira of this issue is
>> https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct
>> from those aggregation functions, you will see the optimized plan. Also,
>> another kind of cases that the correlation optimizer does not optimize
>> right now is that a table is used in multiple MR jobs but rows in this
>> table are shuffled in different ways.
>>
>> Thanks,
>>
>> Yin
>>
>>
>> On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <
>> avrilia.floratou@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm running TPCH query 21 on Hive. 0.12 and have enabled hive.optimize.correlation.
>>> I could see the effect of the correlation optimizer on query 17 but when
>>> running query 21 I don't actually see the optimizer being used. I used the
>>> publicly available tpc-h queries for hive and merged all the intermediate
>>> subqueries into one for Q21. In this query there is a correlation
>>> between multiple subqueries since they all get lineitem as input. But what
>>> I observe from the query plan and the execution of the query is that the
>>> subqueries are executed one by one and their results are materialized
>>> before the joins among them are executed. Is there any other parameter that
>>> I need to set to make this work?
>>>
>>> Thanks,
>>> Avrilia
>>>
>>
>>
>

Re: Question on correlation optimizer

Posted by Avrilia Floratou <av...@gmail.com>.
Hi Yin,

Thanks for the detailed explanation. I have one more question for the
correlation optimizer. When I ran explain in query 17 I get the plan for
stage 1 where the bulk of the time goes. I can understand what is happening
in the map phase but the reduce phase confuses me when the optimizer kicks
in, mainly because I see many join operators and I can't find any join
between the part and lineitem table.

I see that in the reduce phase there is a global aggregation on
avg(l_quantity) and then I can see 3 join operators. Why are there 3 joins?
Also what exactly are the demux and mux operators? Are they related to the
correlation optimizer or not?

The explain output is attached and the query is the following:

select
  sum(l_extendedprice) / 7.0 as avg_yearly
from
  (select l_quantity, l_extendedprice, t_avg_quantity from
   (select
       l_partkey as t_partkey, 0.2 * avg(l_quantity) as t_avg_quantity
     from lineitem
     group by l_partkey
     ) t join
     (select
        l_quantity, l_partkey, l_extendedprice
      from
        part p join lineitem l
        on
          p.p_partkey = l.l_partkey
          and p.p_brand = 'Brand#23'
          and p.p_container = 'MED BOX'
      ) l1 on l1.l_partkey = t.t_partkey
   ) a
where l_quantity < t_avg_quantity;


Thanks,
Avrilia



On Tue, Dec 10, 2013 at 7:38 PM, Yin Huai <hu...@gmail.com> wrote:

> Hi Avrilia,
>
> It is caused by distinct aggregations in TPC-H Q21. Because Hive adds
> those distinct columns in the key columns of ReduceSinkOperators and
> correlation optimizer only check exact same key columns right now, this
> query will not be optimized. The jira of this issue is
> https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct
> from those aggregation functions, you will see the optimized plan. Also,
> another kind of cases that the correlation optimizer does not optimize
> right now is that a table is used in multiple MR jobs but rows in this
> table are shuffled in different ways.
>
> Thanks,
>
> Yin
>
>
> On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <
> avrilia.floratou@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running TPCH query 21 on Hive. 0.12 and have enabled hive.optimize.correlation.
>> I could see the effect of the correlation optimizer on query 17 but when
>> running query 21 I don't actually see the optimizer being used. I used the
>> publicly available tpc-h queries for hive and merged all the intermediate
>> subqueries into one for Q21. In this query there is a correlation
>> between multiple subqueries since they all get lineitem as input. But what
>> I observe from the query plan and the execution of the query is that the
>> subqueries are executed one by one and their results are materialized
>> before the joins among them are executed. Is there any other parameter that
>> I need to set to make this work?
>>
>> Thanks,
>> Avrilia
>>
>
>

Re: Question on correlation optimizer

Posted by Yin Huai <hu...@gmail.com>.
Hi Avrilia,

It is caused by distinct aggregations in TPC-H Q21. Because Hive adds those
distinct columns in the key columns of ReduceSinkOperators and correlation
optimizer only check exact same key columns right now, this query will not
be optimized. The jira of this issue is
https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct
from those aggregation functions, you will see the optimized plan. Also,
another kind of cases that the correlation optimizer does not optimize
right now is that a table is used in multiple MR jobs but rows in this
table are shuffled in different ways.

Thanks,

Yin


On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <
avrilia.floratou@gmail.com> wrote:

> Hi,
>
> I'm running TPCH query 21 on Hive. 0.12 and have enabled hive.optimize.correlation.
> I could see the effect of the correlation optimizer on query 17 but when
> running query 21 I don't actually see the optimizer being used. I used the
> publicly available tpc-h queries for hive and merged all the intermediate
> subqueries into one for Q21. In this query there is a correlation between
> multiple subqueries since they all get lineitem as input. But what I
> observe from the query plan and the execution of the query is that the
> subqueries are executed one by one and their results are materialized
> before the joins among them are executed. Is there any other parameter that
> I need to set to make this work?
>
> Thanks,
> Avrilia
>