Posted to dev@hive.apache.org by Navis Ryu <na...@nexr.com> on 2013/12/11 03:13:00 UTC

Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
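
To make the over-counting concrete, here is a minimal Java sketch, not the actual Hive code. The class name, method names, and every size except the roughly 45GB store_sales figure are assumptions chosen for illustration. It contrasts summing over every known alias with summing only over the aliases that take part in the join with item.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: class, method names, and most sizes are hypothetical, not Hive code.
final class SmallTableSumSketch {

  // Problematic variant: sums every known alias except the big table,
  // including tables that do not take part in this join.
  static long sumOverAllKnown(Map<String, Long> aliasToKnownSize, String bigAlias) {
    long sum = 0L;
    for (Map.Entry<String, Long> e : aliasToKnownSize.entrySet()) {
      if (!e.getKey().equals(bigAlias)) {
        sum += e.getValue();
      }
    }
    return sum;
  }

  // Restricted variant: sums only the aliases that participate in this join.
  static long sumOverParticipants(Map<String, Long> aliasToKnownSize,
                                  List<String> participants, String bigAlias) {
    long sum = 0L;
    for (String alias : participants) {
      if (alias.equals(bigAlias)) {
        continue;
      }
      Long size = aliasToKnownSize.get(alias);
      if (size == null) {
        return Long.MAX_VALUE; // unknown size: treat as too large for a map join
      }
      sum += size;
    }
    return sum;
  }

  static Map<String, Long> exampleSizes() {
    final long mb = 1024L * 1024L;
    final long gb = 1024L * mb;
    Map<String, Long> sizes = new LinkedHashMap<>();
    sizes.put("store_sales", 45L * gb); // "around 45GB" per the description
    sizes.put("date_dim", 10L * mb);    // assumed
    sizes.put("item", 50L * mb);        // assumed
    sizes.put("$INTNAME", 8L * gb);     // assumed size of store_sales JOIN date_dim
    return sizes;
  }

  public static void main(String[] args) {
    Map<String, Long> sizes = exampleSizes();
    List<String> participants = Arrays.asList("$INTNAME", "item");
    System.out.println("all known:    " + sumOverAllKnown(sizes, "$INTNAME"));
    System.out.println("participants: " + sumOverParticipants(sizes, participants, "$INTNAME"));
  }
}
```

With these assumed numbers, the first sum is dominated by store_sales and would exceed any reasonable small-table threshold, while the second counts only item.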


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 197a20f 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 2efa7c2 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java faf2f9b 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.

> On Dec. 18, 2013, 2:02 p.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java, line 427
> > <https://reviews.apache.org/r/16172/diff/2/?file=399281#file399281line427>
> >
> >     Seems it is not an error? If so, let's not put it in the ErrorMsg.

done.


> On Dec. 18, 2013, 2:02 p.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 262
> > <https://reviews.apache.org/r/16172/diff/2/?file=399284#file399284line262>
> >
> >     Is this one necessary?

changed to debug message


- Navis


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30612
-----------------------------------------------------------




Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30612
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java
<https://reviews.apache.org/r/16172/#comment58622>

    Seems it is not an error? If so, let's not put it in the ErrorMsg.



ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58623>

    Is this one necessary?


- Yin Huai




Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30977
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment59277>

    For the example in the description of HIVE-5945, the same alias "$INTNAME" can actually refer to different intermediate tables. So here we will not update the correct size for the alias "$INTNAME".
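
    A minimal Java sketch of the collision, not the actual Hive code: the class and method names and the sizes are hypothetical, and it assumes a plain overwrite when the same alias is recorded twice. With one map shared by all conditional tasks, "$INTNAME" can hold only one size; with one map per task, each task keeps the size of its own intermediate table.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: names and sizes are hypothetical, not Hive code.
final class IntermediateAliasSketch {

  // One map shared by all conditional tasks: "$INTNAME" can hold only one size.
  static long sharedLookup(long firstIntermediateSize, long secondIntermediateSize) {
    Map<String, Long> shared = new HashMap<>();
    shared.put("$INTNAME", firstIntermediateSize);  // recorded for the first join's output
    shared.put("$INTNAME", secondIntermediateSize); // overwrites it for the second join's output
    return shared.get("$INTNAME");                  // every task now reads the latest size
  }

  // One map per conditional task: each task keeps the size of its own intermediate table.
  static long[] perTaskLookup(long firstIntermediateSize, long secondIntermediateSize) {
    Map<String, Long> taskA = new HashMap<>();
    Map<String, Long> taskB = new HashMap<>();
    taskA.put("$INTNAME", firstIntermediateSize);
    taskB.put("$INTNAME", secondIntermediateSize);
    return new long[] {taskA.get("$INTNAME"), taskB.get("$INTNAME")};
  }

  public static void main(String[] args) {
    System.out.println("shared map:   " + sharedLookup(100L, 200L));
    long[] perTask = perTaskLookup(100L, 200L);
    System.out.println("per-task map: " + perTask[0] + ", " + perTask[1]);
  }
}
```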


- Yin Huai




Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Jan. 13, 2014, 8:46 a.m.)


Review request for hive.


Changes
-------

Fixed test failures


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java fccea89 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java efa9768 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Jan. 13, 2014, 2:33 a.m.)


Review request for hive.


Changes
-------

Addressed comments in JIRA


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java fccea89 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java efa9768 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Jan. 2, 2014, 2:32 a.m.)


Review request for hive.


Changes
-------

Do not share the sizes of intermediate tables among ConditionalTasks.


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java e7aa2c9 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Dec. 30, 2013, 2:20 a.m.)


Review request for hive.


Changes
-------

Added log & test case


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java daf4e4a 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Dec. 27, 2013, 3:13 a.m.)


Review request for hive.


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java daf4e4a 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------

(Updated Dec. 18, 2013, 5:04 a.m.)


Review request for hive.


Changes
-------

Added log messages for exception in resolver


Bugs: HIVE-5945
    https://issues.apache.org/jira/browse/HIVE-5945


Repository: hive-git


Description
-------

Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query, plus the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java 45acc2b 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 9afc80b 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 2efa7c2 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java faf2f9b 
  ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9 
  ql/src/test/results/clientpositive/auto_join25.q.out 7427239 
  ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739 
  ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e 

Diff: https://reviews.apache.org/r/16172/diff/


Testing
-------


Thanks,

Navis Ryu


Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Navis Ryu <na...@nexr.com>.

> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 242
> > <https://reviews.apache.org/r/16172/diff/1/?file=396419#file396419line242>
> >
> >     aliasToKnownSize can also contain tables that will not be used in the next job. For example, take a query like SELECT ... FROM a JOIN b ON (a.key1=b.key1) JOIN c ON (a.key2=c.key), and assume that "a" is the big table. We can first use a Map-only job to compute a JOIN b. Then we should evaluate the size of table c and the result of a JOIN b. But here, aliasToKnownSize also has the size of table a, which will be counted in sumOfOthers.

No, it's not. Below are the log messages.

[ConditionalResolverCommonJoin/resolveMapJoinTask] aliasToKnownSize : {b=11624, c=11624, a=11624}
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliases   : [b, a]

[ConditionalResolverCommonJoin/resolveMapJoinTask] aliasToKnownSize : {b=11624, c=11624, a=11624, $INTNAME=167608}
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliases   : [c, $INTNAME]
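
Reading the log above: in the second stage only c and $INTNAME participate, even though aliasToKnownSize still lists a and b. The following is a minimal Java sketch, not the actual Hive code, of how a resolver could pick a big table using only the participating aliases. The class and method names, the threshold values, and the tie-breaking rule are assumptions; the sizes are the ones shown in the log.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names, thresholds, and tie-breaking are hypothetical, not Hive code.
final class BigTableChoiceSketch {

  // Returns the participating alias to stream as the big table, or null if no
  // candidate leaves the remaining participants at or under the threshold.
  static String chooseBigTable(Map<String, Long> aliasToKnownSize,
                               List<String> participants, long threshold) {
    String best = null;
    long bestSize = -1L;
    for (String candidate : participants) {
      Long candidateSize = aliasToKnownSize.get(candidate);
      if (candidateSize == null) {
        continue; // size unknown: skip this candidate
      }
      long sumOfOthers = 0L;
      boolean allKnown = true;
      for (String other : participants) {
        if (other.equals(candidate)) {
          continue;
        }
        Long size = aliasToKnownSize.get(other);
        if (size == null) {
          allKnown = false;
          break;
        }
        sumOfOthers += size;
      }
      // Prefer the largest big table; ties resolve to the later alias (an assumption).
      if (allKnown && sumOfOthers <= threshold && candidateSize >= bestSize) {
        best = candidate;
        bestSize = candidateSize;
      }
    }
    return best;
  }

  // Sizes copied from the second pair of log lines.
  static Map<String, Long> sizesFromLog() {
    Map<String, Long> sizes = new LinkedHashMap<>();
    sizes.put("b", 11624L);
    sizes.put("c", 11624L);
    sizes.put("a", 11624L);
    sizes.put("$INTNAME", 167608L);
    return sizes;
  }

  public static void main(String[] args) {
    List<String> participants = Arrays.asList("c", "$INTNAME");
    // 100000 is an arbitrary threshold chosen for the example.
    System.out.println(chooseBigTable(sizesFromLog(), participants, 100000L));
  }
}
```

Because a and b are not in the participant list, their sizes never enter sumOfOthers in this sketch.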


> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java, line 467
> > <https://reviews.apache.org/r/16172/diff/1/?file=396418#file396418line467>
> >
> >     A question that is not closely related to this issue: have we documented that we prefer the rightmost alias as the big table? I also see we have such an assumption in JoinOperator.

Preferring the rightmost alias is first introduced in this patch (previously it was decided by the iteration order of aliasToWork), which changes the result of auto_join25.q. (This part of the change is not related to this issue, but I thought the old behavior was too confusing to understand.)


> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 255
> > <https://reviews.apache.org/r/16172/diff/1/?file=396419#file396419line255>
> >
> >     Let's change it to log the exception instead of printing the stack trace.

ok.


- Navis


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30594
-----------------------------------------------------------




Re: Review Request 16172: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30594
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
<https://reviews.apache.org/r/16172/#comment58604>

    A question that is not closely related to this issue: have we documented that we prefer the rightmost alias as the big table? I also see that we make the same assumption in JoinOperator.



ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58601>

    aliasToKnownSize can also contain tables that will not be used in the next job. For example, take a query like SELECT ... FROM a JOIN b ON (a.key1=b.key1) JOIN c ON (a.key2=c.key2), and assume that "a" is the big table. We can first use a Map-only job to compute a JOIN b. Then we should evaluate only the size of table c and the size of the result of a JOIN b. But here, aliasToKnownSize also holds the size of table a, which will be counted in sumOfOthers.
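
To make the point concrete, here is a minimal, self-contained sketch of the two ways of summing. It is not the code in ConditionalResolverCommonJoin: the class name and both helper methods are hypothetical, and the sizes are simply the values from the debug output earlier in this thread. It only illustrates that the sum should be restricted to the aliases that take part in the join being resolved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hive's actual implementation.
final class SmallTableSizeSketch {

    // Sums the sizes of every alias in the map except the big table.
    // This is the problematic behaviour: aliases that do not take part
    // in the current join are counted as well.
    static long sumOfOthersUnfiltered(Map<String, Long> aliasToKnownSize,
                                      String bigTableAlias) {
        long sum = 0;
        for (Map.Entry<String, Long> e : aliasToKnownSize.entrySet()) {
            if (!e.getKey().equals(bigTableAlias)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Sums only the aliases that participate in the current join.
    // Returns -1 if the size of a participant is unknown.
    static long sumOfOthers(Map<String, Long> aliasToKnownSize,
                            List<String> participants,
                            String bigTableAlias) {
        long sum = 0;
        for (String alias : participants) {
            if (alias.equals(bigTableAlias)) {
                continue;
            }
            Long size = aliasToKnownSize.get(alias);
            if (size == null) {
                return -1;
            }
            sum += size;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("b", 11624L);
        sizes.put("c", 11624L);
        sizes.put("a", 11624L);
        sizes.put("$INTNAME", 167608L);

        // Second join: table c against the intermediate result of a JOIN b.
        List<String> participants = Arrays.asList("c", "$INTNAME");

        System.out.println(sumOfOthersUnfiltered(sizes, "$INTNAME")); // 34872
        System.out.println(sumOfOthers(sizes, participants, "$INTNAME")); // 11624
    }
}
```

With these small test tables the difference is harmless, but when "a" is a 45GB fact table the unfiltered sum will exceed any reasonable small-table threshold and the map join is rejected.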



ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58602>

    Let's change it to log the exception instead of printing the stack trace.
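
As a rough sketch of what is meant, the pattern looks like the following. The class and method are hypothetical, and java.util.logging is used only to keep the example self-contained; the real change would go through whatever logger ConditionalResolverCommonJoin already has.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; not the code under review.
final class LogInsteadOfPrintSketch {

    private static final Logger LOG =
        Logger.getLogger(LogInsteadOfPrintSketch.class.getName());

    // Parses a size, falling back to a default when the input is malformed.
    static long parseSizeOrDefault(String raw, long fallback) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Instead of e.printStackTrace(): pass the exception to the
            // logger so the message and stack trace reach the configured log.
            LOG.log(Level.WARNING, "Could not parse size '" + raw + "'", e);
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSizeOrDefault("167608", -1L));
        System.out.println(parseSizeOrDefault("not-a-number", -1L));
    }
}
```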


- Yin Huai

