Posted to dev@hive.apache.org by Navis Ryu <na...@nexr.com> on 2013/12/11 03:13:00 UTC
Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
Review request for hive.
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
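To make the problem concrete, here is a minimal Java sketch of the size check described above. It is not the actual Hive code: the class MapJoinSizeCheck, the method pickBigTable, the threshold, and every size other than the roughly 45GB store_sales are hypothetical, and it assumes every alias has a known size. It only illustrates why summing over every alias in the size map, rather than over the aliases that take part in the join being resolved, rules out a map join.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapJoinSizeCheck {

    /**
     * Returns the join alias that can be streamed as the big table, or null if
     * none qualifies. A join alias qualifies when the sizes of all other counted
     * aliases sum to at most thresholdBytes. Assumes (for this sketch) that every
     * alias has an entry in aliasToSize and that joinAliases is a subset of
     * countedAliases.
     */
    static String pickBigTable(Map<String, Long> aliasToSize,
                               List<String> joinAliases,
                               Collection<String> countedAliases,
                               long thresholdBytes) {
        long total = 0L;
        for (String alias : countedAliases) {
            total += aliasToSize.get(alias);
        }
        String best = null;
        long bestSize = -1L;
        for (String alias : joinAliases) {
            long size = aliasToSize.get(alias);
            long sumOfOthers = total - size;
            // Keep the largest alias whose remaining inputs fit under the threshold.
            if (sumOfOthers <= thresholdBytes && size > bestSize) {
                best = alias;
                bestSize = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> aliasToSize = new LinkedHashMap<>();
        aliasToSize.put("store_sales", 45L * 1024 * 1024 * 1024); // about 45GB, from the description
        aliasToSize.put("date_dim", 10_000_000L);                 // hypothetical
        aliasToSize.put("item", 5_000_000L);                      // hypothetical
        aliasToSize.put("$INTNAME", 2_000_000_000L);              // hypothetical intermediate result
        List<String> joinAliases = Arrays.asList("$INTNAME", "item");
        long threshold = 25_000_000L;                             // hypothetical small-table threshold

        // Counting every alias in the map: store_sales pushes the sum over the threshold.
        System.out.println(pickBigTable(aliasToSize, joinAliases, aliasToSize.keySet(), threshold)); // null
        // Counting only the aliases of this join: $INTNAME can be streamed.
        System.out.println(pickBigTable(aliasToSize, joinAliases, joinAliases, threshold)); // $INTNAME
    }
}
```

With the whole map counted, no alias of the join involving item qualifies, so the join falls back to a reduce join; restricting the sum to the two participating aliases lets the intermediate result be streamed while item is loaded as the small table.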
Diffs
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 197a20f
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 2efa7c2
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java faf2f9b
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
> On Dec. 18, 2013, 2:02 p.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java, line 427
> > <https://reviews.apache.org/r/16172/diff/2/?file=399281#file399281line427>
> >
> > Seems it is not an error? If so, let's not put it in the ErrorMsg.
done.
> On Dec. 18, 2013, 2:02 p.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 262
> > <https://reviews.apache.org/r/16172/diff/2/?file=399284#file399284line262>
> >
> > Is this one necessary?
changed to debug message
- Navis
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30612
-----------------------------------------------------------
On Dec. 18, 2013, 5:04 a.m., Navis Ryu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16172/
> -----------------------------------------------------------
>
> (Updated Dec. 18, 2013, 5:04 a.m.)
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30612
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java
<https://reviews.apache.org/r/16172/#comment58622>
Seems it is not an error? If so, let's not put it in the ErrorMsg.
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58623>
Is this one necessary?
- Yin Huai
On Dec. 18, 2013, 5:04 a.m., Navis Ryu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16172/
> -----------------------------------------------------------
>
> (Updated Dec. 18, 2013, 5:04 a.m.)
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30977
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment59277>
For the example in the description of HIVE-5945, the same alias "$INTNAME" can actually refer to different intermediate tables. So here we will not record the correct size for a given "$INTNAME" alias.
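One way to picture this concern, as a minimal Java sketch rather than the actual Hive code: if every conditional task records the size of its own intermediate input in one shared map under the key "$INTNAME", a later write overwrites the earlier one. The class IntermediateSizes, its methods, and the sizes below are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class IntermediateSizes {
    static final String INTNAME = "$INTNAME";

    /** Shared-map variant: every task records its intermediate size in the same map. */
    static Map<String, Long> recordShared(Map<String, Long> shared, long intermediateSize) {
        shared.put(INTNAME, intermediateSize);
        return shared;
    }

    /** Per-task variant: each task gets its own copy of the base table sizes. */
    static Map<String, Long> recordPerTask(Map<String, Long> baseSizes, long intermediateSize) {
        Map<String, Long> own = new HashMap<>(baseSizes);
        own.put(INTNAME, intermediateSize);
        return own;
    }

    public static void main(String[] args) {
        Map<String, Long> base = new HashMap<>();
        base.put("item", 5_000_000L);   // hypothetical size
        base.put("store", 1_000_000L);  // hypothetical size

        Map<String, Long> shared = new HashMap<>(base);
        Map<String, Long> task1Shared = recordShared(shared, 2_000_000_000L);
        recordShared(shared, 300_000_000L);
        // Both tasks use one map, so task 1 now sees the size written by task 2.
        System.out.println(task1Shared.get(INTNAME)); // 300000000

        Map<String, Long> task1Own = recordPerTask(base, 2_000_000_000L);
        Map<String, Long> task2Own = recordPerTask(base, 300_000_000L);
        System.out.println(task1Own.get(INTNAME)); // 2000000000
        System.out.println(task2Own.get(INTNAME)); // 300000000
    }
}
```

Keeping a separate map per task costs one copy of the base sizes per conditional task, but each "$INTNAME" entry then refers unambiguously to that task's own intermediate table.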
- Yin Huai
On Dec. 30, 2013, 2:20 a.m., Navis Ryu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16172/
> -----------------------------------------------------------
>
> (Updated Dec. 30, 2013, 2:20 a.m.)
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Jan. 13, 2014, 8:46 a.m.)
Review request for hive.
Changes
-------
Fixed test failures
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java fccea89
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java efa9768
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Jan. 13, 2014, 2:33 a.m.)
Review request for hive.
Changes
-------
Addressed comments in JIRA
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java fccea89
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java efa9768
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Jan. 2, 2014, 2:32 a.m.)
Review request for hive.
Changes
-------
Do not share the sizes of intermediate tables among ConditionalTasks.
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java e7aa2c9
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Dec. 30, 2013, 2:20 a.m.)
Review request for hive.
Changes
-------
Added log & test case
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java daf4e4a
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Dec. 27, 2013, 3:13 a.m.)
Review request for hive.
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java daf4e4a
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 37ed275
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java f75e366
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/
-----------------------------------------------------------
(Updated Dec. 18, 2013, 5:04 a.m.)
Review request for hive.
Changes
-------
Added log messages for exception in resolver
Bugs: HIVE-5945
https://issues.apache.org/jira/browse/HIVE-5945
Repository: hive-git
Description
-------
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all input tables used in this query as well as the intermediate table generated by joining store_sales and date_dim. So, when we sum the sizes of all small tables, the size of store_sales (around 45GB in my test) is also counted.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java 45acc2b
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 9afc80b
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java 2efa7c2
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java faf2f9b
ql/src/test/org/apache/hadoop/hive/ql/plan/TestConditionalResolverCommonJoin.java 67203c9
ql/src/test/results/clientpositive/auto_join25.q.out 7427239
ql/src/test/results/clientpositive/infer_bucket_sort_convert_join.q.out 7d06739
ql/src/test/results/clientpositive/mapjoin_hook.q.out d60d16e
Diff: https://reviews.apache.org/r/16172/diff/
Testing
-------
Thanks,
Navis Ryu
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Navis Ryu <na...@nexr.com>.
> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 242
> > <https://reviews.apache.org/r/16172/diff/1/?file=396419#file396419line242>
> >
> > aliasToKnownSize can also contain tables which will not be used in the next job. For example, suppose we have a query like SELECT ... FROM a JOIN b ON (a.key1=b.key1) JOIN c ON (a.key2=c.key). Let's also assume that "a" is the big table. We can first use a Map-only job to do a JOIN b. Then, we should evaluate the size of table c and the result of a JOIN b. But here, aliasToKnownSize also has the size of table a, which will be counted in sumOfOthers.
No, it's not. Below are the log messages.
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliasToKnownSize : {b=11624, c=11624, a=11624}
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliases : [b, a]
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliasToKnownSize : {b=11624, c=11624, a=11624, $INTNAME=167608}
[ConditionalResolverCommonJoin/resolveMapJoinTask] aliases : [c, $INTNAME]
> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java, line 467
> > <https://reviews.apache.org/r/16172/diff/1/?file=396418#file396418line467>
> >
> > A question which is not very related to this issue. Have we documented that we prefer the rightmost alias as the big table? I also see we have such an assumption in JoinOperator.
Preferring the rightmost alias is first introduced in this patch (previously it was decided by the iteration order of aliasToWork), which changes the result of auto_join25.q. (This part of the change is not related to this issue, but I thought the previous behavior was too confusing to understand.)
> On Dec. 18, 2013, 1:47 a.m., Yin Huai wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java, line 255
> > <https://reviews.apache.org/r/16172/diff/1/?file=396419#file396419line255>
> >
> > Let's change it to log the exception instead of printing the stack trace.
ok.
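For reference, a minimal Java sketch of the difference between printing a stack trace and logging the exception. It uses java.util.logging only so that the example is self-contained; that choice, along with the class ResolverLogging and the method parseSizeOrDefault, is an assumption for illustration and is not taken from the patch.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class ResolverLogging {
    private static final Logger LOG = Logger.getLogger(ResolverLogging.class.getName());

    /**
     * Returns the parsed size, or -1 when parsing fails. The failure is handed
     * to the logger together with the exception instead of calling
     * e.printStackTrace().
     */
    static long parseSizeOrDefault(String raw) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Attaching the exception to the log record lets the configured
            // handlers decide where it goes and at which level it is shown.
            LOG.log(Level.WARNING, "Failed to parse size: " + raw, e);
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSizeOrDefault("11624")); // 11624
        System.out.println(parseSizeOrDefault("oops"));  // -1
    }
}
```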
- Navis
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30594
-----------------------------------------------------------
On Dec. 11, 2013, 2:12 a.m., Navis Ryu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16172/
> -----------------------------------------------------------
>
> (Updated Dec. 11, 2013, 2:12 a.m.)
Re: Review Request 16172:
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those
tables which are not used in the child of this conditional task.
Posted by Yin Huai <hu...@cse.ohio-state.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16172/#review30594
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
<https://reviews.apache.org/r/16172/#comment58604>
A question not closely related to this issue: have we documented that we prefer the rightmost alias as the big table? I see the same assumption in JoinOperator.
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58601>
aliasToKnownSize can also contain tables that will not be used in the next job. For example, take a query like SELECT ... FROM a JOIN b ON (a.key1=b.key1) JOIN c ON (a.key2=c.key2), and assume that "a" is the big table. We can first use a Map-only job to do a JOIN b. Then we should evaluate only the size of table c and the size of the result of a JOIN b. But here, aliasToKnownSize also has the size of table a, which will be counted in sumOfOthers.
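To make the concern concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not the actual ConditionalResolverCommonJoin code: the class name, the method signature, the participants set, and the "a_join_b" alias for the intermediate result are all hypothetical and only illustrate summing the sizes of aliases that really take part in the current join.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SmallTableSizeSketch {

    /**
     * Sums the known sizes of the aliases that take part in the current join,
     * skipping the candidate big table. Aliases that have a known size but are
     * not participants (for example a table already consumed by an earlier job)
     * are ignored. Returns -1 if a participating alias has no known size.
     */
    static long sumOfOthers(Map<String, Long> aliasToKnownSize,
                            Set<String> participants,
                            String bigTableAlias) {
        long sum = 0;
        for (String alias : participants) {
            if (alias.equals(bigTableAlias)) {
                continue; // the big table is streamed, not loaded into memory
            }
            Long size = aliasToKnownSize.get(alias);
            if (size == null) {
                return -1; // unknown size: the caller cannot decide on a map join
            }
            sum += size;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical sizes in bytes; "a" was already consumed by the first job.
        Map<String, Long> known = new HashMap<>();
        known.put("a", 45_000_000_000L);
        known.put("b", 10_000_000L);
        known.put("c", 20_000_000L);
        known.put("a_join_b", 5_000_000_000L);

        // Only the intermediate result and table c take part in the second join.
        Set<String> participants = new HashSet<>(Arrays.asList("a_join_b", "c"));
        long threshold = 25_000_000L; // hypothetical small-table threshold

        long sum = sumOfOthers(known, participants, "a_join_b");
        System.out.println("sumOfOthers=" + sum
                + ", mapJoinPossible=" + (sum >= 0 && sum <= threshold));
    }
}
```

With this kind of filtering, the size of "a" no longer inflates the sum for the second join, so the decision depends only on table c.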
ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java
<https://reviews.apache.org/r/16172/#comment58602>
Let's change it to log the exception instead of printing the stack trace.
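As a rough illustration of the pattern, the sketch below logs the exception together with a message instead of calling printStackTrace(). It uses java.util.logging only to stay self-contained; that choice, the class name, and the parseSizeOrDefault helper are assumptions for the example, and the class's existing logger would take their place in the real code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogInsteadOfPrint {

    private static final Logger LOG =
            Logger.getLogger(LogInsteadOfPrint.class.getName());

    /** Parses a size in bytes; on failure, logs the exception and returns the fallback. */
    static long parseSizeOrDefault(String raw, long fallback) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Passing the exception to the logger keeps the stack trace in the
            // configured log output rather than writing it straight to stderr.
            LOG.log(Level.WARNING, "Could not parse size '" + raw + "', using fallback", e);
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSizeOrDefault("1024", -1L));
        System.out.println(parseSizeOrDefault("not-a-number", -1L));
    }
}
```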
- Yin Huai