Posted to dev@hive.apache.org by Bing Li via Review Board <no...@reviews.apache.org> on 2017/07/04 08:48:56 UTC
Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/
-----------------------------------------------------------
Review request for hive.
Repository: hive-git
Description
-------
HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
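For context, hive.spark.use.groupby.shuffle selects which shuffle Hive on Spark uses for GROUP BY: Spark's groupByKey shuffle when true, or a repartition-and-sort shuffle (see RepartitionShuffler in the diffs) when false; the patch makes EXPLAIN show which one was chosen. A rough, hypothetical sketch of the two strategies at the Spark RDD level follows -- this is not Hive's actual shuffler code, and the class and method names are illustrative only:

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    class ShuffleChoiceSketch {
      // When hive.spark.use.groupby.shuffle=true: rely on Spark's groupByKey
      // shuffle to gather all values of a key into one group.
      static JavaPairRDD<Integer, Iterable<String>> groupByShuffle(
          JavaPairRDD<Integer, String> rdd) {
        return rdd.groupByKey();
      }

      // When hive.spark.use.groupby.shuffle=false: repartition by key and sort
      // within each partition, so the reducer can stream keys in order instead
      // of materializing whole groups at once.
      static JavaPairRDD<Integer, String> repartitionSortShuffle(
          JavaPairRDD<Integer, String> rdd, int numPartitions) {
        return rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions));
      }
    }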
Diffs
-----
itests/src/test/resources/testconfiguration.properties 19ff316
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION
Diff: https://reviews.apache.org/r/60632/diff/1/
Testing
-------
set hive.spark.use.groupby.shuffle=true;
explain select key, count(val) from t1 group by key;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625202742_58335619-7107-4026-9911-43d2ec449088:2
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

set hive.spark.use.groupby.shuffle=false;
explain select key, count(val) from t1 group by key;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625203122_3afe01dd-41cc-477e-9098-ddd58b37ad4e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Thanks,
Bing Li
Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
Posted by Rui Li <ru...@intel.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/#review179554
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Lines 68 (patched)
<https://reviews.apache.org/r/60632/#comment254315>
Please avoid * import
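For illustration only (the actual imports in the patch are not shown in this thread), the suggestion is to replace a wildcard import with explicit imports of the classes that are used, e.g.:

    // Instead of a wildcard import such as:
    //   import org.apache.hadoop.hive.ql.plan.*;
    // import only the specific classes that are referenced, for example:
    import org.apache.hadoop.hive.ql.plan.SparkEdgeProperty;
    import org.apache.hadoop.hive.ql.plan.SparkWork;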
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Lines 432 (patched)
<https://reviews.apache.org/r/60632/#comment254316>
it's preferable to use HiveConf::getBoolVar
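As a small sketch of this suggestion (the surrounding patch code is not shown here, and the ConfVars constant name is an assumption), HiveConf::getBoolVar returns the typed value with its default applied, instead of reading and parsing the raw property string:

    import org.apache.hadoop.hive.conf.HiveConf;

    class ConfReadSketch {
      static boolean useSparkGroupByShuffle(HiveConf conf) {
        // Rather than conf.get("hive.spark.use.groupby.shuffle") followed by
        // Boolean.parseBoolean(...), read the typed value directly.
        // SPARK_USE_GROUPBY_SHUFFLE is assumed to be the ConfVars constant.
        return conf.getBoolVar(HiveConf.ConfVars.SPARK_USE_GROUPBY_SHUFFLE);
      }
    }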
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Line 438 (original), 441 (patched)
<https://reviews.apache.org/r/60632/#comment254317>
nit: extra space before !useSparkGroupBy
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Line 471 (original), 477 (patched)
<https://reviews.apache.org/r/60632/#comment254319>
let's delete this comment
- Rui Li
Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
Posted by Rui Li <ru...@intel.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/#review179595
-----------------------------------------------------------
Ship it!
Ship It!
- Rui Li
Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
Posted by Bing Li via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/
-----------------------------------------------------------
(Updated July 5, 2017, 4:07 a.m.)
Review request for hive.
Changes
-------
Update GenSparkUtils.java based on Rui's comments
Repository: hive-git
Description
-------
HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
Diffs (updated)
-----
itests/src/test/resources/testconfiguration.properties 19ff316
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION
Diff: https://reviews.apache.org/r/60632/diff/2/
Changes: https://reviews.apache.org/r/60632/diff/1-2/
Testing
-------
set hive.spark.use.groupby.shuffle=true;
explain select key, count(val) from t1 group by key;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625202742_58335619-7107-4026-9911-43d2ec449088:2
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

set hive.spark.use.groupby.shuffle=false;
explain select key, count(val) from t1 group by key;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625203122_3afe01dd-41cc-477e-9098-ddd58b37ad4e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Thanks,
Bing Li