You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Xuefu Zhang <xz...@cloudera.com> on 2017/01/20 18:07:36 UTC

Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/
-----------------------------------------------------------

Review request for hive, Chao Sun and Rui Li.


Bugs: HIVE-15580
    https://issues.apache.org/jira/browse/HIVE-15580


Repository: hive-git


Description
-------

See JIRA description.


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
  ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
  ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
  ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
  ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
  ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
  ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 

Diff: https://reviews.apache.org/r/55776/diff/


Testing
-------

All test passed


Thanks,

Xuefu Zhang


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Xuefu Zhang <xz...@cloudera.com>.

On Jan. 20, 2017, 6:26 p.m., Xuefu Zhang wrote:
> > I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out
> 
> Xuefu Zhang wrote:
>     No. I verified that MR's plan and result don't change at all. This is because the keys are the same for group by and order by.
> 
> Chao Sun wrote:
>     Hmm.. I'm surprised. We changed the input qfile and how come the result is not changed?
> 
> Xuefu Zhang wrote:
>     MR group by is also sorted, so the order by is something not needed so eliminated during optimization. So you, the test didn't fail in the Jenkins result.
> 
> Chao Sun wrote:
>     No, I mean the input query is changed, so the output should also be changed. If you look at the MR output qfile, it still has
>     ```
>     PREHOOK: query: explain
>     select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)a
>     union all
>     select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)b
>     PREHOOK: type: QUERY
>     POSTHOOK: query: explain
>     select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)a
>     union all
>     select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)b
>     POSTHOOK: type: QUERY
>     ```
>     which suggest the test is not triggered on the MR path. Anyway, maybe the test is turned off for MR.

Yeah, Got it. Maybe. I don't believe that's a blocker. We can file a followup JIRA for this. Do you have any other comments?


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Xuefu Zhang <xz...@cloudera.com>.

On Jan. 20, 2017, 6:26 p.m., Xuefu Zhang wrote:
> > I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out

No. I verified that MR's plan and result don't change at all. This is because the keys are the same for group by and order by.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Xuefu Zhang <xz...@cloudera.com>.

On Jan. 20, 2017, 6:26 p.m., Xuefu Zhang wrote:
> > I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out
> 
> Xuefu Zhang wrote:
>     No. I verified that MR's plan and result don't change at all. This is because the keys are the same for group by and order by.
> 
> Chao Sun wrote:
>     Hmm.. I'm surprised. We changed the input qfile and how come the result is not changed?

MR group by is also sorted, so the order by is something not needed so eliminated during optimization. So you, the test didn't fail in the Jenkins result.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Chao Sun <ch...@cloudera.com>.

On Jan. 20, 2017, 6:26 p.m., Xuefu Zhang wrote:
> > I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out
> 
> Xuefu Zhang wrote:
>     No. I verified that MR's plan and result don't change at all. This is because the keys are the same for group by and order by.

Hmm.. I'm surprised. We changed the input qfile and how come the result is not changed?


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Xuefu Zhang <xz...@cloudera.com>.

> On Jan. 20, 2017, 6:26 p.m., Chao Sun wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java, line 31
> > <https://reviews.apache.org/r/55776/diff/1/?file=1610799#file1610799line31>
> >
> >     Is it possible that `numPartitions` equals to 0?

No. If partition number is zero, that means no partition. Then we will not even get here. Nevertheless, if it's set to 0, we take 1 instead.


> On Jan. 20, 2017, 6:26 p.m., Chao Sun wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java, line 34
> > <https://reviews.apache.org/r/55776/diff/1/?file=1610799#file1610799line34>
> >
> >     I wonder whether this also has some extra cost comparing to the original `groupByKey`, since it needs to sort all records by key in a single partition, right?

Well, we don't know which one performs better yet. repartitionAndSortWithinPartitions() brings extra softing, but it eliminates grouping in groupByKey(). Also, groupByKey() has unbounded memory usage, which is the problem we are tryig to solve. As described in the JIRA description. We will follow up with performance testing, and may provide an option to use either groupBy() which might be more performing but w/ unlimitted memory usage or the new way where memory usage is bounded.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Chao Sun <ch...@cloudera.com>.

On Jan. 20, 2017, 6:26 p.m., Xuefu Zhang wrote:
> > I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out
> 
> Xuefu Zhang wrote:
>     No. I verified that MR's plan and result don't change at all. This is because the keys are the same for group by and order by.
> 
> Chao Sun wrote:
>     Hmm.. I'm surprised. We changed the input qfile and how come the result is not changed?
> 
> Xuefu Zhang wrote:
>     MR group by is also sorted, so the order by is something not needed so eliminated during optimization. So you, the test didn't fail in the Jenkins result.

No, I mean the input query is changed, so the output should also be changed. If you look at the MR output qfile, it still has
```
PREHOOK: query: explain
select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)a
union all
select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)b
PREHOOK: type: QUERY
POSTHOOK: query: explain
select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)a
union all
select * from (select s1.key as k, s2.value as v from src s1 join src s2 on (s1.key = s2.key) limit 10)b
POSTHOOK: type: QUERY
```
which suggest the test is not triggered on the MR path. Anyway, maybe the test is turned off for MR.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Chao Sun <ch...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java (line 31)
<https://reviews.apache.org/r/55776/#comment233782>

    Is it possible that `numPartitions` equals to 0?



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java (line 34)
<https://reviews.apache.org/r/55776/#comment233785>

    I wonder whether this also has some extra cost comparing to the original `groupByKey`, since it needs to sort all records by key in a single partition, right?


I think we also need to update ql/src/test/results/clientpositive/union_top_level.q.out

- Chao Sun


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>


Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Posted by Chao Sun <ch...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162487
-----------------------------------------------------------


Ship it!




+1. Patch looks good to me, even though a little concerned by the possible performance downgrade. Please file follow up JIRAs for the TODO.
It also may be good to have Rui had a look before this is committed. Thanks

- Chao Sun


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f 
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83 
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e 
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4 
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3 
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f 
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All test passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>