You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Szehon Ho <sz...@cloudera.com> on 2014/12/02 02:34:23 UTC

Re: Review Request 28500: HIVE-8943 : Fix memory limit check for combine nested mapjoins [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28500/
-----------------------------------------------------------

(Updated Dec. 2, 2014, 1:34 a.m.)


Review request for hive, Chao Sun, Suhas Satish, and Xuefu Zhang.


Changes
-------

Fix algorithm and cleanup after discussion with Xuefu.  Original code was too aggressively incorporating connected mapjoins into its size calculation, new code only looks at the big table's connected mapjoins.


Bugs: HIVE-8943
    https://issues.apache.org/jira/browse/HIVE-8943


Repository: hive-git


Description
-------

SparkMapJoinOptimizer by default combines nested mapjoins into one work due to removal of RS for big-table. So we need to enhance the mapjoin check to calculate if all the MapJoins in that work (spark-stage) will fit into the memory, otherwise it might overwhelm memory for that particular spark executor.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 819eef1 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java 0c339a5 
  ql/src/test/queries/clientpositive/auto_join_stats.q PRE-CREATION 
  ql/src/test/queries/clientpositive/auto_join_stats2.q PRE-CREATION 
  ql/src/test/results/clientpositive/auto_join_stats.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/auto_join_stats2.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/spark/auto_join_stats.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/spark/auto_join_stats2.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/28500/diff/


Testing
-------

Added two unit tests:

1.  auto_join_stats, which sets a memory limit and checks that algorithm does not put more than 1 mapjoin in one BaseWork
2.  auto_join_stats2, which is the same query without memory limit, and check that algorithm puts all mapjoin in one BaseWork because it can.


Thanks,

Szehon Ho