You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Szehon Ho <sz...@cloudera.com> on 2014/12/02 02:34:23 UTC

Re: Review Request 28500: HIVE-8943 : Fix memory limit check for combine nested mapjoins [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28500/
-----------------------------------------------------------

(Updated Dec. 2, 2014, 1:34 a.m.)

Review request for hive, Chao Sun, Suhas Satish, and Xuefu Zhang.

Changes
-------

Fix algorithm and cleanup after discussion with Xuefu. Original code was too aggressively incorporating connected mapjoins into its size calculation, new code only looks at the big table's connected mapjoins.

Bugs: HIVE-8943
https://issues.apache.org/jira/browse/HIVE-8943

Repository: hive-git

Description
-------

SparkMapJoinOptimizer by default combines nested mapjoins into one work due to removal of RS for big-table. So we need to enhance the mapjoin check to calculate if all the MapJoins in that work (spark-stage) will fit into the memory, otherwise it might overwhelm memory for that particular spark executor.

Diffs (updated)
-----

ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 819eef1
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java 0c339a5
ql/src/test/queries/clientpositive/auto_join_stats.q PRE-CREATION
ql/src/test/queries/clientpositive/auto_join_stats2.q PRE-CREATION
ql/src/test/results/clientpositive/auto_join_stats.q.out PRE-CREATION
ql/src/test/results/clientpositive/auto_join_stats2.q.out PRE-CREATION
ql/src/test/results/clientpositive/spark/auto_join_stats.q.out PRE-CREATION
ql/src/test/results/clientpositive/spark/auto_join_stats2.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/28500/diff/

Testing
-------

Added two unit tests:

1. auto_join_stats, which sets a memory limit and checks that algorithm does not put more than 1 mapjoin in one BaseWork
2. auto_join_stats2, which is the same query without memory limit, and check that algorithm puts all mapjoin in one BaseWork because it can.

Thanks,

Szehon Ho