Posted to dev@hive.apache.org by Chao Sun <ch...@cloudera.com> on 2014/11/05 18:51:26 UTC

Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-8622
    https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
-------

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
-------


Thanks,

Chao Sun

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60530
-----------------------------------------------------------

Ship it!



- Xuefu Zhang


On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2014, 10:39 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
-----------------------------------------------------------

(Updated Nov. 9, 2014, 10:39 p.m.)


Review request for hive.


Changes
-------

Adopted Xuefu's pseudo code. Now, for each BaseWork with an MJ operator, we use a separate SparkWork for its parent BaseWorks that contain a HashTableSinkOperator.
I manually tested this patch with several qfiles containing map-join queries, and the results look correct.


Bugs: HIVE-8622
    https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
-------

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
-------


Thanks,

Chao Sun

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.

w1 and w2 are fine; they will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear.

To give more detail: for each work that has a map join op, we need to create a SparkWork to handle its small tables. So both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created.
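
As a minimal sketch of that dependency shape (the class and all names below are hypothetical stand-ins, not Hive's actual SparkWork API): each work with a map join gets its own small-table SparkWork, and the SparkWork that holds w1 and w2 ends up with two parents rather than one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: SparkWorkNode is a hypothetical stand-in, not Hive's SparkWork.
final class SparkWorkGraphSketch {
    static final class SparkWorkNode {
        final String name;
        final List<String> works;  // names of the BaseWorks held by this SparkWork
        final List<SparkWorkNode> parents = new ArrayList<>();

        SparkWorkNode(String name, List<String> works) {
            this.name = name;
            this.works = works;
        }
    }

    // Builds one small-table SparkWork per map-join work and makes the SparkWork
    // that holds all the map-join works depend on each of them.
    static SparkWorkNode plan(Map<String, List<String>> smallTableWorksByMapJoinWork) {
        SparkWorkNode main = new SparkWorkNode(
            "SW_main", new ArrayList<>(smallTableWorksByMapJoinWork.keySet()));
        for (Map.Entry<String, List<String>> e : smallTableWorksByMapJoinWork.entrySet()) {
            main.parents.add(new SparkWorkNode("SW_small_" + e.getKey(), e.getValue()));
        }
        return main;
    }

    // Assumed example input: w1 and w2 each read two small tables through HTS works.
    static Map<String, List<String>> example() {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("w1", Arrays.asList("hts_a", "hts_b"));
        m.put("w2", Arrays.asList("hts_c", "hts_d"));
        return m;
    }

    public static void main(String[] args) {
        SparkWorkNode main = plan(example());
        for (SparkWorkNode p : main.parents) {
            System.out.println(p.name + " " + p.works + " -> " + main.name + " " + main.works);
        }
    }
}
```

With two parents feeding one child, the SparkWork dependencies form a small graph, not a single chain.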


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine; they will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear.
>     
>     To give more detail: for each work that has a map join op, we need to create a SparkWork to handle its small tables. So both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created.
> 
> Chao Sun wrote:
>     I'm not getting it: why "This dependency is not linear"? Can you give a counterexample?
>     Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:
>     
>          HTS_1   HTS_2     HTS_3    HTS_4
>            \      /           \     /
>             \    /             \   /
>               MJ_1              MJ_2
>                |                 |
>                |                 |
>               HTS_5            HTS_6
>                   \            /
>                    \          /
>                     \        /
>                      \      /
>                       \    /
>                         MJ_3
>                         
>     Then what I'm doing is putting HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1.
>     Then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork, SW_2, and MJ_3 in another SparkWork, SW_3.
>     The dependency is SW_1 -> SW_2 -> SW_3.
> 
> Xuefu Zhang wrote:
>     I don't think we should put (HTS1, HTS2) and (HTS3, HTS4) in the same SparkWork. They belong to different MJs that handle different sets of small tables. Combining them would complicate both HashTableSinkOperator and HashTableLoader.
>     
>     In terms of dependencies, MJ1 doesn't need to wait for HTS3/HTS4 in order to run, and vice versa.
>     
>     Please refer to the pseudo code posted in the JIRA for implementation ideas. Thanks.

Resolved via an offline chat.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2014, 10:39 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine; they will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear.
>     
>     To give more detail: for each work that has a map join op, we need to create a SparkWork to handle its small tables. So both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created.
> 
> Chao Sun wrote:
>     I'm not getting it: why "This dependency is not linear"? Can you give a counterexample?
>     Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:
>     
>          HTS_1   HTS_2     HTS_3    HTS_4
>            \      /           \     /
>             \    /             \   /
>               MJ_1              MJ_2
>                |                 |
>                |                 |
>               HTS_5            HTS_6
>                   \            /
>                    \          /
>                     \        /
>                      \      /
>                       \    /
>                         MJ_3
>                         
>     Then what I'm doing is putting HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1.
>     Then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork, SW_2, and MJ_3 in another SparkWork, SW_3.
>     The dependency is SW_1 -> SW_2 -> SW_3.

I don't think we should put (HTS1, HTS2) and (HTS3, HTS4) in the same SparkWork. They belong to different MJs that handle different sets of small tables. Combining them would complicate both HashTableSinkOperator and HashTableLoader.

In terms of dependencies, MJ1 doesn't need to wait for HTS3/HTS4 in order to run, and vice versa.

Please refer to the pseudo code posted in the JIRA for implementation ideas. Thanks.
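
To make the dependency argument concrete, here is a minimal sketch that computes the transitive ancestors of a work in the example graph from this thread. The adjacency map and helper names are assumptions for illustration, not code from the patch or from the JIRA pseudo code.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: works are plain names and edges are a parent adjacency map.
final class AncestorSketch {
    // Returns every work that must finish before the given work can run.
    static Set<String> ancestors(String work, Map<String, List<String>> parentsOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending =
            new ArrayDeque<>(parentsOf.getOrDefault(work, Collections.emptyList()));
        while (!pending.isEmpty()) {
            String w = pending.pop();
            if (seen.add(w)) {
                // First visit: schedule this work's own parents as well.
                pending.addAll(parentsOf.getOrDefault(w, Collections.emptyList()));
            }
        }
        return seen;
    }

    // The example graph discussed in this thread, expressed as child -> parents.
    static Map<String, List<String>> example() {
        Map<String, List<String>> p = new HashMap<>();
        p.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
        p.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
        p.put("HTS_5", Arrays.asList("MJ_1"));
        p.put("HTS_6", Arrays.asList("MJ_2"));
        p.put("MJ_3", Arrays.asList("HTS_5", "HTS_6"));
        return p;
    }

    public static void main(String[] args) {
        System.out.println("MJ_1 waits for " + ancestors("MJ_1", example()));
        System.out.println("MJ_2 waits for " + ancestors("MJ_2", example()));
    }
}
```

In this sketch the ancestor set of MJ_1 contains only HTS_1 and HTS_2, so nothing forces it to wait for HTS_3 or HTS_4.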


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine; they will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear.
>     
>     To give more detail: for each work that has a map join op, we need to create a SparkWork to handle its small tables. So both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created.

I'm not getting it: why "This dependency is not linear"? Can you give a counterexample?
Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:

     HTS_1   HTS_2     HTS_3    HTS_4
       \      /           \     /
        \    /             \   /
          MJ_1              MJ_2
           |                 |
           |                 |
          HTS_5            HTS_6
              \            /
               \          /
                \        /
                 \      /
                  \    /
                    MJ_3
                    
Then what I'm doing is putting HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1.
Then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork, SW_2, and MJ_3 in another SparkWork, SW_3.
The dependency is SW_1 -> SW_2 -> SW_3.
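
The grouping described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the patch: Work is a hypothetical stand-in for BaseWork, and the stage number is assumed to be the number of HTS-to-child boundaries on the longest path from a root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: Work is a hypothetical stand-in for Hive's BaseWork.
final class LinearStagingSketch {
    static final class Work {
        final String name;
        final boolean isHashTableSink;
        final List<Work> parents;

        Work(String name, boolean isHashTableSink, Work... parents) {
            this.name = name;
            this.isHashTableSink = isHashTableSink;
            this.parents = Arrays.asList(parents);
        }
    }

    // Stage of a work: 0 for roots; crossing from an HTS parent to its child
    // starts the next stage, any other edge stays in the parent's stage.
    static int stage(Work w, Map<Work, Integer> memo) {
        Integer cached = memo.get(w);
        if (cached != null) {
            return cached;
        }
        int s = 0;
        for (Work p : w.parents) {
            s = Math.max(s, stage(p, memo) + (p.isHashTableSink ? 1 : 0));
        }
        memo.put(w, s);
        return s;
    }

    // Groups work names by stage; each stage corresponds to one SparkWork.
    static SortedMap<Integer, List<String>> group(List<Work> works) {
        Map<Work, Integer> memo = new HashMap<>();
        SortedMap<Integer, List<String>> stages = new TreeMap<>();
        for (Work w : works) {
            stages.computeIfAbsent(stage(w, memo), k -> new ArrayList<>()).add(w.name);
        }
        return stages;
    }

    // The diagram above, with each node treated as one work.
    static List<Work> example() {
        Work hts1 = new Work("HTS_1", true);
        Work hts2 = new Work("HTS_2", true);
        Work hts3 = new Work("HTS_3", true);
        Work hts4 = new Work("HTS_4", true);
        Work mj1 = new Work("MJ_1", false, hts1, hts2);
        Work mj2 = new Work("MJ_2", false, hts3, hts4);
        Work hts5 = new Work("HTS_5", true, mj1);
        Work hts6 = new Work("HTS_6", true, mj2);
        Work mj3 = new Work("MJ_3", false, hts5, hts6);
        return Arrays.asList(hts1, hts2, hts3, hts4, mj1, mj2, hts5, hts6, mj3);
    }

    public static void main(String[] args) {
        System.out.println(group(example()));
    }
}
```

Stages 0, 1, and 2 of the result correspond to SW_1, SW_2, and SW_3 respectively.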


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.

I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right?
Otherwise, they will form a linear dependency too.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101882>

    This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS, and w3 also contains a map join operator. The dependency in this scenario will be a graph rather than a linear chain.


- Xuefu Zhang


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 8, 2014, 12:44 a.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 224
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line224>
> >
> >     I've been thinking about this, as you had brought up a pretty rare use case where a big-table parent of mapjoin1 still had an HTS, but it's for another(!) mapjoin. I don't know if this is still a valid case, but do you think this handles it, given that it just indiscriminately adds the work to the parent map if it has an HTS?

Fixed through an offline chat.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60380
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Szehon Ho <sz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60380
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101745>

    We can add a comment.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101746>

    We should not start this with capital letters.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101846>

    I've been thinking about this, as you had brought up a pretty rare use case where a big-table parent of mapjoin1 still had an HTS, but it's for another(!) mapjoin. I don't know if this is still a valid case, but do you think this handles it, given that it just indiscriminately adds the work to the parent map if it has an HTS?


- Szehon Ho


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 100
> > <https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line100>
> >
> >     It seems possible that current is MJwork, right? Are you going to add it to the target?

Yes, it's possible. But that MJwork will be one whose HTSs have all been handled already, so we can go through it to reach HTSs for other MJworks.


> On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 115
> > <https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line115>
> >
> >     Frankly, I'm not 100% following the logic. The diagram has operators mixed with works, which makes it hard. But I'm seeing where you're coming from. Maybe you can explain to me better in person.

Here an operator name (MJ, HTS) stands for a work that contains that operator, so MJ is a BaseWork containing an MJ operator, and the same goes for HTS.
Yes, I think explaining in person would be better.


> On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 155
> > <https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line155>
> >
> >     I think there is a separate JIRA handling combining mapjoins, owned by Szehon.

As I understand it, Szehon's JIRA tries to put MJ operators in the same BaseWork. But there are some cases where we cannot apply this optimization, and the MJ operators will be in different BaseWorks. My work here tries to put those BaseWorks in the same SparkWork if there's no dependency among them.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60403
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60403
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101790>

    Nit: need a space before and after -. Same below in multiple occurrences.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101795>

    It seems possible that current is MJwork, right? Are you going to add it to the target?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101808>

    Frankly, I'm not 100% following the logic. The diagram has operators mixed with works, which makes it hard. But I'm seeing where you're coming from. Maybe you can explain to me better in person.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101799>

    I think there is a separate JIRA handling combining mapjoins, owned by Szehon.


- Xuefu Zhang


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
-----------------------------------------------------------

(Updated Nov. 7, 2014, 6:07 p.m.)


Review request for hive.


Changes
-------

Instead of using a Set, we should use a Map from a BaseWork w/ MJ to all its parent BaseWorks w/ HTSs. The principle is that we cannot process any BaseWork below this MJ until all of its HTSs are processed.
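
A minimal sketch of such a map, for illustration only: Work is a hypothetical stand-in for BaseWork, and the sketch treats every parent that contains an HTS as a small-table input of the MJ, which is a simplification (a parent may hold an HTS that feeds a different map join).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: Work is a hypothetical stand-in for Hive's BaseWork.
final class MapJoinParentIndexSketch {
    static final class Work {
        final String name;
        final boolean hasMapJoin;
        final boolean hasHashTableSink;
        final List<Work> parents;

        Work(String name, boolean hasMapJoin, boolean hasHashTableSink, Work... parents) {
            this.name = name;
            this.hasMapJoin = hasMapJoin;
            this.hasHashTableSink = hasHashTableSink;
            this.parents = Arrays.asList(parents);
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Maps each work with a map join to its parent works that contain an HTS.
    static Map<Work, List<Work>> index(List<Work> works) {
        Map<Work, List<Work>> byMapJoin = new LinkedHashMap<>();
        for (Work w : works) {
            if (!w.hasMapJoin) {
                continue;
            }
            List<Work> sinks = new ArrayList<>();
            for (Work p : w.parents) {
                if (p.hasHashTableSink) {
                    sinks.add(p);
                }
            }
            byMapJoin.put(w, sinks);
        }
        return byMapJoin;
    }

    // A map-join work may proceed only once all of its recorded HTS parents are done.
    static boolean ready(Work w, Map<Work, List<Work>> index, Set<Work> processed) {
        return processed.containsAll(index.getOrDefault(w, Collections.emptyList()));
    }

    // Assumed example: one big-table parent and two HTS parents feeding one map join.
    static List<Work> example() {
        Work big = new Work("BIG", false, false);
        Work hts1 = new Work("HTS_1", false, true);
        Work hts2 = new Work("HTS_2", false, true);
        Work mj = new Work("MJ", true, false, big, hts1, hts2);
        return Arrays.asList(big, hts1, hts2, mj);
    }

    public static void main(String[] args) {
        List<Work> works = example();
        Map<Work, List<Work>> idx = index(works);
        Set<Work> processed = new HashSet<>(Arrays.asList(works.get(1)));
        System.out.println(idx + " ready=" + ready(works.get(3), idx, processed));
    }
}
```

Keeping the HTS parents per MJ, rather than in one flat Set, is what lets the readiness check be made for each MJ separately.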


Bugs: HIVE-8622
    https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
-------

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
-------


Thanks,

Chao Sun

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
-----------------------------------------------------------

(Updated Nov. 7, 2014, 3:57 p.m.)


Review request for hive.


Changes
-------

Here is another patch with what I think is a cleaner solution. I tested it with subquery_multiinsert.q and the result looks fine. Suggestions are welcome!


Bugs: HIVE-8622
    https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
-------

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
-------


Thanks,

Chao Sun

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 128
> > <https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line128>
> >
> >     Do you mean parentTasks != null?

That was a silly mistake.


> On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 185
> > <https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line185>
> >
> >     Merge with itself?

Yes. In this case (the current BaseWork has no MJ), we merge all parent SparkWorks into the current SparkWork.
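
A rough sketch of that merge step, under the assumption that a SparkWork can be modeled as an ordered set of BaseWork names. The real SparkWork also tracks edges between works, which this illustration omits, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a "SparkWork" here is only an ordered set of work names.
class SparkWorkMergeSketch {
    // Fold every parent's works into one result, followed by the current works.
    static Set<String> mergeParents(Set<String> current, List<Set<String>> parents) {
        Set<String> merged = new LinkedHashSet<>();
        for (Set<String> parent : parents) {
            merged.addAll(parent); // parent works are placed first
        }
        merged.addAll(current);    // then the works already in the current SparkWork
        return merged;
    }
}
```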


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review59987
-----------------------------------------------------------


On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 5:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review59987
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101309>

    Do you mean parentTasks != null?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101335>

    It seems this check should be the first line in the outer for loop, for better efficiency and clarity.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27627/#comment101336>

    Merge with itself?


- Xuefu Zhang



Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 5, 2014, 9:24 p.m., Szehon Ho wrote:
> > Hi Chao, I left a review for a form of this patch at https://reviews.apache.org/r/27640/, as Suhas put it up for a separate review in combination with his patch.

Thanks, I'll take a look there.


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60034
-----------------------------------------------------------



Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Posted by Szehon Ho <sz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60034
-----------------------------------------------------------


Hi Chao, I left a review for a form of this patch at https://reviews.apache.org/r/27640/, as Suhas put it up for a separate review in combination with his patch.

- Szehon Ho

