You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Suhas Satish <su...@gmail.com> on 2014/11/05 21:29:27 UTC

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/
-----------------------------------------------------------

(Updated Nov. 5, 2014, 8:29 p.m.)


Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.


Summary (updated)
-----------------

HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]


Repository: hive-git


Description
-------

This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 

Diff: https://reviews.apache.org/r/27640/diff/


Testing
-------


Thanks,

Suhas Satish

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.


> On Nov. 5, 2014, 10:41 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 201
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line201>
> >
> >     Hi Suhas, I was taking a look through the code, I dont think its easy now to identify which is the big-table parent vs small-table parent.  There is a HashTableDummyOperator representing the small-table but it only has some basic information.
> >     
> >     Maybe you know more about it, but was wondering do we need to save the info to a context when we cut the small-table RS from MapJoin in ReduceSinkMapJoinProc?  Thanks.

Hi Szehon,
GenSparkProcContext has this - 
  // we need to keep the original list of operators in the map join to know
  // what position in the mapjoin the different parent work items will have.
  public final Map<MapJoinOperator, List<Operator<?>>> mapJoinParentMap;
  
There is also another data structure in GenSparkProcContext to keep track of which MapJoinWork is connected to which ReduceSinks. 
  // a map to keep track of what reduce sinks have to be hooked up to
  // map join work
  public final Map<BaseWork, List<ReduceSinkOperator>> linkWorkWithReduceSinkMap;
  
Maybe we need to introduce a similar one for HashTableSinkOperator  like 
 public final Map<BaseWork, List<HashTableSinkOperator>> linkWorkWithHashTableSinkMap;
 
In any case,  we should pass this GenSparkProcContext along to the physicalContext in the pyhsical resolvers. Let me know your thoughts.


- Suhas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60052
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Szehon Ho <sz...@cloudera.com>.


> On Nov. 5, 2014, 10:41 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 201
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line201>
> >
> >     Hi Suhas, I was taking a look through the code, I dont think its easy now to identify which is the big-table parent vs small-table parent.  There is a HashTableDummyOperator representing the small-table but it only has some basic information.
> >     
> >     Maybe you know more about it, but was wondering do we need to save the info to a context when we cut the small-table RS from MapJoin in ReduceSinkMapJoinProc?  Thanks.
> 
> Suhas Satish wrote:
>     Hi Szehon,
>     GenSparkProcContext has this - 
>       // we need to keep the original list of operators in the map join to know
>       // what position in the mapjoin the different parent work items will have.
>       public final Map<MapJoinOperator, List<Operator<?>>> mapJoinParentMap;
>       
>     There is also another data structure in GenSparkProcContext to keep track of which MapJoinWork is connected to which ReduceSinks. 
>       // a map to keep track of what reduce sinks have to be hooked up to
>       // map join work
>       public final Map<BaseWork, List<ReduceSinkOperator>> linkWorkWithReduceSinkMap;
>       
>     Maybe we need to introduce a similar one for HashTableSinkOperator  like 
>      public final Map<BaseWork, List<HashTableSinkOperator>> linkWorkWithHashTableSinkMap;
>      
>     In any case,  we should pass this GenSparkProcContext along to the physicalContext in the pyhsical resolvers. Let me know your thoughts.

Hi Suhas, can we re-use that even?  It seems that only small-table RS are connected to MJ at this point.  So big-table RS should never get into here.  If we can't re-use it we will have to create a new data structure.  The idea is to identify which RS to replace with HashTableSink.  Hope that makes sense, thanks.


- Szehon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60052
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.


> On Nov. 5, 2014, 10:41 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 201
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line201>
> >
> >     Hi Suhas, I was taking a look through the code, I dont think its easy now to identify which is the big-table parent vs small-table parent.  There is a HashTableDummyOperator representing the small-table but it only has some basic information.
> >     
> >     Maybe you know more about it, but was wondering do we need to save the info to a context when we cut the small-table RS from MapJoin in ReduceSinkMapJoinProc?  Thanks.
> 
> Suhas Satish wrote:
>     Hi Szehon,
>     GenSparkProcContext has this - 
>       // we need to keep the original list of operators in the map join to know
>       // what position in the mapjoin the different parent work items will have.
>       public final Map<MapJoinOperator, List<Operator<?>>> mapJoinParentMap;
>       
>     There is also another data structure in GenSparkProcContext to keep track of which MapJoinWork is connected to which ReduceSinks. 
>       // a map to keep track of what reduce sinks have to be hooked up to
>       // map join work
>       public final Map<BaseWork, List<ReduceSinkOperator>> linkWorkWithReduceSinkMap;
>       
>     Maybe we need to introduce a similar one for HashTableSinkOperator  like 
>      public final Map<BaseWork, List<HashTableSinkOperator>> linkWorkWithHashTableSinkMap;
>      
>     In any case,  we should pass this GenSparkProcContext along to the physicalContext in the pyhsical resolvers. Let me know your thoughts.
> 
> Szehon Ho wrote:
>     Hi Suhas, can we re-use that even?  It seems that only small-table RS are connected to MJ at this point.  So big-table RS should never get into here.  If we can't re-use it we will have to create a new data structure.  The idea is to identify which RS to replace with HashTableSink.  Hope that makes sense, thanks.

Agreed. We only replace a reduceSink with HashTableSink if it exists in mapJoinParentMap for that mapJoin. Also, another thing I had forgotten in the code was to set the parents for hashTableSinkOps introduced. I have added this now hashTableSinkOp.setParentOperators(parentOps);

No new data structures are needed.


- Suhas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60052
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Szehon Ho <sz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60052
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101408>

    Hi Suhas, I was taking a look through the code, I dont think its easy now to identify which is the big-table parent vs small-table parent.  There is a HashTableDummyOperator representing the small-table but it only has some basic information.
    
    Maybe you know more about it, but was wondering do we need to save the info to a context when we cut the small-table RS from MapJoin in ReduceSinkMapJoinProc?  Thanks.


- Szehon Ho


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Chao Sun <ch...@cloudera.com>.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 188
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line188>
> >
> >     Can you elaborate why we need this assumption?  This may not be true in all cases.

Actually, we don't need this assumption anymore. I'll remove it.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 141
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line141>
> >
> >     Please use proper javadoc notation for your javadocs.

I didn't use javadoc since it's a private method. Maybe I can write a better description on what it does?


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60031
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 254
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line254>
> >
> >     Are you sure we dont need to initialize the HTSOperator's values like it does in LocalMapJoinProcFactory?
> 
> Suhas Satish wrote:
>     I will take a closer look.

I dug into the history of this changeset a bit. 
It was introduced in this commit 
https://github.com/apache/hive/commit/9b4ba6a9bb2a1184857fc8cca11e3dc6c48c1380

>From one of the comments on HIVE-4867, 
there is a problem in mapjoin on tez. MR compiler replaces RS with HashSink made from value exprs of Join but Tez compiler uses RS as is,  assuming it has same columns with value exprs of Join, which is not true

HIVE-4867 dedups columns in RS for reducer join and RS for order-by. But small aliases of mapjoin of MR tasks still contains key columns in value exprs.
 
Not having this can at worst, be a performance issue on memory (slightly larger footprint) but not impact functionality.


- Suhas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60031
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 201
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line201>
> >
> >     We need to handle the case of where parent is the mapwork of big-table.  Basically if we are walking from big-table to a reduce work of mapjoin, we should not replace RS with HTS.  (it can happen for example, if big-table is the result from a group by).
> >     
> >     Only work of small-tables should get replaced.

Good point, I will add a check for this.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 202
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line202>
> >
> >     Let's just get mapJoinOp and check if its null in one call.  (Can get rid of one of the methods).

agreed, will fix this.


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 244
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line244>
> >
> >     Please fix all these tabs, and make sure to indent them properly.

Sorry about that, will do. Need to change my IDE default settings


> On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 254
> > <https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line254>
> >
> >     Are you sure we dont need to initialize the HTSOperator's values like it does in LocalMapJoinProcFactory?

I will take a closer look.


- Suhas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60031
-----------------------------------------------------------


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Szehon Ho <sz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60031
-----------------------------------------------------------


Hi, I know this is a combination of patches from Suhas and Chao, but I left all some comments/questions here for the overall patch.


ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101369>

    Please use proper javadoc notation for your javadocs.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101365>

    Can you elaborate why we need this assumption?  This may not be true in all cases.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101373>

    We need to handle the case of where parent is the mapwork of big-table.  Basically if we are walking from big-table to a reduce work of mapjoin, we should not replace RS with HTS.  (it can happen for example, if big-table is the result from a group by).
    
    Only work of small-tables should get replaced.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101371>

    Let's just get mapJoinOp and check if its null in one call.  (Can get rid of one of the methods).



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101363>

    Fix indent here.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101366>

    Please fix all these tabs, and make sure to indent them properly.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101368>

    Are you sure we dont need to initialize the HTSOperator's values like it does in LocalMapJoinProcFactory?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
<https://reviews.apache.org/r/27640/#comment101367>

    Nit: 'parentOps' is probably a better name.


- Szehon Ho


On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2014, 8:29 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/#review60278
-----------------------------------------------------------


Few comments about the patch: 

It is triggered only when RS-MJ is found in the operator graph with ReduceSink the parent of Map-Join operator. So this correctly identifies ReduceSinks to convert to HashTableSink and does not inadvertently convert other RSs as demonstrated in the example nested query plan in the Join Optimization document - 

“explain select * FROM
(SELECT avg(key) as x1, value as x2 FROM src group by value) x
JOIN
(SELECT avg(key) as y1, value as y2 FROM src group by value) y ON (x1 = y1)”

- Suhas Satish


On Nov. 6, 2014, 11:43 p.m., Suhas Satish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27640/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2014, 11:43 p.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java f1c3564 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 92600be 
> 
> Diff: https://reviews.apache.org/r/27640/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Suhas Satish
> 
>

Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27640/
-----------------------------------------------------------

(Updated Nov. 6, 2014, 11:43 p.m.)


Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang.


Repository: hive-git


Description
-------

This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesnt exist yet. We need to decide where is the right place to populate this field. 


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java f1c3564 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 92600be 

Diff: https://reviews.apache.org/r/27640/diff/


Testing
-------


Thanks,

Suhas Satish