
Posted to dev@hive.apache.org by Jimmy Xiang <jx...@cloudera.com> on 2014/11/07 22:34:09 UTC

Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/
-----------------------------------------------------------

Review request for hive and Xuefu Zhang.


Bugs: HIVE-8621
    https://issues.apache.org/jira/browse/HIVE-8621


Repository: hive-git


Description
-------

In the Spark case, HashTableSinkOperator should dump its files to the folder that HashTableLoader expects.


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7 

Diff: https://reviews.apache.org/r/27745/diff/


Testing
-------


Thanks,

Jimmy Xiang

Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60543
-----------------------------------------------------------

Ship it!



- Xuefu Zhang



Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Jimmy Xiang <jx...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/
-----------------------------------------------------------

(Updated Nov. 7, 2014, 11:59 p.m.)


Review request for hive and Xuefu Zhang.


Bugs: HIVE-8621
    https://issues.apache.org/jira/browse/HIVE-8621


Repository: hive-git


Description
-------

In the Spark case, HashTableSinkOperator should dump its files to the folder that HashTableLoader expects.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7 

Diff: https://reviews.apache.org/r/27745/diff/


Testing
-------


Thanks,

Jimmy Xiang

Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Jimmy Xiang <jx...@cloudera.com>.


> On Nov. 7, 2014, 9:51 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 207
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line207>
> >
> >     Don't we need this any more?

It is not used anywhere.


> On Nov. 7, 2014, 9:51 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 311
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line311>
> >
> >     This doesn't seem to resolve conflicts between files generated by different partitions. These partitions can run on different nodes, so fileIndex might be the same.

There is one operator but many tasks, because we want more reducers. You are right, that won't work. Let me fix this.
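
One way the conflict described above could be avoided, as a minimal sketch and not the fix that was actually committed: put an identifier that is unique per task into the dumped file name, so that two tasks that both reach the same fileIndex never write to the same path. The class DumpFileNames, the method uniqueDumpFileName, the taskId parameter, and the file-name layout are all hypothetical; none of them is taken from HashTableSinkOperator.

```java
/** Hypothetical helper for illustration only; not part of Hive. */
final class DumpFileNames {

  private DumpFileNames() {
  }

  /**
   * Builds a dump file name from the small-table tag, a per-task id and the
   * task-local file index.
   *
   * Assumption: taskId is distinct for every task that runs the operator.
   * fileIndex alone is not enough, because each task counts from 0.
   */
  static String uniqueDumpFileName(int tag, String taskId, int fileIndex) {
    return "smalltable-" + tag + "-" + taskId + "-" + fileIndex + ".hashtable";
  }

  public static void main(String[] args) {
    // Two tasks on different nodes that both start counting at fileIndex 0.
    System.out.println(uniqueDumpFileName(1, "task_0", 0));
    System.out.println(uniqueDumpFileName(1, "task_1", 0));
  }
}
```

With names built this way, the two tasks in main produce different files even though their fileIndex values are equal. Whether a task id is available at that point in the operator is an assumption of the sketch.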


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60386
-----------------------------------------------------------



Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60386
-----------------------------------------------------------


If it turns out that too much customization is needed for Spark, we might as well extend from it instead.


ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101750>

    Don't we need this any more?



ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101752>

    This doesn't seem to resolve conflicts between files generated by different partitions. These partitions can run on different nodes, so fileIndex might be the same.


- Xuefu Zhang



Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Jimmy Xiang <jx...@cloudera.com>.


> On Nov. 7, 2014, 9:51 p.m., Suhas Satish wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 314
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line314>
> >
> >     What if there are 2 partitions for the big table? I guess they will then be processed on 2 separate Spark nodes, right?
> >
> >     So in this case, 2 replicas are created for this HashTableSink. How do we ensure that these 2 replicas are on the same data nodes as the ones where the 2 big table partitions will be processing map-joins?

We can't, if we don't know where the big table partitions are. If there are just two partitions, wouldn't copying the small table to more nodes take more time than fetching the data over the network?
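
The trade-off above can be made concrete with a rough count of bytes moved. This is only a back-of-the-envelope sketch under stated assumptions: the class SmallTableTransferCost and its methods are hypothetical, every big-table task is assumed to be non-local and to fetch the whole small table once, and parallelism and file system replication pipelining are ignored.

```java
/** Hypothetical cost model for illustration only; not part of Hive. */
final class SmallTableTransferCost {

  private SmallTableTransferCost() {
  }

  /** Bytes moved if the small table is copied up front to each of replicaNodes nodes. */
  static long replicateBytes(long smallTableBytes, int replicaNodes) {
    return smallTableBytes * replicaNodes;
  }

  /** Bytes moved if each big-table task fetches the small table once over the network. */
  static long fetchBytes(long smallTableBytes, int bigTableTasks) {
    return smallTableBytes * bigTableTasks;
  }

  public static void main(String[] args) {
    long size = 64L * 1024 * 1024; // assumed small-table size, chosen for illustration
    System.out.println("replicate to 10 nodes: " + replicateBytes(size, 10));
    System.out.println("fetch by 2 tasks:      " + fetchBytes(size, 2));
  }
}
```

Under this model, replicating to more nodes than there are big-table tasks always moves more data than letting the tasks fetch it, which is the case raised for two partitions.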


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60388
-----------------------------------------------------------



Re: Review Request 27745: HIVE-8621 Dump small table join data for map-join [Spark Branch]

Posted by Suhas Satish <su...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60388
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101753>

    What if there are 2 partitions for the big table? I guess they will then be processed on 2 separate Spark nodes, right?

    So in this case, 2 replicas are created for this HashTableSink. How do we ensure that these 2 replicas are on the same data nodes as the ones where the 2 big table partitions will be processing map-joins?


- Suhas Satish

