Posted to dev@hive.apache.org by Jimmy Xiang <jx...@cloudera.com> on 2014/11/07 22:34:09 UTC
Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/
-----------------------------------------------------------
Review request for hive and Xuefu Zhang.
Bugs: HIVE-8621
https://issues.apache.org/jira/browse/HIVE-8621
Repository: hive-git
Description
-------
In the case of Spark, HashTableSinkOperator should dump files to the folder expected by HashTableLoader.
Diffs
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7
Diff: https://reviews.apache.org/r/27745/diff/
Testing
-------
Thanks,
Jimmy Xiang
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60543
-----------------------------------------------------------
Ship it!
- Xuefu Zhang
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Jimmy Xiang <jx...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/
-----------------------------------------------------------
(Updated Nov. 7, 2014, 11:59 p.m.)
Review request for hive and Xuefu Zhang.
Bugs: HIVE-8621
https://issues.apache.org/jira/browse/HIVE-8621
Repository: hive-git
Description
-------
In the case of Spark, HashTableSinkOperator should dump files to the folder expected by HashTableLoader.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7
Diff: https://reviews.apache.org/r/27745/diff/
Testing
-------
Thanks,
Jimmy Xiang
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Jimmy Xiang <jx...@cloudera.com>.
> On Nov. 7, 2014, 9:51 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 207
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line207>
> >
> > Don't we need this any more?
It is not used anywhere.
> On Nov. 7, 2014, 9:51 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 311
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line311>
> >
> > This doesn't seem to resolve conflicts between files generated by different partitions. These partitions can run on different nodes, so fileIndex might be the same.
You are right. One operator runs as many tasks because we want more reducers, so that won't work. Let me fix this.
- Jimmy
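
For readers following the fileIndex discussion, here is a minimal sketch of why a per-task counter alone can collide, and one possible way to avoid it. It is not the change made in the updated diff, which this thread does not show. The class, the method names, the ".hashtable" suffix, and the idea of using a partition ID as the distinguishing value are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Hive: builds a dump file name that stays
// unique when one operator runs as several tasks on different nodes.
final class DumpFileNameSketch {

  // A per-task counter alone can collide: every task starts counting at 0.
  static String indexOnlyName(String prefix, int fileIndex) {
    return prefix + "-" + fileIndex + ".hashtable";
  }

  // Adding a value that differs per task (assumed here to be a partition ID
  // supplied by the caller) keeps names distinct across tasks.
  static String perTaskName(String prefix, int partitionId, int fileIndex) {
    return prefix + "-" + partitionId + "-" + fileIndex + ".hashtable";
  }

  public static void main(String[] args) {
    Set<String> indexOnly = new HashSet<>();
    Set<String> perTask = new HashSet<>();
    // Simulate two tasks of the same operator, each writing its first file.
    for (int partitionId = 0; partitionId < 2; partitionId++) {
      int fileIndex = 0; // each task starts its own counter at 0
      indexOnly.add(indexOnlyName("MapJoin-1", fileIndex));
      perTask.add(perTaskName("MapJoin-1", partitionId, fileIndex));
    }
    System.out.println("index-only distinct names: " + indexOnly.size());
    System.out.println("per-task distinct names: " + perTask.size());
  }
}
```

With the index-only scheme both simulated tasks produce the same name, so one file would overwrite the other in a shared folder. The per-task scheme yields two distinct names.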
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60386
-----------------------------------------------------------
On Nov. 7, 2014, 9:34 p.m., Jimmy Xiang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27745/
> -----------------------------------------------------------
>
> (Updated Nov. 7, 2014, 9:34 p.m.)
>
>
> Review request for hive and Xuefu Zhang.
>
>
> Bugs: HIVE-8621
> https://issues.apache.org/jira/browse/HIVE-8621
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> In the case of Spark, HashTableSinkOperator should dump files to the folder expected by HashTableLoader.
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7
>
> Diff: https://reviews.apache.org/r/27745/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Jimmy Xiang
>
>
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60386
-----------------------------------------------------------
If it's found that too much customization is needed for Spark, we might as well extend from it instead.
ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101750>
Don't we need this any more?
ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101752>
This doesn't seem to resolve conflicts between files generated by different partitions. These partitions can run on different nodes, so fileIndex might be the same.
- Xuefu Zhang
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Jimmy Xiang <jx...@cloudera.com>.
> On Nov. 7, 2014, 9:51 p.m., Suhas Satish wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 314
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line314>
> >
> > What if there are 2 partitions for the big table? I guess they will then be processed on 2 separate Spark nodes, right?
> >
> > So in this case, there are 2 replicas created for this HashTableSink. How do we ensure that these 2 replicas are on the same data nodes as the ones where the 2 big table partitions will be processing map-joins?
We can't if we don't know where the big table partitions are. And if there are just two partitions, copying the small table to more nodes may take more time than fetching the data over the network.
- Jimmy
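
To make the trade-off in this reply concrete, here is a rough back-of-the-envelope sketch. It is not Hive code and not a measurement. The class, the method names, the 64 MB table size, and the node counts are assumptions chosen only to illustrate the argument, and the model counts bytes moved while ignoring bandwidth, parallelism, and caching.

```java
// Back-of-the-envelope model, not Hive code: compares bytes moved when the
// small table is pre-copied to many nodes versus fetched by each task.
final class SmallTableTransferSketch {

  // Pre-copying sends one full copy of the small table to every target node.
  static long bytesToReplicate(long smallTableBytes, int targetNodes) {
    return smallTableBytes * targetNodes;
  }

  // Fetching on demand sends one copy to each big table task that does not
  // already have the data locally.
  static long bytesToFetch(long smallTableBytes, int bigTablePartitions, int localTasks) {
    int remoteTasks = Math.max(0, bigTablePartitions - localTasks);
    return smallTableBytes * remoteTasks;
  }

  public static void main(String[] args) {
    long smallTableBytes = 64L * 1024 * 1024; // assumed 64 MB small table
    long replicate = bytesToReplicate(smallTableBytes, 10); // copy to 10 nodes
    long fetch = bytesToFetch(smallTableBytes, 2, 0);       // 2 tasks, none local
    System.out.println("replicate to 10 nodes: " + replicate + " bytes");
    System.out.println("fetch for 2 tasks:     " + fetch + " bytes");
  }
}
```

Under these assumptions, replicating widely moves more data than letting two tasks fetch the small table remotely, which is the situation the reply describes. The balance shifts as the number of big table partitions grows.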
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60388
-----------------------------------------------------------
Re: Review Request 27745: HIVE-8621 Dump small table join data for
map-join [Spark Branch]
Posted by Suhas Satish <su...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60388
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java
<https://reviews.apache.org/r/27745/#comment101753>
What if there are 2 partitions for the big table? I guess they will then be processed on 2 separate Spark nodes, right?
So in this case, there are 2 replicas created for this HashTableSink. How do we ensure that these 2 replicas are on the same data nodes as the ones where the 2 big table partitions will be processing map-joins?
- Suhas Satish