You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Jimmy Xiang <jx...@cloudera.com> on 2014/11/06 18:25:03 UTC

Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/
-----------------------------------------------------------

Review request for hive and Xuefu Zhang.


Bugs: HIVE-8649
    https://issues.apache.org/jira/browse/HIVE-8649


Repository: hive-git


Description
-------

First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
We need to add a SparkListener to handle cluster status change if such events are supported by spark.


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 

Diff: https://reviews.apache.org/r/27687/diff/


Testing
-------


Thanks,

Jimmy Xiang

Re: Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

Posted by Jimmy Xiang <jx...@cloudera.com>.


> On Nov. 6, 2014, 7:01 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java, line 86
> > <https://reviews.apache.org/r/27687/diff/1/?file=751768#file751768line86>
> >
> >     Can we document what are in the tuple, especially what each means?

Sure. Will add a doc.


> On Nov. 6, 2014, 7:01 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java, line 75
> > <https://reviews.apache.org/r/27687/diff/1/?file=751768#file751768line75>
> >
> >     I don't feel we need to cache this, as this can change during a user session.

Yes, it will change during a user session. I was thinking to update this when things are changed base on some event callbacks.

Such info may be needed many times if there are many reducers. It should save us some time to go to the Spark master (assuming getExecutorMemoryStatus checking with the master).


> On Nov. 6, 2014, 7:01 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java, line 89
> > <https://reviews.apache.org/r/27687/diff/1/?file=751768#file751768line89>
> >
> >     I'm not sure why this needs to be synchronized. Will this method be called by concurrent threads? It doesn't seem to be the case.

Are you saying it won't be called by many threads? Each JVM can run one query at a time during all deployment modes? How come SparkClient.getInstance is synchronized?


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/#review60210
-----------------------------------------------------------


On Nov. 6, 2014, 5:25 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27687/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2014, 5:25 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-8649
>     https://issues.apache.org/jira/browse/HIVE-8649
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
> We need to add a SparkListener to handle cluster status change if such events are supported by spark.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 
> 
> Diff: https://reviews.apache.org/r/27687/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.


> On Nov. 6, 2014, 7:01 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java, line 75
> > <https://reviews.apache.org/r/27687/diff/1/?file=751768#file751768line75>
> >
> >     I don't feel we need to cache this, as this can change during a user session.
> 
> Jimmy Xiang wrote:
>     Yes, it will change during a user session. I was thinking to update this when things are changed base on some event callbacks.
>     
>     Such info may be needed many times if there are many reducers. It should save us some time to go to the Spark master (assuming getExecutorMemoryStatus checking with the master).

1. I don't think there will be a callback.
2. Yeah, it will be called many times if there are multiple reducers. Therefore, it probably makes sense to put the info at SetSparkReducerParallelism, which is created for each query.
3. You also need to make sure this works for Spark standalone cluster. I'm not sure if you can get number of exectors/memory in the same way.


> On Nov. 6, 2014, 7:01 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java, line 89
> > <https://reviews.apache.org/r/27687/diff/1/?file=751768#file751768line89>
> >
> >     I'm not sure why this needs to be synchronized. Will this method be called by concurrent threads? It doesn't seem to be the case.
> 
> Jimmy Xiang wrote:
>     Are you saying it won't be called by many threads? Each JVM can run one query at a time during all deployment modes? How come SparkClient.getInstance is synchronized?

Yeah. Right now this is a little messy. Changes are coming. Concurrency isn't tested yet. It's fine to leave the synchronization there.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/#review60210
-----------------------------------------------------------


On Nov. 6, 2014, 5:25 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27687/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2014, 5:25 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-8649
>     https://issues.apache.org/jira/browse/HIVE-8649
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
> We need to add a SparkListener to handle cluster status change if such events are supported by spark.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 
> 
> Diff: https://reviews.apache.org/r/27687/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/#review60210
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java
<https://reviews.apache.org/r/27687/#comment101574>

    I don't feel we need to cache this, as this can change during a user session.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java
<https://reviews.apache.org/r/27687/#comment101575>

    Can we document what are in the tuple, especially what each means?



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java
<https://reviews.apache.org/r/27687/#comment101576>

    I'm not sure why this needs to be synchronized. Will this method be called by concurrent threads? It doesn't seem to be the case.


- Xuefu Zhang


On Nov. 6, 2014, 5:25 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27687/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2014, 5:25 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-8649
>     https://issues.apache.org/jira/browse/HIVE-8649
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
> We need to add a SparkListener to handle cluster status change if such events are supported by spark.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 
> 
> Diff: https://reviews.apache.org/r/27687/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/#review60253
-----------------------------------------------------------

Ship it!


Ship It!

- Xuefu Zhang


On Nov. 6, 2014, 9:23 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27687/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2014, 9:23 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-8649
>     https://issues.apache.org/jira/browse/HIVE-8649
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
> We need to add a SparkListener to handle cluster status change if such events are supported by spark.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 
> 
> Diff: https://reviews.apache.org/r/27687/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 27687: HIVE-8649 Increase level of parallelism in reduce phase [Spark Branch]

Posted by Jimmy Xiang <jx...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27687/
-----------------------------------------------------------

(Updated Nov. 6, 2014, 9:23 p.m.)


Review request for hive and Xuefu Zhang.


Bugs: HIVE-8649
    https://issues.apache.org/jira/browse/HIVE-8649


Repository: hive-git


Description
-------

First patch for HIVE-8649, to increase the number of reducers for spark based on some info about the spark cluster.
We need to add a SparkListener to handle cluster status change if such events are supported by spark.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkClient.java 5766787 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java 2dbb5a3 

Diff: https://reviews.apache.org/r/27687/diff/


Testing
-------


Thanks,

Jimmy Xiang