Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/26 06:51:26 UTC

[GitHub] [spark] tiehexue opened a new pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be …

tiehexue opened a new pull request #32326:
URL: https://github.com/apache/spark/pull/32326


   There are three LocationStrategy implementations: PreferBrokers, PreferConsistent, and PreferFixed. I have a scenario that needs a random one. There are many topic partitions that differ from each other in the records they contain, and I have a lot of executors. PreferBrokers does not help here. PreferConsistent makes things worse, because some executors always get the heavy tasks. PreferFixed does not help either, because it is fixed, not to mention that I would have to create the mapping manually.
   
   A random LocationStrategy should dispatch a topic partition to different executors in different windows, which would balance the load among the Spark executors.
   
   ### What changes were proposed in this pull request?
   I added a new method, getExecutorHosts, to SparkContext, which returns the list of executor host names. I also added a PreferRandom case object, which has a random method that returns a "fake" map whose get method returns a randomly chosen host name.
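
The "fake" map idea can be illustrated with a minimal sketch. The class name `RandomHostMap`, its constructor parameters, and the generic key type `K` (standing in for Kafka's `TopicPartition`) are assumptions made for illustration; this is not the code from the PR.

```scala
import java.{util => ju}
import scala.util.Random

// Hypothetical helper, not part of Spark or of this PR: a read-only map
// whose `get` ignores the key and returns a randomly chosen host name.
class RandomHostMap[K](hosts: IndexedSeq[String], rng: Random = new Random)
    extends ju.AbstractMap[K, String] {
  require(hosts.nonEmpty, "at least one host is required")

  // Each lookup picks a host uniformly at random, so the same key can be
  // mapped to different hosts on successive calls.
  override def get(key: AnyRef): String = hosts(rng.nextInt(hosts.length))

  // AbstractMap requires entrySet; this map holds no real entries.
  override def entrySet(): ju.Set[ju.Map.Entry[K, String]] =
    ju.Collections.emptySet[ju.Map.Entry[K, String]]()
}
```

Because the map reports no entries, anything that iterates over it or calls `containsKey` sees it as empty; only direct `get` lookups return a host, so this sketch is useful only where the consumer looks hosts up by key.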
   
   ### Does this PR introduce _any_ user-facing change?
   Users will have another LocationStrategy option that may be helpful.
   
   
   ### How was this patch tested?
   I constructed a PreferFixed with a RandomLocationStrategyMap inside and verified it with 1000+ topic partitions against 2000+ executors. It worked.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be randomly distributed across all executors

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32326:
URL: https://github.com/apache/spark/pull/32326#issuecomment-826219740


   Can one of the admins verify this patch?




[GitHub] [spark] tiehexue commented on a change in pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be randomly distributed across all executors

Posted by GitBox <gi...@apache.org>.
tiehexue commented on a change in pull request #32326:
URL: https://github.com/apache/spark/pull/32326#discussion_r619775081



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1812,6 +1812,13 @@ class SparkContext(config: SparkConf) extends Logging {
   /** The version of Spark on which this application is running. */
   def version: String = SPARK_VERSION
 
+  /**
+   * Return an array of executors' host name from the block manager.
+   */
+  def getExecutorHosts: Array[String] = {

Review comment:
       Thanks for your comment. It reminded me that there is a typo in the previous commit. It should be:
   getExecutorMemoryStatus.map(_._1.split(":")(0)).toArray
   
   I did not originally intend it to be an API. But the splitting is kind of messy when it is repeated in other places. Making it an API would help me out and would encapsulate how the host name list is obtained.
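
The host extraction discussed above can be sketched outside `SparkContext` as follows. The object and method names are hypothetical, and the `memoryStatus` parameter stands in for the result of `getExecutorMemoryStatus`, assuming its keys are `host:port` strings. Unlike the one-liner above, this sketch splits on the last colon and removes duplicate hosts; both are illustrative choices, not part of the PR.

```scala
object ExecutorHosts {
  // Hypothetical helper: `memoryStatus` stands in for the map returned by
  // SparkContext.getExecutorMemoryStatus, keyed by "host:port" strings.
  def fromMemoryStatus(memoryStatus: collection.Map[String, (Long, Long)]): Array[String] =
    memoryStatus.keys
      .map { address =>
        // Drop the ":port" suffix; keep the address unchanged if no colon is present.
        val idx = address.lastIndexOf(':')
        if (idx >= 0) address.substring(0, idx) else address
      }
      .toArray
      .distinct // several executors can run on the same host
      .sorted
}
```

With real Spark data the map may also contain an entry for the driver's block manager, so a caller that wants executor hosts only would likely need to filter that entry out.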






[GitHub] [spark] HyukjinKwon commented on a change in pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be randomly distributed across all executors

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32326:
URL: https://github.com/apache/spark/pull/32326#discussion_r619767006



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1812,6 +1812,13 @@ class SparkContext(config: SparkConf) extends Logging {
   /** The version of Spark on which this application is running. */
   def version: String = SPARK_VERSION
 
+  /**
+   * Return an array of executors' host name from the block manager.
+   */
+  def getExecutorHosts: Array[String] = {

Review comment:
       I wouldn't add this as an API at `SparkContext`. You could just do it with `getExecutorMemoryStatus.map(_._1).toArray`.






[GitHub] [spark] github-actions[bot] commented on pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be randomly distributed across all executors

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #32326:
URL: https://github.com/apache/spark/pull/32326#issuecomment-892250497


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] github-actions[bot] closed pull request #32326: [SPARK-35212][Spark Core][DStreams] Added PreferRandom for the scenario that topic partitions need to be randomly distributed across all executors

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #32326:
URL: https://github.com/apache/spark/pull/32326


   

