Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/27 16:42:29 UTC

[GitHub] [spark] mengxr edited a comment on issue #27722: [SPARK-30969][CORE] Remove resource coordination support from Standalone
URL: https://github.com/apache/spark/pull/27722#issuecomment-592057231
 
 
   @tgravescs I suggested removing this resource coordination code in an offline 3.0 API audit discussion to keep the Spark codebase simple. Scheduling is already an area that lacks maintainers, and increasing its complexity would keep potential maintainers away. This is why we have been very careful in introducing the resource-aware scheduling feature. Here are the reasons to remove this feature before the Spark 3.0 release:
   
   * In a real standalone deployment, running multiple workers on the same host is no longer needed. As @Ngone51 mentioned, the feature was introduced to keep the worker JVM small and avoid long GC pauses when the worker and executors shared the same JVM. Now that executors run in separate JVMs, one worker per host is sufficient. Correct me if I'm wrong, since you mentioned several deployments using multiple workers on the same host.
   * For the same reason, instead of supporting GPU scheduling for multiple workers on the same host, we should entirely deprecate support for multiple workers on the same host in 3.0 and remove it in a future release, to further simplify the codebase.
   * The local-cluster mode is not a public feature and should only be used in Spark tests. In fact, this change is the first to mention "local-cluster" in the user guide, which effectively makes it "public". I don't think we want to add (even localized) complexity just for this mode. In a test setup, we can use one worker process and separate driver/worker scripts to simplify resource allocation. Again, using multiple workers in local-cluster mode is meant to simulate a real cluster, not to serve as a "production" setup. If simulation is the goal, we just need fake discovery scripts.
   * For worker recovery, my understanding is that there shouldn't be a case where the old worker and the recovered worker are running at the same time, because recovery is usually handled by a process monitor such as monit that watches the worker process. cc: @jiangxb1987 
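   On the recovery point, a typical monit setup only restarts the worker after its process has died, so the old and recovered workers never run concurrently. A hypothetical fragment (the paths and pidfile location are assumptions, not Spark defaults):

```
# /etc/monit.d/spark-worker -- hypothetical paths and pidfile
check process spark-worker with pidfile /var/run/spark/worker.pid
  start program = "/opt/spark/sbin/start-slave.sh spark://master:7077"
  stop program  = "/opt/spark/sbin/stop-slave.sh"
```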
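   For context on the multi-worker setup discussed above: in standalone mode, multiple workers per host are enabled through `SPARK_WORKER_INSTANCES` in `conf/spark-env.sh`. A sketch of the kind of configuration this proposal would deprecate (the values here are illustrative, not a recommendation):

```shell
# conf/spark-env.sh -- illustrative values only
# Launch more than one worker JVM per host; this is the setup proposed
# for deprecation in 3.0 now that executors run in their own JVMs.
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=8      # cores each worker may hand out to executors
export SPARK_WORKER_MEMORY=16g   # memory each worker may hand out to executors
```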
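   To make the fake-discovery-script idea concrete, here is a minimal sketch of what such a script could look like. Spark runs the script configured via a worker resource setting such as `spark.worker.resource.gpu.discoveryScript` and parses its stdout as a single JSON object naming the resource and its addresses; the GPU addresses below are fabricated purely for simulation:

```shell
#!/usr/bin/env bash
# Minimal fake GPU discovery script for local-cluster testing.
# Spark expects a single JSON object on stdout: the resource name
# plus the list of addresses it can schedule.
discover_gpus() {
  # Pretend this host has two GPUs; the addresses are made up.
  echo '{"name": "gpu", "addresses": ["0", "1"]}'
}

discover_gpus
```

Since the output is static, the same script can be reused by every simulated worker in a test, which is all the simulation case requires.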
   
   cc: @squito 
