Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/29 07:38:14 UTC

[GitHub] [spark] zsxwing commented on a change in pull request #28986: [SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors.

zsxwing commented on a change in pull request #28986:
URL: https://github.com/apache/spark/pull/28986#discussion_r461194397



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -2554,6 +2557,19 @@ object SparkContext extends Logging {
     }
   }
 
+  /**
+   * Called to ensure that SparkContext is created or accessed only on the Driver.
+   *
+   * Throws an exception if a SparkContext is about to be created in executors.
+   */
+  private def assertOnDriver(): Unit = {
+    if (TaskContext.get != null) {

Review comment:
       This is pretty similar to the old `spark.driver.allowMultipleContexts`. It appears to work if you are lucky, but there are some surprising behaviors that are pretty hard to debug. For example, `SparkEnv.get` will be set to a new instance when someone creates a SparkContext in an executor. (Search for `SparkEnv.get` and you will find that lots of code running in executors calls it.)
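
   To illustrate the failure mode, here is a minimal hypothetical repro (the object name, app names, and master URL are made up for this example; they are not from the PR):

   ```scala
   import org.apache.spark.{SparkConf, SparkContext}

   object NestedContextRepro {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(
         new SparkConf().setMaster("local-cluster[1,1,1024]").setAppName("outer"))
       sc.parallelize(1 to 1, 1).foreach { _ =>
         // This closure runs in an executor JVM. Constructing a SparkContext here
         // replaces the process-wide SparkEnv singleton with a new "driver" env,
         // so any executor code that calls SparkEnv.get afterwards sees the wrong one.
         new SparkContext(new SparkConf().setMaster("local[1]").setAppName("nested"))
       }
       sc.stop()
     }
   }
   ```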
   
   I'm pretty surprised that nobody has reported this issue in MLflow. I myself have used SparkContext in executors to run stress tests for code running on the driver, and I did hit multiple weird issues. I managed to work around them with some hacks, such as setting `SparkEnv` back, but that requires deep Spark knowledge. I believe many users won't be able to figure out the root cause without the explicit error added by this PR.
   
   IMO, it's better to stop people from doing this, but we should also add a flag that gives users an option to turn the check off, so that they have some time to migrate their workloads away from this fragile pattern. We can eventually remove the flag, like we did for `spark.driver.allowMultipleContexts`.
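
   To make that suggestion concrete, here is a rough sketch of what the check plus escape hatch could look like. The config name `spark.executor.allowSparkContext` is an assumption for illustration, not necessarily what this PR ends up adding:

   ```scala
   import org.apache.spark.{SparkConf, TaskContext}

   object DriverOnlyCheck {
     // Sketch only: the config key below is an assumed name for the escape hatch.
     def assertOnDriver(conf: SparkConf): Unit = {
       val allowInExecutors = conf.getBoolean("spark.executor.allowSparkContext", false)
       // TaskContext.get is non-null only while a task is running, i.e. in an executor.
       if (TaskContext.get != null && !allowInExecutors) {
         throw new IllegalStateException(
           "SparkContext should only be created and accessed on the driver.")
       }
     }
   }
   ```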

##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -2554,6 +2557,19 @@ object SparkContext extends Logging {
     }
   }
 
+  /**
+   * Called to ensure that SparkContext is created or accessed only on the Driver.
+   *
+   * Throws an exception if a SparkContext is about to be created in executors.
+   */
+  private def assertOnDriver(): Unit = {
+    if (TaskContext.get != null) {

Review comment:
       @smurching how does MLflow create a SparkContext? IIUC, it doesn't create a SparkContext in the executor JVM, so it should not be impacted by this change.

##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -2554,6 +2557,19 @@ object SparkContext extends Logging {
     }
   }
 
+  /**
+   * Called to ensure that SparkContext is created or accessed only on the Driver.
+   *
+   * Throws an exception if a SparkContext is about to be created in executors.
+   */
+  private def assertOnDriver(): Unit = {
+    if (TaskContext.get != null) {

Review comment:
       NVM. I thought the check was only added on the JVM side. If we remove the PySpark-side change, it should not impact MLflow. That's actually a case that works pretty well...

##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -2554,6 +2557,19 @@ object SparkContext extends Logging {
     }
   }
 
+  /**
+   * Called to ensure that SparkContext is created or accessed only on the Driver.
+   *
+   * Throws an exception if a SparkContext is about to be created in executors.
+   */
+  private def assertOnDriver(): Unit = {
+    if (TaskContext.get != null) {

Review comment:
       So I'm +1 on blocking SparkContext creation in the executor JVM because it messes up the singleton `SparkEnv` (and there may be others). But blocking it in the Python process is probably not worth it, since that case works pretty well (unless in the future we introduce some singleton objects in executor Python processes).
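
   For reference, the JVM-side detection this relies on is just the task-context check; a tiny standalone illustration (the helper name is made up for this example):

   ```scala
   import org.apache.spark.TaskContext

   // Illustrative helper, not from the PR: TaskContext.get returns a non-null value
   // only while a task is executing, which is how the JVM-side check tells executor
   // code apart from driver code. A Python worker process has no JVM TaskContext,
   // which is why this check does not constrain the Python side.
   object WhereAmI {
     def inTask: Boolean = TaskContext.get != null
   }
   ```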



