Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/07/11 12:24:23 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request, #41942: [WIP][SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact

HyukjinKwon opened a new pull request, #41942:
URL: https://github.com/apache/spark/pull/41942

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   




[GitHub] [spark] HyukjinKwon commented on pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41942:
URL: https://github.com/apache/spark/pull/41942#issuecomment-1633342926

   Merged to master.




[GitHub] [spark] vicennial commented on a diff in pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes

Posted by "vicennial (via GitHub)" <gi...@apache.org>.
vicennial commented on code in PR #41942:
URL: https://github.com/apache/spark/pull/41942#discussion_r1261364588


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1775,21 +1773,31 @@ class SparkContext(config: SparkConf) extends Logging {
     }
 
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
+    // If the session ID was specified from SparkSession, it's from a Spark Connect client.
+    // Specify a dedicated directory for Spark Connect client.
+    // We're running Spark Connect as a service so regular PySpark path
+    // is not affected.
+    lazy val root = if (jobArtifactUUID != "default") {
+      val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)

Review Comment:
   > reuse `PythonWorkerFactory`, which assumes that there is a UUID-named directory under `SparkFiles.getRootDirectory()` on both the Driver and the Executor
   
   Ahh gotcha, I am not very familiar with the Python side, good to know 👍 
   
   > So, after this change, we no longer upload twice:
   > 1. Directly pass the `spark://` URI to `addFile` and `addJar`.
   > 2. `addFile` and `addJar` will not attempt to upload the files, but pass the original URI through as-is.
   
   Awesome!
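
   A minimal sketch of the directory resolution shown in the excerpt above. The truncated diff only shows the first branch; the method wrapper, the `mkdirs()` call, and the `else` branch here are illustrative assumptions, not code copied from the patch:

       import java.io.File
       import org.apache.spark.SparkFiles

       // Sketch: choose the root directory for added files. A non-"default"
       // job artifact UUID signals a Spark Connect session, which gets a
       // dedicated per-session subdirectory so the regular PySpark path is
       // untouched.
       def resolveRootDir(jobArtifactUUID: String): File =
         if (jobArtifactUUID != "default") {
           val dest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
           dest.mkdirs() // assumed: ensure the per-session directory exists
           dest
         } else {
           new File(SparkFiles.getRootDirectory())
         }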
   
   





[GitHub] [spark] vicennial commented on a diff in pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes

Posted by "vicennial (via GitHub)" <gi...@apache.org>.
vicennial commented on code in PR #41942:
URL: https://github.com/apache/spark/pull/41942#discussion_r1260059769


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1775,21 +1773,31 @@ class SparkContext(config: SparkConf) extends Logging {
     }
 
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
+    // If the session ID was specified from SparkSession, it's from a Spark Connect client.
+    // Specify a dedicated directory for Spark Connect client.
+    // We're running Spark Connect as a service so regular PySpark path
+    // is not affected.
+    lazy val root = if (jobArtifactUUID != "default") {
+      val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)

Review Comment:
   Is this needed because the session-specific handling is more generic now? 
   For the JARs from Spark Connect, we previously just registered the root artifact directory in the file server and built URIs that let the executor fetch the file directly, without needing to copy it over to the generic Spark files directory.
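
   A rough sketch of that previous mechanism, assuming access to the driver's `RpcEnv` (`rpcEnv` is `private[spark]`, so this only compiles inside Spark itself); the base URI, local path, and variable names are illustrative:

       import java.io.File
       import org.apache.spark.SparkEnv

       // Sketch: register the session's artifact directory with the driver's
       // RPC file server so executors can fetch artifacts directly over the
       // returned spark:// URI, with no copy into the generic SparkFiles
       // directory. "/artifacts" and the local path are assumed names.
       val artifactRootDir = new File("/tmp/spark-connect-artifacts")
       val baseUri: String =
         SparkEnv.get.rpcEnv.fileServer.addDirectory("/artifacts", artifactRootDir)
       // A jar at <artifactRootDir>/foo.jar is then reachable by executors
       // at s"$baseUri/foo.jar".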





[GitHub] [spark] HyukjinKwon commented on pull request #41942: [SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact with relevant changes

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41942:
URL: https://github.com/apache/spark/pull/41942#issuecomment-1630796787

   Apologies that this PR happened to touch a lot of the codebase. I would appreciate it if you could find some time to take a look, cc @hvanhovell @ueshin @vicennial @zhengruifeng  




[GitHub] [spark] HyukjinKwon closed pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes
URL: https://github.com/apache/spark/pull/41942




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41942: [SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41942:
URL: https://github.com/apache/spark/pull/41942#discussion_r1260951116


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1775,21 +1773,31 @@ class SparkContext(config: SparkConf) extends Logging {
     }
 
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
+    // If the session ID was specified from SparkSession, it's from a Spark Connect client.
+    // Specify a dedicated directory for Spark Connect client.
+    // We're running Spark Connect as a service so regular PySpark path
+    // is not affected.
+    lazy val root = if (jobArtifactUUID != "default") {
+      val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)

Review Comment:
   Yeah, it now needs to reuse `PythonWorkerFactory`, which assumes that there is a UUID-named directory under `SparkFiles.getRootDirectory()` on both the Driver and the Executor. We _could_ try to reuse the local artifact directory, but I would prefer to keep a separate local copy for now for better maintainability and reusability.
   
   Otherwise, it uploads to the Spark file server twice (as we discussed offline). I pushed new changes to avoid this. So, after this change, we no longer upload twice:
   1. Directly pass the `spark://` URI to `addFile` and `addJar`.
   2. `addFile` and `addJar` will not attempt to upload the files, but pass the original URI through as-is (see the sketch below).
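   
   A minimal sketch of the bypass described in (1) and (2); `uploadToFileServer` is a hypothetical stand-in for the real upload path, not a Spark API:

       import java.net.URI

       // Sketch: if the path is already a spark:// URI served by the driver's
       // file server, pass it through unchanged instead of re-uploading;
       // otherwise fall back to the normal upload.
       def resolveFileUri(path: String, uploadToFileServer: String => String): String = {
         val scheme = Option(new URI(path).getScheme).getOrElse("file")
         scheme match {
           case "spark" => path // already on the file server; use as-is
           case _       => uploadToFileServer(path)
         }
       }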
   


