You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 04:15:53 UTC

[GitHub] [spark] pralabhkumar opened a new pull request, #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

pralabhkumar opened a new pull request, #37417:
URL: https://github.com/apache/spark/pull/37417

   
   ### What changes were proposed in this pull request?
   This PR will place spark.files , spark.jars and spark.pyfiles to the current working directory on the driver in K8s cluster mode 
   
   
   ### Why are the changes needed?
   This mimics the behaviour of Yarn and also helps user to access files from PWD . Also as mentioned in the jira 
   By doing this, users can, for example, leverage PEX to manage Python dependences in Apache Spark:
   ```
   pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
   PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Tested via unit test cases and also ran on local K8s cluster. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] advancedxy commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by "advancedxy (via GitHub)" <gi...@apache.org>.
advancedxy commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1517797912

   @pralabhkumar thanks for your work. I noticed similar issue when running spark application on K8S, it's helpful feature
   
   However, this pr might have some inefficiency to download files/jars twice when running k8s-cluster mode.
   
   ``` java
     if (deployMode == CLIENT) {
         // jars are downloaded once
         localJars = Option(args.jars).map {
           downloadFileList(_, targetDir, sparkConf, hadoopConf)
         }.orNull
         // py files are downloaded once
         localPyFiles = Option(args.pyFiles).map {
           downloadFileList(_, targetDir, sparkConf, hadoopConf)
         }.orNull
         if (isKubernetesClusterModeDriver) {
           def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false): String = {
              ...
           }
   
           val filesLocalFiles = Option(args.files).map {
             downloadResourcesToCurrentDirectory(_)
           }.orNull
           // jars are downloaded again
           val jarsLocalJars = Option(args.jars).map {
             downloadResourcesToCurrentDirectory(_)
           }.orNull
           val archiveLocalFiles = Option(args.archives).map {
             downloadResourcesToCurrentDirectory(_, true)
           }.orNull
           // py files are downloaded again
           val pyLocalFiles = Option(args.pyFiles).map {
             downloadResourcesToCurrentDirectory(_)
           }.orNull
       }
     }
   ```
   
   Would you mind to create a followup pr to address this issue? @pralabhkumar 
   
   Also, there's another catch when running spark on k8s with --files/--archives:
   These files/archives are already downloaded here, however they are passed as args.files, args.archives, the spark context would copied them (and/or untar them) again when constructing the context, see relevant code:
   https://github.com/apache/spark/blob/d407a42090d7c027050be7ee723f7e3d8f686ed7/core/src/main/scala/org/apache/spark/SparkContext.scala#L440-L443
   And
   https://github.com/apache/spark/blob/d407a42090d7c027050be7ee723f7e3d8f686ed7/core/src/main/scala/org/apache/spark/SparkContext.scala#L524-L544
   
   cc @Ngone51 @holdenk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1236552864

   ping @dongjoon-hyun @Ngone51 . I think this is important feature to have the same behaviour as yarn . Please review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1208420156

   @tgravescs 
   
   https://github.com/apache/spark/pull/30735/files#r540935585 , In case K8s files are not copied to current working directory . Files are available in local but not in current working directory for driver.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] asfgit closed pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
URL: https://github.com/apache/spark/pull/37417


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
holdenk commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1347772151

   Thanks for making the PR :D 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
tgravescs commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1208165210

   > Does this PR introduce any user-facing change?
   > No
   
   Is this true?  If I specify a file to pass along and my code references it, what did it do before?   I know normally on yarn I would do some like ./XXXX. 
   
   I'm a bit worried here about breaking people if we move location and they expected it somewhere else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1324989401

   @Dooyoung-Hwang please let me know if this feature is not required . I can close the PR . However IMHO , this is a parity fix feature w.r.t to yarn 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on a diff in pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on code in PR #37417:
URL: https://github.com/apache/spark/pull/37417#discussion_r972605146


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -381,45 +382,52 @@ private[spark] class SparkSubmit extends Logging {
       localPyFiles = Option(args.pyFiles).map {
         downloadFileList(_, targetDir, sparkConf, hadoopConf)
       }.orNull
-
       if (isKubernetesClusterModeDriver) {
-        // Replace with the downloaded local jar path to avoid propagating hadoop compatible uris.
-        // Executors will get the jars from the Spark file server.
-        // Explicitly download the related files here
-        args.jars = localJars
-        val filesLocalFiles = Option(args.files).map {
-          downloadFileList(_, targetDir, sparkConf, hadoopConf)
-        }.orNull
-        val archiveLocalFiles = Option(args.archives).map { uris =>
+        // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
+        // in cluster mode, the archives should be available in the driver's current working
+        // directory too.
+        // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current
+        // working directory
+        def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false):
+        String = {
           val resolvedUris = Utils.stringToSeq(uris).map(Utils.resolveURI)
-          val localArchives = downloadFileList(
+          val localResources = downloadFileList(
             resolvedUris.map(
               UriBuilder.fromUri(_).fragment(null).build().toString).mkString(","),
             targetDir, sparkConf, hadoopConf)
-
-          // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
-          // in cluster mode, the archives should be available in the driver's current working
-          // directory too.
-          Utils.stringToSeq(localArchives).map(Utils.resolveURI).zip(resolvedUris).map {
-            case (localArchive, resolvedUri) =>
-              val source = new File(localArchive.getPath)
+          Utils.stringToSeq(localResources).map(Utils.resolveURI).zip(resolvedUris).map {
+            case (localResources, resolvedUri) =>
+              val source = new File(localResources.getPath)
               val dest = new File(
                 ".",
                 if (resolvedUri.getFragment != null) resolvedUri.getFragment else source.getName)
               logInfo(
-                s"Unpacking an archive $resolvedUri " +
+                s"Files  $resolvedUri " +
                   s"from ${source.getAbsolutePath} to ${dest.getAbsolutePath}")
               Utils.deleteRecursively(dest)
-              Utils.unpack(source, dest)
-
+              if (isArchive) Utils.unpack(source, dest) else Files.copy(source.toPath, dest.toPath)
               // Keep the URIs of local files with the given fragments.
               UriBuilder.fromUri(
-                localArchive).fragment(resolvedUri.getFragment).build().toString
+                localResources).fragment(resolvedUri.getFragment).build().toString
           }.mkString(",")
+        }
+
+        val filesLocalFiles = Option(args.files).map {
+          downloadResourcesToCurrentDirectory(_)
+        }.orNull
+        val jarsLocalJars = Option(args.jars).map {
+          downloadResourcesToCurrentDirectory(_)
+        }.orNull
+        val archiveLocalFiles = Option(args.archives).map {
+          downloadResourcesToCurrentDirectory(_, true)
+        }.orNull
+        val pyLocalFiles = Option(args.pyFiles).map {
+          downloadResourcesToCurrentDirectory(_)
         }.orNull
         args.files = filesLocalFiles
         args.archives = archiveLocalFiles
-        args.pyFiles = localPyFiles
+        args.pyFiles = pyLocalFiles
+        args.jars = jarsLocalJars

Review Comment:
   Since localJars is already defined above , used this variable (which is simillar to filesLocalFiles)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37417:
URL: https://github.com/apache/spark/pull/37417#discussion_r939612367


##########
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##########
@@ -486,6 +486,38 @@ class SparkSubmitSuite
     conf.get("spark.kubernetes.driver.container.image") should be ("bar")
   }
 
+  test("SPARK-33782 : handles k8s  files download to current directory") {

Review Comment:
   ```suggestion
     test("SPARK-33782 : handles k8s files download to current directory") {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1238132273

   The change generally looks good to me. It'd be good if @dongjoon-hyun could take a look since he has more knowledge in K8s.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1238899049

   Thx @Ngone51  , for reviewing this , i'll do the suggested changes. 
   
   @dongjoon-hyun Please review the PR . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1206608039

   @HyukjinKwon  Please review , build is failing because of un related error (since its passing on local)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37417:
URL: https://github.com/apache/spark/pull/37417#discussion_r939616121


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -381,45 +381,52 @@ private[spark] class SparkSubmit extends Logging {
       localPyFiles = Option(args.pyFiles).map {
         downloadFileList(_, targetDir, sparkConf, hadoopConf)
       }.orNull
-
       if (isKubernetesClusterModeDriver) {
-        // Replace with the downloaded local jar path to avoid propagating hadoop compatible uris.
-        // Executors will get the jars from the Spark file server.
-        // Explicitly download the related files here
-        args.jars = localJars
-        val filesLocalFiles = Option(args.files).map {
-          downloadFileList(_, targetDir, sparkConf, hadoopConf)
-        }.orNull
-        val archiveLocalFiles = Option(args.archives).map { uris =>
+        // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
+        // in cluster mode, the archives should be available in the driver's current working
+        // directory too.
+        // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current
+        // working directory
+        def downloadResourcesToCurrentDirectory(uris: String): String = {
           val resolvedUris = Utils.stringToSeq(uris).map(Utils.resolveURI)
-          val localArchives = downloadFileList(
+          val localResources = downloadFileList(
             resolvedUris.map(
               UriBuilder.fromUri(_).fragment(null).build().toString).mkString(","),
             targetDir, sparkConf, hadoopConf)
-
-          // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
-          // in cluster mode, the archives should be available in the driver's current working
-          // directory too.
-          Utils.stringToSeq(localArchives).map(Utils.resolveURI).zip(resolvedUris).map {
-            case (localArchive, resolvedUri) =>
-              val source = new File(localArchive.getPath)
+          Utils.stringToSeq(localResources).map(Utils.resolveURI).zip(resolvedUris).map {
+            case (localResources, resolvedUri) =>
+              val source = new File(localResources.getPath)
               val dest = new File(
                 ".",
                 if (resolvedUri.getFragment != null) resolvedUri.getFragment else source.getName)
               logInfo(
-                s"Unpacking an archive $resolvedUri " +
+                s"Files  $resolvedUri " +
                   s"from ${source.getAbsolutePath} to ${dest.getAbsolutePath}")
               Utils.deleteRecursively(dest)
               Utils.unpack(source, dest)

Review Comment:
   Hm, why do we try this unpack for jars and files too? I think we should just call `downloadFileList` for them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37417:
URL: https://github.com/apache/spark/pull/37417#discussion_r939612387


##########
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##########
@@ -486,6 +486,38 @@ class SparkSubmitSuite
     conf.get("spark.kubernetes.driver.container.image") should be ("bar")
   }
 
+  test("SPARK-33782 : handles k8s  files download to current directory") {
+    val clArgs = Seq(
+      "--deploy-mode", "client",
+      "--proxy-user", "test.user",
+      "--master", "k8s://host:port",
+      "--executor-memory", "5g",
+      "--class", "org.SomeClass",
+      "--driver-memory", "4g",
+      "--conf", "spark.kubernetes.namespace=spark",
+      "--conf", "spark.kubernetes.driver.container.image=bar",
+      "--conf", "spark.kubernetes.submitInDriver=true",
+      "--files", "src/test/resources/test_metrics_config.properties",
+      "--py-files", "src/test/resources/test_metrics_system.properties",
+      "--archives", "src/test/resources/log4j2.properties",
+      "/home/thejar.jar",
+      "arg1")
+    val appArgs = new SparkSubmitArguments(clArgs)
+    val (childArgs, classpath, conf, mainClass) = submit.prepareSubmitEnvironment(appArgs)
+    conf.get("spark.master") should be ("k8s://https://host:port")
+    conf.get("spark.executor.memory") should be ("5g")
+    conf.get("spark.driver.memory") should be ("4g")
+    conf.get("spark.kubernetes.namespace") should be ("spark")
+    conf.get("spark.kubernetes.driver.container.image") should be ("bar")
+    import java.nio.file.{Paths, Files}

Review Comment:
   Let's import this on the top



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1217455718

   ping @dongjoon-hyun @Ngone51 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1347736718

   @HyukjinKwon  Thx for giving LGTM.  Please merge the PR 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1207088896

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1246809845

   Gentle ping @dongjoon-hyun


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stijndehaes commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by "stijndehaes (via GitHub)" <gi...@apache.org>.
stijndehaes commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1558629449

   Maybe this is not the correct place to ask, but this PR had created some issues for our flow. Maybe our flow was not how things were intended to be used but it goes like this:
   We submit jobs on kubernetes by baking in the jar in the docker image, we copy the jar in the docker image on the workdir location.
   The result of this PR is that our jar is removed, because of the `Utils.deleteRecursively(dest)`.
   Now a quick fix is to copy the jar somewhere else, but if this sounds like unintended behavior to you guys, I am willing to open up a PR to fix it.
   I am however not sure what the intended behavior would be. Just remove this `Utils.deleteRecursively(dest)` sounds a bit too simple :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1207342632

   cc @dongjoon-hyun @tgravescs @Ngone51 FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1242729004

   Gentle ping @dongjoon-hyun  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1347768333

   Thx @holdenk for your help . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
holdenk commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1347764987

   Merged, thanks for the reminder.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1258084449

   @HyukjinKwon  , can u please reviewed or get it reviewed. Since this is important jira (opened by you) and also fix parity with yarn . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37417:
URL: https://github.com/apache/spark/pull/37417#discussion_r963689315


##########
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##########
@@ -486,6 +486,41 @@ class SparkSubmitSuite
     conf.get("spark.kubernetes.driver.container.image") should be ("bar")
   }
 
+  test("SPARK-33782 : handles k8s files download to current directory") {

Review Comment:
   nit:
   ```suggestion
     test("SPARK-33782: handles k8s files download to current directory") {
   ```



##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -381,45 +382,52 @@ private[spark] class SparkSubmit extends Logging {
       localPyFiles = Option(args.pyFiles).map {
         downloadFileList(_, targetDir, sparkConf, hadoopConf)
       }.orNull
-

Review Comment:
   revert unnecessary change?



##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -381,45 +382,52 @@ private[spark] class SparkSubmit extends Logging {
       localPyFiles = Option(args.pyFiles).map {
         downloadFileList(_, targetDir, sparkConf, hadoopConf)
       }.orNull
-
       if (isKubernetesClusterModeDriver) {
-        // Replace with the downloaded local jar path to avoid propagating hadoop compatible uris.
-        // Executors will get the jars from the Spark file server.
-        // Explicitly download the related files here
-        args.jars = localJars
-        val filesLocalFiles = Option(args.files).map {
-          downloadFileList(_, targetDir, sparkConf, hadoopConf)
-        }.orNull
-        val archiveLocalFiles = Option(args.archives).map { uris =>
+        // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
+        // in cluster mode, the archives should be available in the driver's current working
+        // directory too.
+        // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current
+        // working directory
+        def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false):
+        String = {
           val resolvedUris = Utils.stringToSeq(uris).map(Utils.resolveURI)
-          val localArchives = downloadFileList(
+          val localResources = downloadFileList(
             resolvedUris.map(
               UriBuilder.fromUri(_).fragment(null).build().toString).mkString(","),
             targetDir, sparkConf, hadoopConf)
-
-          // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
-          // in cluster mode, the archives should be available in the driver's current working
-          // directory too.
-          Utils.stringToSeq(localArchives).map(Utils.resolveURI).zip(resolvedUris).map {
-            case (localArchive, resolvedUri) =>
-              val source = new File(localArchive.getPath)
+          Utils.stringToSeq(localResources).map(Utils.resolveURI).zip(resolvedUris).map {
+            case (localResources, resolvedUri) =>
+              val source = new File(localResources.getPath)
               val dest = new File(
                 ".",
                 if (resolvedUri.getFragment != null) resolvedUri.getFragment else source.getName)
               logInfo(
-                s"Unpacking an archive $resolvedUri " +
+                s"Files  $resolvedUri " +
                   s"from ${source.getAbsolutePath} to ${dest.getAbsolutePath}")
               Utils.deleteRecursively(dest)
-              Utils.unpack(source, dest)
-
+              if (isArchive) Utils.unpack(source, dest) else Files.copy(source.toPath, dest.toPath)
               // Keep the URIs of local files with the given fragments.
               UriBuilder.fromUri(
-                localArchive).fragment(resolvedUri.getFragment).build().toString
+                localResources).fragment(resolvedUri.getFragment).build().toString
           }.mkString(",")
+        }
+
+        val filesLocalFiles = Option(args.files).map {
+          downloadResourcesToCurrentDirectory(_)
+        }.orNull
+        val jarsLocalJars = Option(args.jars).map {
+          downloadResourcesToCurrentDirectory(_)
+        }.orNull
+        val archiveLocalFiles = Option(args.archives).map {
+          downloadResourcesToCurrentDirectory(_, true)
+        }.orNull
+        val pyLocalFiles = Option(args.pyFiles).map {
+          downloadResourcesToCurrentDirectory(_)
         }.orNull
         args.files = filesLocalFiles
         args.archives = archiveLocalFiles
-        args.pyFiles = localPyFiles
+        args.pyFiles = pyLocalFiles
+        args.jars = jarsLocalJars

Review Comment:
   `jarsLocalJars` -> `localJars`?



##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -381,45 +382,52 @@ private[spark] class SparkSubmit extends Logging {
       localPyFiles = Option(args.pyFiles).map {
         downloadFileList(_, targetDir, sparkConf, hadoopConf)
       }.orNull
-
       if (isKubernetesClusterModeDriver) {
-        // Replace with the downloaded local jar path to avoid propagating hadoop compatible uris.
-        // Executors will get the jars from the Spark file server.
-        // Explicitly download the related files here
-        args.jars = localJars
-        val filesLocalFiles = Option(args.files).map {
-          downloadFileList(_, targetDir, sparkConf, hadoopConf)
-        }.orNull
-        val archiveLocalFiles = Option(args.archives).map { uris =>
+        // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
+        // in cluster mode, the archives should be available in the driver's current working
+        // directory too.
+        // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current
+        // working directory
+        def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false):
+        String = {
           val resolvedUris = Utils.stringToSeq(uris).map(Utils.resolveURI)
-          val localArchives = downloadFileList(
+          val localResources = downloadFileList(
             resolvedUris.map(
               UriBuilder.fromUri(_).fragment(null).build().toString).mkString(","),
             targetDir, sparkConf, hadoopConf)
-
-          // SPARK-33748: this mimics the behaviour of Yarn cluster mode. If the driver is running
-          // in cluster mode, the archives should be available in the driver's current working
-          // directory too.
-          Utils.stringToSeq(localArchives).map(Utils.resolveURI).zip(resolvedUris).map {
-            case (localArchive, resolvedUri) =>
-              val source = new File(localArchive.getPath)
+          Utils.stringToSeq(localResources).map(Utils.resolveURI).zip(resolvedUris).map {
+            case (localResources, resolvedUri) =>
+              val source = new File(localResources.getPath)
               val dest = new File(
                 ".",
                 if (resolvedUri.getFragment != null) resolvedUri.getFragment else source.getName)
               logInfo(
-                s"Unpacking an archive $resolvedUri " +
+                s"Files  $resolvedUri " +
                   s"from ${source.getAbsolutePath} to ${dest.getAbsolutePath}")
               Utils.deleteRecursively(dest)
-              Utils.unpack(source, dest)
-
+              if (isArchive) Utils.unpack(source, dest) else Files.copy(source.toPath, dest.toPath)

Review Comment:
   ```suggestion
                 if (isArchive) {
                   Utils.unpack(source, dest)
                 } else {
                   Files.copy(source.toPath, dest.toPath)
                 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1248915000

    @dongjoon-hyun , please review this , it will help to have K8s similar functionality as Yarn . Since this is target for Spark3.4 (as per jira) , IMHO , it would be great if u can spend some time on it . 
   
   cc @HyukjinKwon (since jira is created by [HyukjinKwon])


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1251334364

   @dongjoon-hyun , Have incorporated all the review comments , please look into the same. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1253298725

   @HyukjinKwon @Ngone51 @dongjoon-hyun 
   Gentle ping to please review the same. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1222603252

   ping @dongjoon-hyun @Ngone51 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1272748900

   @Yikun can u please review this PR . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pm-nuance commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

Posted by "pm-nuance (via GitHub)" <gi...@apache.org>.
pm-nuance commented on PR #37417:
URL: https://github.com/apache/spark/pull/37417#issuecomment-1630954916

   @pralabhkumar @HyukjinKwon @holdenk 
   Facing issue with the new Spark version which is using Files.Copy in SparkSubmit.scala.
   The latest change in the SparkSubmit.scala is causing the NoSuchFileException.
   The mentioned "sample.jar" is present in the mentioned path /opt/spark/work-dir/sample.jar inside the docker image.
   My same code is working fine with Spark 3.3.0.
   
   **Here is the Error Message**
   Files  local:///opt/spark/work-dir/sample.jar from /opt/spark/work-dir/sample.jar to /opt/spark/work-dir/./sample.jar
   Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/sample.jar
           at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
           at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
           at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
           at java.nio.file.Files.copy(Files.java:1274)
           at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437)
           at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
           at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
           at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
           at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
           at scala.collection.TraversableLike.map(TraversableLike.scala:286)
           at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
           at scala.collection.AbstractTraversable.map(Traversable.scala:108)
         
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org