You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/24 14:20:12 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

HyukjinKwon opened a new pull request #30486:
URL: https://github.com/apache/spark/pull/30486


   ### What changes were proposed in this pull request?
   
   TL;DR:
   - This PR completes the support of archives in Spark itself instead of Yarn-only
   -  After this PR, PySpark users can use Conda to ship Python packages together as below:
       ```python
       conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
       conda activate pyspark_env
       conda pack -f -o pyspark_env.tar.gz
       PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
      ```
   
   
   This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:
   
   ```bash
   ./bin/spark-submit --help
   ```
   
   ```
   Options:
   ...
    Spark on YARN only:
     --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
     --archives ARCHIVES         Comma separated list of archives to be extracted into the
                                 working directory of each executor.
   ```
   
   This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment.
   
   Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).
   
   The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it.
   
   Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well.
   
   ### Why are the changes needed?
   
   To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.
   
   ### How was this patch tested?
   
   I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733713063


   **[Test build #131769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131769/testReport)** for PR 30486 at commit [`5b1d1c3`](https://github.com/apache/spark/commit/5b1d1c3db6c1ce68a2738dec9b54af519b026b42).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-736214769


   Oh, maybe I will use tar.gz and tgz in the integration test. That will address https://github.com/apache/spark/pull/30486#discussion_r532678330 togeter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530054680



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -422,6 +426,8 @@ class SparkContext(config: SparkConf) extends Logging {
     _jars = Utils.getUserJars(_conf)
     _files = _conf.getOption(FILES.key).map(_.split(",")).map(_.filter(_.nonEmpty))
       .toSeq.flatten
+    _archives = _conf.getOption(ARCHIVES.key).map(_.split(",")).map(_.filter(_.nonEmpty))
+      .toSeq.flatten

Review comment:
       nit: `_archives = _conf.getOption(ARCHIVES.key).map(Utils.stringToSeq).toSeq.flatten`?

##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
##########
@@ -512,6 +513,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
         |  --files FILES               Comma-separated list of files to be placed in the working
         |                              directory of each executor. File paths of these files
         |                              in executors can be accessed via SparkFiles.get(fileName).
+        |  --archives ARCHIVES         Comma separated list of archives to be extracted into the

Review comment:
       nit: `Comma separated` -> `Comma-separated` for consistency?

##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1803,6 +1803,16 @@ package object config {
     .toSequence
     .createWithDefault(Nil)
 
+  private[spark] val ARCHIVES = ConfigBuilder("spark.archives")
+    .version("3.1.0")
+    .doc("Comma separated list of archives to be extracted into the working directory of each " +

Review comment:
       btw, we don't need to describe something here about the relationship (or difference?) between `park.files` and `spark.archives`? 

##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -494,7 +495,8 @@ private[spark] object Utils extends Logging {
       securityMgr: SecurityManager,
       hadoopConf: Configuration,
       timestamp: Long,
-      useCache: Boolean): File = {
+      useCache: Boolean,
+      shouldUntar: Boolean = true): File = {

Review comment:
       In L487, could you document what the new param `shouldUntar` is?

##########
File path: core/src/main/scala/org/apache/spark/executor/Executor.scala
##########
@@ -230,16 +232,17 @@ private[spark] class Executor(
   private val appStartTime = conf.getLong("spark.app.startTime", 0)
 
   // To allow users to distribute plugins and their required files
-  // specified by --jars and --files on application submission, those jars/files should be
-  // downloaded and added to the class loader via updateDependencies.
-  // This should be done before plugin initialization below
+  // specified by --jars, --files and --archives on application submission, those
+  // jars/files/archives should be downloaded and added to the class loader vi

Review comment:
       nit: `vi` -> `via`

##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1568,21 +1613,38 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val key = if (!isLocal && scheme == "file") {
       env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+    } else if (uri.getScheme == null) {
+      schemeCorrectedURI.toString
+    } else if (isArchive) {
+      uri.toString
     } else {
-        if (uri.getScheme == null) {
-          schemeCorrectedURI.toString
-        } else {
-          path
-        }
+      path
     }
+
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
-    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+    if (!isArchive && addedFiles.putIfAbsent(key, timestamp).isEmpty) {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else if (
+      isArchive &&
+        addedArchives.putIfAbsent(
+          UriBuilder.fromUri(new URI(key)).fragment(uri.getFragment).build().toString,
+          timestamp).isEmpty) {
+      logInfo(s"Added archive $path at $key with timestamp $timestamp")
+      val uriToDownload = UriBuilder.fromUri(new URI(key)).fragment(null).build()
+      val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
+        env.securityManager, hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+      logInfo(s"Unpacking an archive $path")

Review comment:
       How about logging source and dest, too?

##########
File path: core/src/main/scala/org/apache/spark/executor/Executor.scala
##########
@@ -907,32 +911,49 @@ private[spark] class Executor(
    * Download any missing dependencies if we receive a new set of files and JARs from the
    * SparkContext. Also adds any new JARs we fetched to the class loader.
    */
-  private def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]): Unit = {
+  private def updateDependencies(
+      newFiles: Map[String, Long],
+      newJars: Map[String, Long],
+      newArchives: Map[String, Long]): Unit = {
     lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
     synchronized {
       // Fetch missing dependencies
       for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
-        logInfo("Fetching " + name + " with timestamp " + timestamp)
+        logInfo(s"Fetching $name with timestamp $timestamp")
         // Fetch file with useCache mode, close cache for local mode.
         Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
           env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
         currentFiles(name) = timestamp
       }
+      for ((name, timestamp) <- newArchives if currentArchives.getOrElse(name, -1L) < timestamp) {
+        logInfo(s"Fetching $name with timestamp $timestamp")
+        val sourceURI = new URI(name)
+        val uriToDownload = UriBuilder.fromUri(sourceURI).fragment(null).build()
+        val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
+          env.securityManager, hadoopConf, timestamp, useCache = false, shouldUntar = false)

Review comment:
       We cannot use cache for this purpose?

##########
File path: docs/configuration.md
##########
@@ -784,6 +784,17 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.3.0</td>
 </tr>
+<tr>
+  <td><code>spark.archives</code></td>
+  <td></td>
+  <td>
+    Comma separated list of archives to be extracted into the working directory of each executor.

Review comment:
       ditto: `Comma separated` -> `Comma-separated`

##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1803,6 +1803,16 @@ package object config {
     .toSequence
     .createWithDefault(Nil)
 
+  private[spark] val ARCHIVES = ConfigBuilder("spark.archives")
+    .version("3.1.0")
+    .doc("Comma separated list of archives to be extracted into the working directory of each " +

Review comment:
       nit: `Comma separated` -> `Comma-separated` for consistency?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #30486:
URL: https://github.com/apache/spark/pull/30486


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r532966969



##########
File path: core/src/test/scala/org/apache/spark/SparkContextSuite.scala
##########
@@ -160,6 +160,85 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
     }
   }
 
+  test("SPARK-33530: basic case for addArchive and listArchives") {
+    withTempDir { dir =>
+      val file1 = File.createTempFile("someprefix1", "somesuffix1", dir)
+      val file2 = File.createTempFile("someprefix2", "somesuffix2", dir)
+      val file3 = File.createTempFile("someprefix3", "somesuffix3", dir)
+      val file4 = File.createTempFile("someprefix4", "somesuffix4", dir)
+
+      val jarFile = new File(dir, "test!@$jar.jar")
+      val zipFile = new File(dir, "test-zip.zip")

Review comment:
       I manually tested tar.gz and tgz cases, and the cases without extensions too. The unpack method is a copy from Hadoop so I did not exhaustively test but I can add if that looks better to add the case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733470052






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733119823






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files` since I am reusing `spark.files` code path (`spark.jars` vs `spark.yarn.dist.jars` too). To be honest, I am not exactly sure how they resolve the conflict to each other but both can work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735633878






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530265377



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1568,21 +1612,39 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val key = if (!isLocal && scheme == "file") {
       env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+    } else if (uri.getScheme == null) {
+      schemeCorrectedURI.toString
+    } else if (isArchive) {
+      uri.toString
     } else {
-        if (uri.getScheme == null) {
-          schemeCorrectedURI.toString
-        } else {
-          path
-        }
+      path
     }
+
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
-    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+    if (!isArchive && addedFiles.putIfAbsent(key, timestamp).isEmpty) {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else if (
+      isArchive &&
+        addedArchives.putIfAbsent(
+          UriBuilder.fromUri(new URI(key)).fragment(uri.getFragment).build().toString,
+          timestamp).isEmpty) {
+      logInfo(s"Added archive $path at $key with timestamp $timestamp")
+      val uriToDownload = UriBuilder.fromUri(new URI(key)).fragment(null).build()
+      val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,

Review comment:
       I think it's fine to unpack/download on the driver side as well to mimic the previous Yarn's behaviour. At least seems that's what Yarn does in Yarn cluster mode.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530016725



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1803,6 +1803,16 @@ package object config {
     .toSequence
     .createWithDefault(Nil)
 
+  private[spark] val ARCHIVES = ConfigBuilder("spark.archives")
+    .version("3.1.0")
+    .doc("Comma separated list of archives to be extracted into the working directory of each " +
+      "executor. .jar, .tar.gz, .tgz and .zip are supported. You can specify the directory " +
+      "name to unpack via adding '#' after the file name to unpack, for example, " +
+      "'file.zip#directory'. This configuration is experimental.")

Review comment:
       Makes sense. I will instead say the behaviour will be deprecated. We can remove that back once this configuration becomes stable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733625455


   I pushed some more changes to fix some nits which are all virtually non-code change (https://github.com/apache/spark/pull/30486/commits/5b1d1c3db6c1ce68a2738dec9b54af519b026b42).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733810304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733282465






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733761482






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529590417



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -535,13 +538,23 @@ private[spark] object Utils extends Logging {
       doFetchFile(url, targetDir, fileName, conf, securityMgr, hadoopConf)
     }
 
-    // Decompress the file if it's a .tar or .tar.gz
-    if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
-      logInfo("Untarring " + fileName)
-      executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)
-    } else if (fileName.endsWith(".tar")) {
-      logInfo("Untarring " + fileName)
-      executeAndGetOutput(Seq("tar", "-xf", fileName), targetDir)
+    if (shouldUntar) {
+      // Decompress the file if it's a .tar or .tar.gz
+      if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
+        logWarning(
+          "Untarring behavior is deprecated at spark.files and " +
+            "SparkContext.addFile. Use spark.archives or SparkContext.addArchive " +
+            "instead.")
+        logInfo("Untarring " + fileName)
+        executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)

Review comment:
       Our `spark.files` and `SparkContext.addFile` have a sort of undocumented and hidden behaviour. Only in executor side, it untars if the files are `.tar.gz` or `tgz`. I think it makes sense to deprecate this behaviour and encourage users to use explicit archive handling.
   
   Also, I believe it's a good practice to avoid relying on external programs anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733510493






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733029357






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735625661


   **[Test build #131961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131961/testReport)** for PR 30486 at commit [`e35e94c`](https://github.com/apache/spark/commit/e35e94cad2888c98e03feb391a8dfcca8afa365c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733031627


   **[Test build #131670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131670/testReport)** for PR 30486 at commit [`0559c95`](https://github.com/apache/spark/commit/0559c955efe3e62ac7590b3cccefea73fd5e0fc8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733488641


   **[Test build #131743 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131743/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735554483


   **[Test build #131961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131961/testReport)** for PR 30486 at commit [`e35e94c`](https://github.com/apache/spark/commit/e35e94cad2888c98e03feb391a8dfcca8afa365c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r532678330



##########
File path: core/src/test/scala/org/apache/spark/SparkContextSuite.scala
##########
@@ -160,6 +160,85 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
     }
   }
 
+  test("SPARK-33530: basic case for addArchive and listArchives") {
+    withTempDir { dir =>
+      val file1 = File.createTempFile("someprefix1", "somesuffix1", dir)
+      val file2 = File.createTempFile("someprefix2", "somesuffix2", dir)
+      val file3 = File.createTempFile("someprefix3", "somesuffix3", dir)
+      val file4 = File.createTempFile("someprefix4", "somesuffix4", dir)
+
+      val jarFile = new File(dir, "test!@$jar.jar")
+      val zipFile = new File(dir, "test-zip.zip")

Review comment:
       are we relying on existing test for tar.gz and tar?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733567818






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733031627


   **[Test build #131670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131670/testReport)** for PR 30486 at commit [`0559c95`](https://github.com/apache/spark/commit/0559c955efe3e62ac7590b3cccefea73fd5e0fc8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733312646


   I think it's fine to avoid removing `spark.yarn.dist.archives` out yet - maybe we could think about removing out once this feature becomes stable (?). Yarn also has `spark.yarn.dist.files` and `spark.files` can work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733468067


   **[Test build #131714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131714/testReport)** for PR 30486 at commit [`59b4355`](https://github.com/apache/spark/commit/59b4355f13a5b2e3f81dd92293b696c514778ccb).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `            logInfo(s\"Adding $url to class loader\")`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733414026






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735633878






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733449074






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733642537






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-736111892


   I haven't tested in K8S yet it would take me a while. I plan to add an integration test though.
   
   Hope I can proceed it separately given that the code freeze is coming and I would like to get this in for Spark 3.1.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733568538


   **[Test build #131759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131759/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files` since I am reusing `spark.files` code path. To be honest, I am not exactly sure how they resolve the conflict to each other but both work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529583152



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1550,7 +1594,7 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val hadoopPath = new Path(schemeCorrectedURI)
     val scheme = schemeCorrectedURI.getScheme
-    if (!Array("http", "https", "ftp").contains(scheme)) {
+    if (!Array("http", "https", "ftp").contains(scheme) && !isArchive) {

Review comment:
       Archive is not supposed to be a directory.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733011314


   **[Test build #131667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131667/testReport)** for PR 30486 at commit [`469eacf`](https://github.com/apache/spark/commit/469eacfb25a6aa21118b8d89728b70ab22fc4dcb).
    * This patch **fails MiMa tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `            logInfo(s\"Adding $url to class loader\")`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733414028






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733469007






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733004565


   **[Test build #131667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131667/testReport)** for PR 30486 at commit [`469eacf`](https://github.com/apache/spark/commit/469eacfb25a6aa21118b8d89728b70ab22fc4dcb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733398510


   **[Test build #131712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131712/testReport)** for PR 30486 at commit [`c87577e`](https://github.com/apache/spark/commit/c87577e3e16587bfe3e7eb7780489d3df8a66f6a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733567925


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r532324811



##########
File path: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
##########
@@ -335,6 +335,44 @@ class SparkSubmitSuite
     sys.props("SPARK_SUBMIT") should be ("true")
   }
 
+  test("SPARK-33530: handles standalone mode with archives") {
+    val clArgs = Seq(
+      "--master", "spark://localhost:1234",
+      "--executor-memory", "5g",
+      "--executor-cores", "5",
+      "--class", "org.SomeClass",
+      "--jars", "one.jar,two.jar,three.jar",
+      "--driver-memory", "4g",
+      "--files", "file1.txt,file2.txt",
+      "--archives", "archive1.zip,archive2.jar",
+      "--num-executors", "6",
+      "--name", "beauty",
+      "--conf", "spark.ui.enabled=false",
+      "thejar.jar",
+      "arg1", "arg2")
+    val appArgs = new SparkSubmitArguments(clArgs)
+    val (childArgs, classpath, conf, mainClass) = submit.prepareSubmitEnvironment(appArgs)
+    val childArgsStr = childArgs.mkString(" ")
+    childArgsStr should include ("arg1 arg2")
+    mainClass should be ("org.SomeClass")
+
+    classpath(0) should endWith ("thejar.jar")
+    classpath(1) should endWith ("one.jar")
+    classpath(2) should endWith ("two.jar")
+    classpath(3) should endWith ("three.jar")
+
+    conf.get("spark.executor.memory") should be ("5g")
+    conf.get("spark.driver.memory") should be ("4g")
+    conf.get("spark.executor.cores") should be ("5")
+    conf.get("spark.jars") should include regex (".*one.jar,.*two.jar,.*three.jar")
+    conf.get("spark.files") should include regex (".*file1.txt,.*file2.txt")
+    conf.get("spark.archives") should include regex (".*archive1.zip,.*archive2.jar")
+    conf.get("spark.app.name") should be ("beauty")
+    conf.get(UI_ENABLED) should be (false)
+    sys.props("SPARK_SUBMIT") should be ("true")
+  }
+
+

Review comment:
       nit: redundant line




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733639558


   **[Test build #131769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131769/testReport)** for PR 30486 at commit [`5b1d1c3`](https://github.com/apache/spark/commit/5b1d1c3db6c1ce68a2738dec9b54af519b026b42).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-736112787


   yeah its not a blocker


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733397169


   **[Test build #131709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131709/testReport)** for PR 30486 at commit [`7b95396`](https://github.com/apache/spark/commit/7b95396f53c57d543ff5ee856006f808eca3401c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735225924


   Yeah, that's correct. One thing js though, if there's anything wrong in terms of conflict between Yarn distributed cach (spark.yarn.dist.* vs spark.* like spark.files), I would say this is a separate issue to handle since I am reusing the existing code path


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733694235






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733639558


   **[Test build #131769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131769/testReport)** for PR 30486 at commit [`5b1d1c3`](https://github.com/apache/spark/commit/5b1d1c3db6c1ce68a2738dec9b54af519b026b42).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733726096






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733519344


   +1, LGTM (Pending CI)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files`. To be honest, I am not exactly sure how they will conflict to each other but both work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735209644


   Thanks for the details @HyukjinKwon !
   It would be good if @tgravescs can also take a look, I am a bit rusty on some of these now - and iirc there are a bunch of corner cases in this.
   
   So my understanding is that the PR is adding functionality which existing in yarn for archives into a general support for other cluster modes too - with spark on yarn continuing to rely on distributed cache (and there should be no functionality change in yarn mode).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529583595



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1568,21 +1612,44 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val key = if (!isLocal && scheme == "file") {
       env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+    } else if (uri.getScheme == null) {
+      schemeCorrectedURI.toString
+    } else if (isArchive) {
+      uri.toString

Review comment:
       For the same reason of keeping the fragment, it uses URI when it's archive.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733397169


   **[Test build #131709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131709/testReport)** for PR 30486 at commit [`7b95396`](https://github.com/apache/spark/commit/7b95396f53c57d543ff5ee856006f808eca3401c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files`. To be honest, I am not exactly sure how they will conflictto each other but both work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529745713



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1803,6 +1803,16 @@ package object config {
     .toSequence
     .createWithDefault(Nil)
 
+  private[spark] val ARCHIVES = ConfigBuilder("spark.archives")
+    .version("3.1.0")
+    .doc("Comma separated list of archives to be extracted into the working directory of each " +
+      "executor. .jar, .tar.gz, .tgz and .zip are supported. You can specify the directory " +
+      "name to unpack via adding '#' after the file name to unpack, for example, " +
+      "'file.zip#directory'. This configuration is experimental.")

Review comment:
       Ur, if this is still experimental, we should not deprecate the old ways.
   > This configuration is experimental.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733810304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733726096






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733398510


   **[Test build #131712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131712/testReport)** for PR 30486 at commit [`c87577e`](https://github.com/apache/spark/commit/c87577e3e16587bfe3e7eb7780489d3df8a66f6a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733642537






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733506294


   @mcg1969 too FYI [conda-pack](https://conda.github.io/conda-pack/). With this change, users can use conda-pack in other cluster modes not only Yarn.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733414971


   **[Test build #131714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131714/testReport)** for PR 30486 at commit [`59b4355`](https://github.com/apache/spark/commit/59b4355f13a5b2e3f81dd92293b696c514778ccb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733119823






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733564041


   **[Test build #131743 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131743/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733510493






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733009501


   cc @zero323 and @fhoering too FYI. This is related to the docs and shipping 3rd party Python packages in PySpark apps.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733414971


   **[Test build #131714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131714/testReport)** for PR 30486 at commit [`59b4355`](https://github.com/apache/spark/commit/59b4355f13a5b2e3f81dd92293b696c514778ccb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733004565


   **[Test build #131667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131667/testReport)** for PR 30486 at commit [`469eacf`](https://github.com/apache/spark/commit/469eacfb25a6aa21118b8d89728b70ab22fc4dcb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733820509


   Thanks for working on this @HyukjinKwon !
   I have not taken a very detailed look, but wanted to understand the interaction with use of distributed cache in yarn.
   How does this coexist with that ? Or will it end up causing issues ? (archives coming in via yarn and spark executor ?)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-736214769


   Oh, maybe I will use tar.gz and tgz in the integration test. That will address https://github.com/apache/spark/pull/30486#discussion_r532678330 together.
   
   I filed a JIRA at SPARK-33615


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530279879



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1568,21 +1612,39 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val key = if (!isLocal && scheme == "file") {
       env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+    } else if (uri.getScheme == null) {
+      schemeCorrectedURI.toString
+    } else if (isArchive) {
+      uri.toString
     } else {
-        if (uri.getScheme == null) {
-          schemeCorrectedURI.toString
-        } else {
-          path
-        }
+      path
     }
+
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
-    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+    if (!isArchive && addedFiles.putIfAbsent(key, timestamp).isEmpty) {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else if (
+      isArchive &&
+        addedArchives.putIfAbsent(
+          UriBuilder.fromUri(new URI(key)).fragment(uri.getFragment).build().toString,
+          timestamp).isEmpty) {
+      logInfo(s"Added archive $path at $key with timestamp $timestamp")
+      val uriToDownload = UriBuilder.fromUri(new URI(key)).fragment(null).build()
+      val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,

Review comment:
       In fact, `--files` describes same:
   
   ```
           |  --files FILES               Comma-separated list of files to be placed in the working
           |                              directory of each executor. File paths of these files
           |                              in executors can be accessed via SparkFiles.get(fileName).
   ```
   
   I think we should fix the docs to say it's also available on the driver side :-). but I would like to run it separately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735603974






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733449074






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733519983


   Thank you @dongjoon-hyun!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-736213589


   Thanks all @dongjoon-hyun @maropu @Ngone51 @mridulm and @tgravescs. Let me merge this in.
   I will try to have some time to prepare an IT test with K8S which hopefully will be added before Spark 3.1.0 release.
   
   Merged to master.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733568045


   The R failure is a flaky one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733470052






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530016260



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1520,6 +1534,30 @@ class SparkContext(config: SparkConf) extends Logging {
    */
   def listFiles(): Seq[String] = addedFiles.keySet.toSeq
 
+  /**
+   * Add an archive to be downloaded and unpacked with this Spark job on every node.
+   *
+   * If an archive is added during execution, it will not be available until the next TaskSet

Review comment:
       Yeah, it does. It was actually copied from `addFile`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733029357






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733118921


   **[Test build #131670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131670/testReport)** for PR 30486 at commit [`0559c95`](https://github.com/apache/spark/commit/0559c955efe3e62ac7590b3cccefea73fd5e0fc8).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733568538


   **[Test build #131759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131759/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733641986


   **[Test build #131759 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131759/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733403870


   **[Test build #131709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131709/testReport)** for PR 30486 at commit [`7b95396`](https://github.com/apache/spark/commit/7b95396f53c57d543ff5ee856006f808eca3401c).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files` since I am reusing `spark.files` code path (`spark.jars` vs `spark.yarn.dist.jars` too). To be honest, I am not exactly sure how they resolve the conflict to each other but both work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733448325


   **[Test build #131712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131712/testReport)** for PR 30486 at commit [`c87577e`](https://github.com/apache/spark/commit/c87577e3e16587bfe3e7eb7780489d3df8a66f6a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529742850



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1520,6 +1534,30 @@ class SparkContext(config: SparkConf) extends Logging {
    */
   def listFiles(): Seq[String] = addedFiles.keySet.toSeq
 
+  /**
+   * Add an archive to be downloaded and unpacked with this Spark job on every node.
+   *
+   * If an archive is added during execution, it will not be available until the next TaskSet

Review comment:
       Is this limitation applied to the other ways like `spark.files` too?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735603974






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734494224


   @mridulm dose it make sense? Ill go ahead if there are not other comments :-).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733008523


   @tgravescs, @mridulm, @Ngone51, can you take a look when you guys find some time?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733327084


   **[Test build #131705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131705/testReport)** for PR 30486 at commit [`8e42ee8`](https://github.com/apache/spark/commit/8e42ee81c93ccaf60304aabb7231c319e49bd879).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733469007






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733477920


   Thanks @maropu and @dongjoon-hyun. I believe I addressed the comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530217363



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1568,21 +1612,39 @@ class SparkContext(config: SparkConf) extends Logging {
 
     val key = if (!isLocal && scheme == "file") {
       env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+    } else if (uri.getScheme == null) {
+      schemeCorrectedURI.toString
+    } else if (isArchive) {
+      uri.toString
     } else {
-        if (uri.getScheme == null) {
-          schemeCorrectedURI.toString
-        } else {
-          path
-        }
+      path
     }
+
     val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
-    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+    if (!isArchive && addedFiles.putIfAbsent(key, timestamp).isEmpty) {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else if (
+      isArchive &&
+        addedArchives.putIfAbsent(
+          UriBuilder.fromUri(new URI(key)).fragment(uri.getFragment).build().toString,
+          timestamp).isEmpty) {
+      logInfo(s"Added archive $path at $key with timestamp $timestamp")
+      val uriToDownload = UriBuilder.fromUri(new URI(key)).fragment(null).build()
+      val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,

Review comment:
       > --archives ARCHIVES         Comma-separated list of archives to be extracted into the working directory of each executor.
   
   Do we need to download archives at driver as it only declares they are extracted to executors? 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735554483


   **[Test build #131961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131961/testReport)** for PR 30486 at commit [`e35e94c`](https://github.com/apache/spark/commit/e35e94cad2888c98e03feb391a8dfcca8afa365c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733761482






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733694235






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-734037334


   It will be exactly same as `spark.files` and `spark.yarn.dist.files`. To be honest, I am not exactly sure how they resolve the conflict to each other but both work together as far as I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
mridulm edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735209644


   Thanks for the details @HyukjinKwon !
   
   So my understanding is that the PR is adding functionality which is existing in yarn for archives into a general support for other resource managers too - with spark on yarn continuing to rely on distributed cache (and there should be no functionality change in yarn mode). That sounds fine to me - but it would be good if @tgravescs can also take a look, I am a bit rusty on some of these now - and iirc there are a bunch of corner cases in this.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r530117309



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1803,6 +1803,16 @@ package object config {
     .toSequence
     .createWithDefault(Nil)
 
+  private[spark] val ARCHIVES = ConfigBuilder("spark.archives")
+    .version("3.1.0")
+    .doc("Comma separated list of archives to be extracted into the working directory of each " +

Review comment:
       I think it's fine. They are meant to be separate configurations to end-users. These were not documented as well at least for `spark.files`, `spark.jars`, `spark.yarn.dist.files`, `spark.yarn.dist.archive`, `spark.yarn.dist.jars`, etc.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-733488641


   **[Test build #131743 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131743/testReport)** for PR 30486 at commit [`15d8ed5`](https://github.com/apache/spark/commit/15d8ed51d99e57403aae3272d9975b96e7735ee2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #30486:
URL: https://github.com/apache/spark/pull/30486#issuecomment-735225924


   Yeah, that's correct. One thing is though, if there's anything wrong in terms of conflict between Yarn distributed cach (spark.yarn.dist.* vs spark.* like spark.files), I would say this is a separate issue to handle since I am reusing the existing code path


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529746205



##########
File path: core/src/test/scala/org/apache/spark/SparkContextSuite.scala
##########
@@ -160,6 +160,85 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
     }
   }
 
+  test("basic case for addArchive and listArchives") {

Review comment:
       Could you prepend `SPARK-33530: ` please?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529582202



##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -1537,8 +1575,14 @@ class SparkContext(config: SparkConf) extends Logging {
     addFile(path, recursive, false)
   }
 
-  private def addFile(path: String, recursive: Boolean, addedOnSubmit: Boolean): Unit = {
-    val uri = new Path(path).toUri
+  private def addFile(
+      path: String, recursive: Boolean, addedOnSubmit: Boolean, isArchive: Boolean = false
+    ): Unit = {
+    val uri = if (!isArchive) {
+      new Path(path).toUri
+    } else {
+      Utils.resolveURI(path)

Review comment:
       Here we cannot rely on `new Path(path).toUri`. it makes the fragment (`#`) in URI as the part of path. `Utils.resolveURI` is used for `spark.yarn.dist.archives` as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org