Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/29 18:06:41 UTC

[GitHub] [spark] xkrogen commented on a change in pull request #34120: [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line.

xkrogen commented on a change in pull request #34120:
URL: https://github.com/apache/spark/pull/34120#discussion_r718765256



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1470,6 +1471,84 @@ private object Client extends Logging {
     (mainUri ++ secondaryUris).toArray
   }
 
+  /**
+   * Returns an array of local, absolute file URLs representing the user classpath. Note that
+   * this must be executed on the same host that will access the URLs, since it resolves
+   * relative paths against the current working directory and expands environment variables.
+   *
+   * @param conf Spark configuration.
+   * @param useClusterPath Whether to use the 'cluster' path when resolving paths with the
+   *                       `local` scheme. This should be used when running on the cluster, but
+   *                       not when running on the gateway (i.e. for the driver in `client` mode).
+   * @return Array of local URLs ready to be passed to a [[java.net.URLClassLoader]].
+   */
+  def getUserClasspathUrls(conf: SparkConf, useClusterPath: Boolean): Array[URL] = {
+    Client.getUserClasspath(conf).map { uri =>
+      val inputPath = uri.getPath
+      val replacedFilePath = if (Utils.isLocalUri(uri.toString) && useClusterPath) {
+        Client.getClusterPath(conf, inputPath)
+      } else {
+        // Any other URI schemes should have been resolved by this point
+        assert(uri.getScheme == null || uri.getScheme == "file" || Utils.isLocalUri(uri.toString),
+          "getUserClasspath should only return 'file' or 'local' URIs but found: " + uri)
+        inputPath
+      }
+      val envVarResolvedFilePath = replaceEnvVars(replacedFilePath, sys.env)
+      Paths.get(envVarResolvedFilePath).toAbsolutePath.toUri.toURL
+    }
+  }
+
+  /**
+   * Replaces environment variables in a string according to the same rules as [[Environment]]:
+   * `$VAR_NAME` for Unix, `%VAR_NAME%` for Windows, and `{{VAR_NAME}}` for all OSes.
+   * Escapes are supported for the `$` and `%` characters, e.g.
+   * `\$FOO` and `%%FOO%%` will be resolved to `$FOO` and `%FOO%`, respectively, instead of being
+   * treated as variable names.
+   *
+   * @param unresolvedString The unresolved string which may contain variable references.
+   * @param env The System environment
+   * @param isWindows True iff running in a Windows environment
+   * @return The input string with variables replaced with their values from `env`
+   */
+  def replaceEnvVars(

Review comment:
       I agree, it's far from ideal.
   
   I've been looking into `StringSubstitutor`, and it works great for the YARN style (`{{ ... }}`) and the Unix curly-braces style (`${...}`). I'm having trouble getting the escapes to work right for the Windows style (`% ... %`) -- I think it doesn't like that the prefix, suffix, and escape characters are all the same. Unfortunately this won't work at all for the Unix style without curly braces (`$VAR_NAME`), because `StringSubstitutor` assumes you always have a suffix. I tried implementing a custom `StringMatcher` to get around this, but it doesn't work, because there's no way to tell `StringSubstitutor` that the current character does mark the end of a variable, _but_ it is not actually a suffix character and should be maintained in the output string.
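   
   To illustrate, here's a rough sketch of what I experimented with (assumes Commons Text's `StringSubstitutor` on the classpath; illustrative only, not code from this PR):
   
   ```scala
   import scala.collection.JavaConverters._
   import org.apache.commons.text.StringSubstitutor
   
   val env = Map("FOO" -> "bar").asJava
   
   // Styles with a distinct prefix/suffix pair work fine:
   new StringSubstitutor(env, "${", "}").replace("a/${FOO}/b")   // "a/bar/b"
   new StringSubstitutor(env, "{{", "}}").replace("a/{{FOO}}/b") // "a/bar/b"
   
   // But there is no way to express the suffix-less `$FOO` style: the suffix
   // matcher always consumes what it matches, so the '/' in "$FOO/b" would be
   // swallowed rather than kept in the output.
   ```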
   
   Given all of this, I think it makes sense to stick with regex for all of them, rather than using `StringSubstitutor` for some and regex for others.
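   
   For concreteness, here's a simplified sketch of the regex approach (mine, not the PR's actual implementation; it skips edge cases such as the `${VAR_NAME}` curly-brace form on Unix, and it resolves undefined variables to the empty string):
   
   ```scala
   import scala.util.matching.Regex
   
   def replaceEnvVarsSketch(
       unresolved: String,
       env: Map[String, String],
       isWindows: Boolean): String = {
     // {{VAR_NAME}} is recognized on all OSes.
     val yarnResolved = """\{\{(\w+)\}\}""".r.replaceAllIn(unresolved,
       m => Regex.quoteReplacement(env.getOrElse(m.group(1), "")))
     if (isWindows) {
       // %% escapes a literal '%'; %VAR_NAME% is a variable reference. The
       // escape comes first in the alternation, so %%FOO%% resolves to %FOO%.
       """%%|%(\w+)%""".r.replaceAllIn(yarnResolved, m =>
         Regex.quoteReplacement(
           if (m.matched == "%%") "%" else env.getOrElse(m.group(1), "")))
     } else {
       // \$ escapes a literal '$'; $VAR_NAME is a variable reference.
       """\\\$|\$(\w+)""".r.replaceAllIn(yarnResolved, m =>
         Regex.quoteReplacement(
           if (m.matched == "\\$") "$" else env.getOrElse(m.group(1), "")))
     }
   }
   ```
   
   e.g. `replaceEnvVarsSketch("""$FOO/\$BAR""", Map("FOO" -> "x"), isWindows = false)` returns `x/$BAR`, since the escape alternative is tried before the variable-reference alternative.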
   
   > If this was passed in an environment variable, does everything get replaced properly? I thought YARN would put it in the launch script and then properly handle replacements. Environment variables have a much longer length limit, and normally you would set the classpath environment variable if doing it the more traditional way, so I assume it would fit.
   
   I think you're suggesting that, instead of using Spark configs to pass the values, we pass them via an environment variable?
   
   AFAIU, environment variables are still subject to size limitations; in fact, the command-line arguments and the environment variables count toward the same total limit ([read more here](https://unix.stackexchange.com/questions/110282/cp-max-source-files-number-arguments-for-copy-utility/110301#110301)). You do at least bypass some limitations, notably the maximum length of a _single_ argument, but I don't see much benefit in going down this route. Let me know if I'm missing anything.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


