Posted to issues@spark.apache.org by "Andrew Malone Melo (Jira)" <ji...@apache.org> on 2021/12/15 00:49:00 UTC

[jira] [Created] (SPARK-37650) Tell spark-env.sh the python interpreter

Andrew Malone Melo created SPARK-37650:
------------------------------------------

             Summary: Tell spark-env.sh the python interpreter
                 Key: SPARK-37650
                 URL: https://issues.apache.org/jira/browse/SPARK-37650
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.2.0
            Reporter: Andrew Malone Melo


When loading config defaults via spark-env.sh, it can be useful to know
the current pyspark python interpreter so that the script can set
configuration values appropriately. Pass this value to the environment
script as _PYSPARK_DRIVER_SYS_EXECUTABLE.
h3. What changes were proposed in this pull request?

It's currently possible to set sensible site-wide spark configuration defaults by using {{$SPARK_CONF_DIR/spark-env.sh}}. In the case where a user is using pyspark, however, there are a number of things that aren't discoverable by that script, due to the way that it's called. There is a chain of calls (java_gateway.py -> shell script -> java -> shell script) that ends up obliterating any bit of the python context.

This change proposes to add an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the path of the top-level python executable, set within pyspark's {{java_gateway.py}} bootstrapping process. With that, spark-env.sh will be able to infer enough information about the python environment to set the appropriate configuration variables.
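As a rough illustration (a standalone sketch of the mechanism, not the actual patch; the helper name is invented for illustration, and in pyspark the change would live in {{java_gateway.py}}, right where {{spark-submit}} is spawned), the idea is simply to copy {{sys.executable}} into the child environment before launching the JVM:
{code:python}
# Standalone sketch of the proposed mechanism (not the actual patch).
# In pyspark this would sit in java_gateway.py, where spark-submit is spawned.
import os
import subprocess
import sys


def launch_with_interpreter_hint(spark_submit_cmd):
    """Spawn spark-submit with the driver-side python interpreter exported.

    spark_submit_cmd is a list such as ["/path/to/bin/spark-submit", ...].
    """
    env = dict(os.environ)
    # The proposed variable: the absolute path of the python executable that
    # is bootstrapping the JVM (sys.executable of the calling process).
    env["_PYSPARK_DRIVER_SYS_EXECUTABLE"] = sys.executable
    return subprocess.Popen(spark_submit_cmd, env=env)
{code}
Because {{spark-submit}} sources {{spark-env.sh}} with the environment it inherits, the script would then see the variable and could, for example, use it to export {{PYSPARK_PYTHON}} site-wide.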
h3. Why are the changes needed?

Right now, there are a number of config options useful to pyspark that can't be reliably set by {{spark-env.sh}} because it is unaware of the python context that is spawning it. To give the most trivial example, it is currently possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} based on information readily available from the environment (e.g. the k8s downward API). However, {{spark.pyspark.python}} and family cannot be set because, by the time {{spark-env.sh}} executes, all of the python context has been lost. We can instruct users to add the appropriate config variables themselves, but this form of cargo-culting is error-prone and not scalable. It would be much better to expose the important python variables so that pyspark is not a second-class citizen.
h3. Does this PR introduce _any_ user-facing change?

Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing to the python executable that launched it.
h3. How was this patch tested?

To be perfectly honest, I don't know where this fits into the testing infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to java_gateway.py and that works, but in terms of adding this to the CI ... I'm at a loss. I'm more than willing to add the additional info, if needed.
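For what it's worth, one possible shape for such a test (a hypothetical pytest-style sketch, not an existing PySpark test; it only exercises the environment-propagation mechanism, not spark-submit itself) might be:
{code:python}
# Hypothetical test sketch: spawn a child the same way the proposed
# java_gateway.py change would spawn the JVM, then check that the child
# sees the parent's interpreter path.
import os
import subprocess
import sys


def test_interpreter_hint_is_exported():
    env = dict(os.environ)
    env["_PYSPARK_DRIVER_SYS_EXECUTABLE"] = sys.executable

    # The child exits 0 only if the variable matches its own interpreter
    # path, which equals the parent's because it is started with
    # sys.executable.
    child_code = (
        "import os, sys; "
        "sys.exit(0 if os.environ.get('_PYSPARK_DRIVER_SYS_EXECUTABLE') "
        "== sys.executable else 1)"
    )
    proc = subprocess.Popen([sys.executable, "-c", child_code], env=env)
    assert proc.wait() == 0
{code}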



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org