Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/12/15 01:05:00 UTC

[jira] [Assigned] (SPARK-37650) Tell spark-env.sh the python interpreter

     [ https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37650:
------------------------------------

    Assignee: Apache Spark

> Tell spark-env.sh the python interpreter
> ----------------------------------------
>
>                 Key: SPARK-37650
>                 URL: https://issues.apache.org/jira/browse/SPARK-37650
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Andrew Malone Melo
>            Assignee: Apache Spark
>            Priority: Major
>
> When loading configuration defaults via spark-env.sh, it can be useful to know
> which Python interpreter is driving pyspark so that the script can set
> configuration values appropriately. Pass this value to the environment script
> as _PYSPARK_DRIVER_SYS_EXECUTABLE.
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide Spark configuration defaults using {{$SPARK_CONF_DIR/spark-env.sh}}. When a user runs pyspark, however, there are a number of things that script cannot discover because of the way it is called: the chain of calls (java_gateway.py -> shell script -> java -> shell script) ends up obliterating every bit of the Python context.
> This change proposes adding an environment variable, {{_PYSPARK_DRIVER_SYS_EXECUTABLE}}, that points to the filename of the top-level Python executable and is set within pyspark's {{java_gateway.py}} bootstrapping process. With that, spark-env.sh will be able to infer enough about the Python environment to set the appropriate configuration variables.
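> To make the mechanism concrete, here is a minimal sketch (not the actual patch; the surrounding bootstrap code and the use of spark-submit --version are only illustrative) of how {{java_gateway.py}} could export the interpreter path before launching the JVM:
> {code:python}
> import os
> import sys
> from subprocess import Popen
>
> # Copy the current environment and expose the driver's Python interpreter.
> # sys.executable is the interpreter running this (driver-side) process.
> env = dict(os.environ)
> env["_PYSPARK_DRIVER_SYS_EXECUTABLE"] = sys.executable
>
> # spark-submit sources load-spark-env.sh, which sources spark-env.sh, so the
> # variable is visible there. Assumes SPARK_HOME is set in the environment.
> spark_submit = os.path.join(os.environ["SPARK_HOME"], "bin", "spark-submit")
> proc = Popen([spark_submit, "--version"], env=env)
> proc.wait()
> {code}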
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be reliably set by {{spark-env.sh}}, because that script is unaware of the Python context that spawned it. To give the most trivial example, it is currently possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} from information readily available in the environment (e.g. the Kubernetes downward API). However, {{spark.pyspark.python}} and friends cannot be set, because by the time {{spark-env.sh}} executes, all of the Python context has been lost. We can instruct users to add the appropriate config variables themselves, but that sort of cargo-culting is error-prone and does not scale. It would be much better to expose the important Python variables so that pyspark is not a second-class citizen.
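> As an illustration, a site-wide {{spark-env.sh}} could then consume the variable roughly like this (a sketch only; it assumes the driver and executors resolve the same interpreter path, e.g. because they run the same container image):
> {code:bash}
> # $SPARK_CONF_DIR/spark-env.sh (illustrative)
> if [ -n "${_PYSPARK_DRIVER_SYS_EXECUTABLE:-}" ]; then
>   # Reuse the driver's interpreter for PySpark; these are the standard
>   # environment equivalents of spark.pyspark.python / spark.pyspark.driver.python.
>   export PYSPARK_PYTHON="${_PYSPARK_DRIVER_SYS_EXECUTABLE}"
>   export PYSPARK_DRIVER_PYTHON="${_PYSPARK_DRIVER_SYS_EXECUTABLE}"
> fi
> {code}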
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if Python spawns the JVM, {{spark-env.sh}} will receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing to the driver's Python executable.
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to java_gateway.py and that works, but in terms of adding this to the CI ... I'm at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
