Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/15 00:43:45 UTC

[GitHub] [spark] PerilousApricot opened a new pull request #34903: Tell spark-env.sh the python interpreter

PerilousApricot opened a new pull request #34903:
URL: https://github.com/apache/spark/pull/34903

When loading config defaults via spark-env.sh, it can be useful to know
the current pyspark python interpreter to allow the configuration to set
values properly. Pass this value in the environment as
_PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script.

### What changes were proposed in this pull request?

It's currently possible to set sensible site-wide spark configuration defaults by using `$SPARK_CONF_DIR/spark-env.sh`. In the case where a user is using pyspark, however, there are a number of things that aren't discoverable by that script, due to the way that it's called. There is a chain of calls (java_gateway.py -> shell script -> java -> shell script) that ends up obliterating any bit of the python context.

This change proposes to add an environment variable, `_PYSPARK_DRIVER_SYS_EXECUTABLE`, which is set during pyspark's `java_gateway.py` bootstrapping process and holds the path of the top-level python executable. With that, `spark-env.sh` will be able to infer enough information about the python environment to set the appropriate configuration variables.
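As a rough sketch of the idea (this is not the actual diff in this PR), the launching python process copies its environment, adds its own `sys.executable` under the new variable name, and hands that environment to whatever it spawns. The helper names `build_launch_env` and `read_back_from_child` are hypothetical, and the child here is just a `python -c` one-liner standing in for the shell script -> java -> shell script chain, assuming each step inherits its parent's environment.

```python
import os
import subprocess
import sys

# Variable name as proposed in this PR description.
ENV_VAR = "_PYSPARK_DRIVER_SYS_EXECUTABLE"


def build_launch_env(base_env=None):
    """Return a copy of the environment with the driver interpreter path added.

    Hypothetical helper for illustration; not a function in pyspark.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = sys.executable
    return env


def read_back_from_child(env):
    """Spawn a child process and return the value it sees for ENV_VAR.

    The child is a stand-in for the script chain; a real launch would go
    through the Spark launcher scripts rather than `python -c`.
    """
    code = "import os; print(os.environ.get(%r, ''))" % ENV_VAR
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints the interpreter path as observed by the child process.
    print(read_back_from_child(build_launch_env()))
```

Because the value travels in the environment rather than as an argument, nothing in between needs to know about it, as long as no step in the chain scrubs the environment.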

### Why are the changes needed?

Right now, there are a number of config options useful to pyspark that can't be reliably set by `spark-env.sh` because it is unaware of the python context that is spawning the executor. To give the most trivial example, it is currently possible to set `spark.kubernetes.container.image` or `spark.driver.host` based on information readily available from the environment (e.g. the k8s downward API). However, `spark.pyspark.python` and family cannot be set because by the time `spark-env.sh` executes, all of the python context has been lost. We can instruct users to add the appropriate config variables themselves, but this form of cargo-culting is error-prone and not scalable. It would be much better to expose important python variables so that pyspark is not a second-class citizen.

### Does this PR introduce _any_ user-facing change?

Yes. With this change, if python spawns the JVM, `spark-env.sh` will receive an environment variable `_PYSPARK_DRIVER_SYS_EXECUTABLE` pointing to the python executable.

### How was this patch tested?

To be perfectly honest, I don't know where this fits into the testing infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to java_gateway.py and that works, but in terms of adding this to the CI ... I'm at a loss. I'm more than willing to add the additional info, if needed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
