Posted to issues@spark.apache.org by "Willi Raschkowski (Jira)" <ji...@apache.org> on 2022/07/02 02:54:00 UTC
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ]
Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM:
-------------------------------------------------------------------
In our fork, we solve this with something like
{code:scala}
import java.io.File
import java.nio.file.Path

/**
 * Append the directory to the subprocess' PATH environment variable.
 *
 * This allows the Python subprocess to find additional executables when the environment
 * containing those executables was added at runtime (e.g. via sc.addArchive()).
 */
def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = {
  // If PATH is already set, append `dir` after the separator; otherwise PATH becomes `dir`.
  processBuilder.environment().compute("PATH", (_, oldPath) =>
    Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString))
}
{code}
and
{code:scala}
PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder)
{code}
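For anyone who cannot patch Spark, roughly the same effect can probably be approximated from inside the Python script itself. The following is only a minimal sketch and not part of our fork: the helper name {{append_interpreter_bin_to_path}} is made up for illustration, and it assumes the conda-installed executables live in the same directory as the interpreter (e.g. {{./environment/bin}}).

```python
import os
import sys


def append_interpreter_bin_to_path(env=None, executable=None):
    """Append the directory containing the Python interpreter to PATH.

    Hypothetical helper (not a Spark or PySpark API). `env` defaults to
    os.environ and `executable` to sys.executable; both are parameters only
    so the function can be tried out without touching the real environment.
    """
    env = os.environ if env is None else env
    executable = sys.executable if executable is None else executable
    bin_dir = os.path.dirname(os.path.abspath(executable))
    old_path = env.get("PATH")
    # Mirror the Scala snippet: append when PATH exists, otherwise set it.
    env["PATH"] = bin_dir if old_path is None else old_path + os.pathsep + bin_dir
    return bin_dir
```

Calling {{append_interpreter_bin_to_path()}} at the top of the script, before any library looks up executables, should be enough for subprocesses started from that script. It has to be repeated in every script, though, which is why we prefer fixing it on the Spark side.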
> Add environment bin folder to R/Python subprocess PATH
> ------------------------------------------------------
>
> Key: SPARK-39659
> URL: https://issues.apache.org/jira/browse/SPARK-39659
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Willi Raschkowski
> Priority: Major
>
> Some Python packages rely on non-Python executables which are usually made available on the {{PATH}} through something like {{conda activate}}.
> When using Spark with conda-pack environments added via {{spark.archives}}, Python packages aren't able to find conda-installed executables because Spark doesn't update {{PATH}}.
> E.g.
> {code:python|title=test.py}
> import plotly.express as px
>
> # This only works if kaleido-python can find the conda-installed executable
> fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species")
> fig.write_image("figure.png", engine="kaleido")
> {code}
> and
> {code:bash}
> ./bin/spark-submit --master yarn --deploy-mode cluster \
>   --archives environment.tar.gz#environment \
>   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
>   test.py
> {code}
> will throw
> {code:java}
> Traceback (most recent call last):
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/kaleido-test.py", line 7, in <module>
> fig.write_image("figure.png", engine="kaleido")
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3829, in write_image
> return pio.write_image(self, *args, **kwargs)
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 267, in write_image
> img_data = to_image(
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 144, in to_image
> img_bytes = scope.transform(
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", line 153, in transform
> response = self._perform_transform(
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 293, in _perform_transform
> self._ensure_kaleido()
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 176, in _ensure_kaleido
> proc_args = self._build_proc_args()
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 123, in _build_proc_args
> proc_args = [self.executable_path(), self.scope_name]
> File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 99, in executable_path
> raise ValueError(
> ValueError:
> The kaleido executable is required by the kaleido Python library, but it was not included
> in the Python package and it could not be found on the system PATH.
> Searched for included kaleido executable at:
> /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/executable/kaleido
> Searched for executable 'kaleido' on the following system PATH:
> /usr/local/sbin
> /usr/local/bin
> /usr/sbin
> /usr/bin
> /sbin
> /bin
> {code}
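The traceback above lists the {{PATH}} entries that were searched. To run a similar check from inside a job, something like the following could be dropped into the script. This is only a sketch: {{describe_lookup}} is a hypothetical helper, and it uses the standard-library {{shutil.which}}, which may not match exactly how a given package searches for its executable.

```python
import os
import shutil


def describe_lookup(name, path=None):
    """Return (resolved location or None, list of PATH entries) for `name`.

    Hypothetical diagnostic helper; `path` defaults to the current PATH.
    """
    path = os.environ.get("PATH", "") if path is None else path
    entries = [p for p in path.split(os.pathsep) if p]
    # shutil.which returns None when no matching executable is found.
    return shutil.which(name, path=path), entries
```

For instance, {{describe_lookup("kaleido")}} returns the resolved location (or {{None}}) together with the directories that were considered, which makes it easy to see whether the environment's {{bin}} folder is missing.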