Posted to issues@spark.apache.org by "Willi Raschkowski (Jira)" <ji...@apache.org> on 2022/07/02 02:54:00 UTC
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH

    [ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] 

Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM:
-------------------------------------------------------------------

In our fork, we solve this by appending the Python executable's directory to the subprocess's PATH, with something like
{code:scala}
  import java.io.File
  import java.nio.file.Path

  /**
   * Append the directory to the subprocess' PATH environment variable.
   *
   * This allows the Python subprocess to find additional executables when the environment
   * containing those executables was added at runtime (e.g. via sc.addArchive()).
   */
  def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = {
    // If PATH is unset (null), the new value is just `dir`; otherwise `dir` is appended.
    processBuilder.environment().compute("PATH", (_, oldPath) =>
      Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString))
  }
{code}
and calling it on the subprocess's builder:
{code:scala}
PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder)
{code}
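
For comparison, the same append-or-create logic can be sketched in Python, for example as a stopgap inside the job script itself. This is only an illustration and not part of Spark: {{append_dir_to_path}} is a hypothetical helper name, and locating the environment's bin folder through {{sys.executable}} assumes the interpreter lives directly inside that folder.

```python
import os
import sys


def append_dir_to_path(directory, env):
    """Append `directory` to env["PATH"], creating PATH if it is missing.

    Hypothetical helper mirroring the Scala snippet above; not a Spark API.
    """
    old_path = env.get("PATH")
    directory = str(directory)
    if old_path is None:
        env["PATH"] = directory
    else:
        env["PATH"] = old_path + os.pathsep + directory
    return env


# Assumption: the interpreter is ./environment/bin/python (or similar), so its
# parent directory also holds the conda-installed executables.
bin_dir = os.path.dirname(os.path.abspath(sys.executable))

# Work on a copy so the current process environment is left untouched; the
# result could be passed as `env=` to subprocess.run().
child_env = append_dir_to_path(bin_dir, dict(os.environ))
```

Appending rather than prepending keeps any executables already on the PATH ahead of the environment's own, which matches the behavior of the Scala helper.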


> Add environment bin folder to R/Python subprocess PATH
> ------------------------------------------------------
>
>                 Key: SPARK-39659
>                 URL: https://issues.apache.org/jira/browse/SPARK-39659
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Willi Raschkowski
>            Priority: Major
>
> Some Python packages rely on non-Python executables which are usually made available on the {{PATH}} through something like {{conda activate}}.
> When using Spark with conda-pack environments added via {{spark.archives}}, Python packages aren't able to find conda-installed executables because Spark doesn't update {{PATH}}.
> For example, submitting
> {code:python|title=test.py}
> import plotly.express as px
>
> # This only works if kaleido-python can find the conda-installed executable
> fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species")
> fig.write_image("figure.png", engine="kaleido")
> {code}
> with
> {code:bash}
> ./bin/spark-submit --master yarn --deploy-mode cluster \
>   --archives environment.tar.gz#environment \
>   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
>   test.py
> {code}
> fails with
> {code:java}
> Traceback (most recent call last):
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/kaleido-test.py", line 7, in <module>
>     fig.write_image("figure.png", engine="kaleido")
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3829, in write_image
>     return pio.write_image(self, *args, **kwargs)
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 267, in write_image
>     img_data = to_image(
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 144, in to_image
>     img_bytes = scope.transform(
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", line 153, in transform
>     response = self._perform_transform(
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 293, in _perform_transform
>     self._ensure_kaleido()
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 176, in _ensure_kaleido
>     proc_args = self._build_proc_args()
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 123, in _build_proc_args
>     proc_args = [self.executable_path(), self.scope_name]
>   File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 99, in executable_path
>     raise ValueError(
> ValueError: 
> The kaleido executable is required by the kaleido Python library, but it was not included
> in the Python package and it could not be found on the system PATH.
> Searched for included kaleido executable at:
>     /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_000001/environment/lib/python3.10/site-packages/kaleido/executable/kaleido 
> Searched for executable 'kaleido' on the following system PATH:
>     /usr/local/sbin
>     /usr/local/bin
>     /usr/sbin
>     /usr/bin
>     /sbin
>     /bin
> {code}
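
The failure above comes down to a PATH lookup, so one way to confirm the diagnosis from inside a job is to compare the lookup with and without the environment's bin folder. The following is a minimal sketch under assumptions: {{find_executable}} is a hypothetical helper, and the bin folder is assumed to be the directory containing the running interpreter.

```python
import os
import shutil
import sys


def find_executable(name, extra_dirs=()):
    """Return the full path of `name`, searching PATH plus `extra_dirs`, or None."""
    search_path = os.environ.get("PATH", os.defpath)
    for directory in extra_dirs:
        search_path = search_path + os.pathsep + str(directory)
    return shutil.which(name, path=search_path)


# Assumption: conda-installed executables sit next to the interpreter.
env_bin = os.path.dirname(os.path.abspath(sys.executable))

print("on PATH only:", find_executable("kaleido"))
print("with env bin:", find_executable("kaleido", [env_bin]))
```

If the first lookup prints None and the second prints a path, the executable is present in the unpacked archive and only the PATH is missing it.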


