Posted to issues@spark.apache.org by "Ohad Raviv (Jira)" <ji...@apache.org> on 2022/12/13 12:12:00 UTC

[jira] [Created] (SPARK-41510) Support easy way for user defined PYTHONPATH in workers

Ohad Raviv created SPARK-41510:
----------------------------------

             Summary: Support easy way for user defined PYTHONPATH in workers
                 Key: SPARK-41510
                 URL: https://issues.apache.org/jira/browse/SPARK-41510
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.1
            Reporter: Ohad Raviv


When working interactively with Spark through notebooks in various environments (Databricks, YARN), I often go through a very frustrating process when trying to add new Python modules, or even change their code, without starting a new Spark session/cluster.

On the driver side it is easy to add things like `sys.path.append()`, but if, for example, UDF code imports a function from a local module, then pickling across the driver/worker boundary assumes that the module exists on the workers, and I then fail with "python module does not exist..".
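
To illustrate the failure mode without a cluster, here is a minimal sketch that uses only the standard `pickle` module. It assumes a throwaway module named `my_local_helpers` (a hypothetical name) and simulates the "worker" by removing the module's directory from `sys.path` in the same process. Spark itself serializes UDFs with cloudpickle, which, as far as I know, also pickles functions from importable modules by reference, so the behaviour should be analogous.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Create a throwaway module (hypothetical name) in a temporary directory.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_local_helpers.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# "Driver" side: make the module importable and pickle one of its functions.
sys.path.append(module_dir)
importlib.invalidate_caches()
helpers = importlib.import_module("my_local_helpers")
payload = pickle.dumps(helpers.double)

# The payload holds a reference (module name + function name), not the code.
references_module = b"my_local_helpers" in payload

# Simulated "worker" side: the directory is not on sys.path there.
sys.path.remove(module_dir)
del sys.modules["my_local_helpers"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    worker_error = None
except ModuleNotFoundError as exc:
    # Unpickling tries to import the module by name and fails.
    worker_error = exc

print(references_module, type(worker_error).__name__)
```

The function body never travels with the pickle, which is why appending to `sys.path` on the driver alone is not enough.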

Adding NFS volumes to the workers' PYTHONPATH could solve it, but that requires restarting the session/cluster and, worse, doesn't work in all environments, because the PYTHONPATH gets overridden by someone (Databricks/Spark) along the way. A few ugly workarounds have been suggested, such as running a "dummy" UDF on the workers to add the folder to sys.path.
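
For reference, the "dummy" job workaround might look roughly like the sketch below. The helper names (`make_path_adder`, `register_path_on_workers`), the example path, and the task count are all hypothetical. It is best-effort only: nothing guarantees that every executor receives a task, and Python workers started later will not have the path.

```python
import sys


def make_path_adder(path):
    """Return a function that appends `path` to sys.path in the calling process."""
    def _add(_):
        # Idempotent: repeated tasks on the same worker add the path only once.
        if path not in sys.path:
            sys.path.append(path)
        return 1
    return _add


def register_path_on_workers(spark, path, num_tasks=100):
    """Run a throwaway job so that workers receiving a task append `path`.

    `spark` is an existing SparkSession. Returns the number of tasks that ran.
    """
    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return rdd.map(make_path_adder(path)).sum()


# Local check of the helper itself (no Spark needed); the path is hypothetical.
adder = make_path_adder("/mnt/shared/my_modules")
adder(0)
adder(1)
print(sys.path.count("/mnt/shared/my_modules"))
```

In a notebook one would call something like `register_path_on_workers(spark, "/mnt/shared/my_modules")` before running the UDF, which is exactly the kind of extra step a dedicated conf would make unnecessary.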

I think all of that could easily be solved if we add a Spark conf whose value is added to the worker PYTHONPATH, here:

[https://github.com/apache/spark/blob/0e2d604fd33c8236cfa8ae243eeaec42d3176a06/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L94]

Please tell me what you think, and I will make the PR.

Thanks.
