Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/05/24 04:34:54 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request, #41292: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect

HyukjinKwon opened a new pull request, #41292:
URL: https://github.com/apache/spark/pull/41292

   ### What changes were proposed in this pull request?
   
   This PR proposes to add the support of archive (`.zip`, `.jar`, `.tar.gz`, `.tgz`, or `.tar` files) in `SparkSession.addArtifacts` so we can support Python dependency management in Python Spark Connect.
   
   ### Why are the changes needed?
   
   To allow end users to add dependencies and archive files from the Python Spark Connect client.
   
   This PR enables the Python dependency management use case (https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html) in Spark Connect.
   
   See below for how to do this with the Spark Connect Python client:
   
   #### Precondition
   
   Assume that we have a Spark Connect server already running, e.g., by:
   
   ```bash
   ./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master "local-cluster[2,2,1024]"
   ```
   
   and assume that you already have a dev env:
   
   ```bash
   # Note that `conda-pack` must be installed.
   conda create -y -n pyspark_conda_env -c conda-forge conda-pack python=3.9
   conda activate pyspark_conda_env
   pip install --upgrade -r dev/requirements.txt
   ```
   
   #### Dependency management
   
   ```bash
   ./bin/pyspark --remote "sc://localhost:15002"
   ```
   
   ```python
   import conda_pack
   import os
   # Pack the current environment ('pyspark_conda_env') to 'pyspark_conda_env.tar.gz'. 
   # Or you can run 'conda pack' in your shell.
   conda_pack.pack()  
   spark.addArtifact(f"{os.environ.get('CONDA_DEFAULT_ENV')}.tar.gz#environment", archive=True)
   spark.conf.set("spark.sql.execution.pyspark.python", "environment/bin/python")
   # From now on, Python workers on executors use the `pyspark_conda_env` Conda environment.
   ```
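   
   The `#environment` fragment gives the directory name the archive is unpacked into on each executor, which is why the interpreter path set above is relative. As a quick sanity check that the configuration took effect:
   
   ```python
   # The value should be the relative path into the unpacked archive directory.
   print(spark.conf.get("spark.sql.execution.pyspark.python"))
   # environment/bin/python
   ```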
   
   Then run your Python UDFs:
   
   ```python
   import pandas as pd
   from pyspark.sql.functions import pandas_udf
   
   @pandas_udf("long")
   def plug_one(s: pd.Series) -> pd.Series:
       return s + 1
   
   spark.range(10).select(plug_one("id")).show()
   ```
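   
   If the environment is picked up correctly, the output should look roughly like this:
   
   ```
   +------------+
   |plug_one(id)|
   +------------+
   |           1|
   |           2|
   |           3|
   |           4|
   |           5|
   |           6|
   |           7|
   |           8|
   |           9|
   |          10|
   +------------+
   ```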
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds the support of archive (`.zip`, `.jar`, `.tar.gz`, `.tgz`, or `.tar` files) in `SparkSession.addArtifacts`.
   
   ### How was this patch tested?
   
   Manually tested as described above, and added a unit test.
   
   Also manually verified in `local-cluster` mode via the code below:
   
   ```python
   import sys
   from pyspark.sql.functions import udf
   
   spark.range(1).select(udf(lambda x: sys.executable)("id")).show(truncate=False)
   ```
   ```
   +----------------------------------------------------------------+
   |<lambda>(id)                                                    |
   +----------------------------------------------------------------+
   |/.../spark/work/app-20230524132024-0000/1/environment/bin/python|
   +----------------------------------------------------------------+
   ```
   



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41292: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41292:
URL: https://github.com/apache/spark/pull/41292#discussion_r1203417125


##########
python/pyspark/sql/connect/client/artifact.py:
##########
@@ -154,29 +163,52 @@ def _parse_artifacts(self, path_or_uri: str, pyfile: bool) -> List[Artifact]:
                 sys.path.insert(1, local_path)
                 artifact = new_pyfile_artifact(name, LocalFile(local_path))
                 importlib.invalidate_caches()
+            elif archive and (
+                name.endswith(".zip")
+                or name.endswith(".jar")
+                or name.endswith(".tar.gz")
+                or name.endswith(".tgz")
+                or name.endswith(".tar")
+            ):
+                assert any(name.endswith(s) for s in (".zip", ".jar", ".tar.gz", ".tgz", ".tar"))
+
+                if parsed.fragment != "":
+                    # Minimal fix for the workaround of fragment handling in URI.
+                    # This has a limitation - hash(#) in the file name would not work.
+                    if "#" in local_path:
+                        raise ValueError("'#' in the path is not supported for adding an archive.")
+                    name = f"{name}#{parsed.fragment}"

Review Comment:
   This is actually a pretty ugly workaround to support fragments in URIs (but I believe it is the minimal change). Maybe we should pass a URI instead of a file path in `Artifact`s(?), but I would like to avoid touching the whole implementation in this PR. cc @hvanhovell @vicennial
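   
   For reference, a minimal sketch of the fragment handling under discussion (standalone and illustrative, not the actual `artifact.py` code):
   
   ```python
   from urllib.parse import urlparse
   
   # e.g. the user passes "pyspark_conda_env.tar.gz#environment"
   parsed = urlparse("pyspark_conda_env.tar.gz#environment")
   local_path = parsed.path  # "pyspark_conda_env.tar.gz"
   if parsed.fragment != "":
       if "#" in local_path:
           # A literal '#' in the file name cannot be told apart from a fragment.
           raise ValueError("'#' in the path is not supported for adding an archive.")
       # Re-attach the fragment so the server knows which directory to unpack into.
       name = f"{local_path}#{parsed.fragment}"
   ```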




[GitHub] [spark] HyukjinKwon closed pull request #41292: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41292: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect
URL: https://github.com/apache/spark/pull/41292



[GitHub] [spark] HyukjinKwon commented on pull request #41292: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41292:
URL: https://github.com/apache/spark/pull/41292#issuecomment-1562708894

   Merged to master.

