Posted to commits@spark.apache.org by gu...@apache.org on 2020/02/17 02:08:24 UTC
[spark] branch branch-3.0 updated: [SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new fb2e749 [SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
fb2e749 is described below
commit fb2e7496006088bd6b98e9776ee51cedad1dfd6b
Author: Bryan Cutler <cu...@gmail.com>
AuthorDate: Mon Feb 17 11:06:51 2020 +0900
[SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
### What changes were proposed in this pull request?
Add doc for recommended pandas and pyarrow versions.
### Why are the changes needed?
The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
NA
Closes #27587 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834-3.0.
Lead-authored-by: Bryan Cutler <cu...@gmail.com>
Co-authored-by: HyukjinKwon <gu...@apache.org>
Signed-off-by: HyukjinKwon <gu...@apache.org>
(cherry picked from commit be3cb71e9cb34ad9054325c3122745e66e6f1ede)
Signed-off-by: HyukjinKwon <gu...@apache.org>
---
docs/sql-pyspark-pandas-with-arrow.md | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index 92a5157..63ba0ba 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -33,9 +33,11 @@ working with Arrow-enabled data.
### Ensure PyArrow Installed
+To use Apache Arrow in PySpark, [the recommended version of PyArrow](#recommended-pandas-and-pyarrow-versions)
+should be installed.
If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
-is installed and available on all cluster nodes. The current supported version is 0.15.1+.
+is installed and available on all cluster nodes.
You can install using pip or conda from the conda-forge channel. See PyArrow
[installation](https://arrow.apache.org/docs/python/install.html) for details.
@@ -338,6 +340,12 @@ different than a Pandas timestamp. It is recommended to use Pandas time series f
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
+### Recommended Pandas and PyArrow Versions
+
+For usage with pyspark.sql, the supported versions are Pandas 0.24.2 and PyArrow 0.15.1. Higher
+versions may be used; however, compatibility and data correctness cannot be guaranteed and should
+be verified by the user.
+
### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be
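The versions named in the new "Recommended Pandas and PyArrow Versions" section above are minimums tested by Spark CI; higher versions are allowed but unverified. A minimal sketch (not part of the commit; the helper names `parse_version` and `meets_minimum` are illustrative) of how a user might check an installed version string against those recommendations:

```python
# Illustrative sketch, not from the commit: compare a package version string
# against the recommended minimums from the doc (pandas 0.24.2, PyArrow 0.15.1).

def parse_version(v):
    # Split a dotted version string like "0.24.2" into an integer tuple
    # (0, 24, 2) so versions compare numerically, not lexically.
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed, minimum):
    # True when the installed version is at least the recommended minimum.
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("0.24.2", "0.24.2"))  # True: exact recommended pandas version
print(meets_minimum("1.0.1", "0.15.1"))   # True: higher than recommended, but untested by Spark CI
print(meets_minimum("0.14.0", "0.15.1"))  # False: below the recommended PyArrow version
```

In practice the installed versions would come from `pandas.__version__` and `pyarrow.__version__`; this integer-tuple comparison is a simplification and does not handle pre-release suffixes.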