Posted to commits@spark.apache.org by gu...@apache.org on 2020/02/17 02:08:24 UTC
[spark] branch branch-3.0 updated: [SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new fb2e749 [SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
fb2e749 is described below
commit fb2e7496006088bd6b98e9776ee51cedad1dfd6b
Author: Bryan Cutler <cu...@gmail.com>
AuthorDate: Mon Feb 17 11:06:51 2020 +0900
[SPARK-30834][DOCS][PYTHON] Add note for recommended pandas and pyarrow versions
### What changes were proposed in this pull request?
Add doc for recommended pandas and pyarrow versions.
### Why are the changes needed?
The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
NA
Closes #27587 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834-3.0.
Lead-authored-by: Bryan Cutler <cu...@gmail.com>
Co-authored-by: HyukjinKwon <gu...@apache.org>
Signed-off-by: HyukjinKwon <gu...@apache.org>
(cherry picked from commit be3cb71e9cb34ad9054325c3122745e66e6f1ede)
Signed-off-by: HyukjinKwon <gu...@apache.org>
---
docs/sql-pyspark-pandas-with-arrow.md | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index 92a5157..63ba0ba 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -33,9 +33,11 @@ working with Arrow-enabled data.
### Ensure PyArrow Installed
+To use Apache Arrow in PySpark, [the recommended version of PyArrow](#recommended-pandas-and-pyarrow-versions)
+should be installed.
If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
-is installed and available on all cluster nodes. The current supported version is 0.15.1+.
+is installed and available on all cluster nodes.
You can install using pip or conda from the conda-forge channel. See PyArrow
[installation](https://arrow.apache.org/docs/python/install.html) for details.
@@ -338,6 +340,12 @@ different than a Pandas timestamp. It is recommended to use Pandas time series f
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
+### Recommended Pandas and PyArrow Versions
+
+For usage with pyspark.sql, the supported versions are Pandas 0.24.2 and PyArrow 0.15.1. Higher
+versions may be used; however, compatibility and data correctness cannot be guaranteed and should
+be verified by the user.
+
### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be
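The versions named in the new "Recommended Pandas and PyArrow Versions" section above are minimums tested by Spark CI; higher versions are allowed but unverified. A minimal sketch (not part of the commit; the helper names `parse_version` and `meets_minimum` are illustrative) of how a user might check an installed version string against those recommendations:

```python
# Illustrative sketch, not from the commit: compare a package version string
# against the recommended minimums from the doc (pandas 0.24.2, PyArrow 0.15.1).

def parse_version(v):
    # Split a dotted version string like "0.24.2" into an integer tuple
    # (0, 24, 2) so versions compare numerically, not lexically.
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed, minimum):
    # True when the installed version is at least the recommended minimum.
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("0.24.2", "0.24.2"))  # True: exact recommended pandas version
print(meets_minimum("1.0.1", "0.15.1"))   # True: higher than recommended, but untested by Spark CI
print(meets_minimum("0.14.0", "0.15.1"))  # False: below the recommended PyArrow version
```

In practice the installed versions would come from `pandas.__version__` and `pyarrow.__version__`; this integer-tuple comparison is a simplification and does not handle pre-release suffixes.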