Posted to commits@spark.apache.org by gu...@apache.org on 2021/11/22 23:37:04 UTC

[spark] branch master updated: [SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new bc7d55f  [SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
bc7d55f is described below

commit bc7d55fc1046a55df61fdb380629699e9959fcc6
Author: Xinrong Meng <xi...@databricks.com>
AuthorDate: Tue Nov 23 08:35:28 2021 +0900

    [SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
    
    ### What changes were proposed in this pull request?
    This PR proposes to:
    
    - Undeprecate (Spark)DataFrame.to_koalas
    
    - Deprecate (Spark)DataFrame.to_pandas_on_spark and introduce (Spark)DataFrame.pandas_api instead.
    
    ### Why are the changes needed?
    Currently, (Spark)DataFrame.to_pandas_on_spark is too long to remember and inconvenient to call.
    Renaming it to the shorter pandas_api improves the user experience and makes the API more developer-friendly.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    (Spark)DataFrame.pandas_api is introduced.
    (Spark)DataFrame.to_pandas_on_spark is deprecated.
    (Spark)DataFrame.to_koalas is undeprecated.
    
    For example:
    ```py
    >>> sdf = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
    
    >>> sdf.pandas_api()
       age   name
    0    1  Alice
    
    >>> sdf.to_pandas_on_spark()
    /Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py:3207: FutureWarning: DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.
      FutureWarning,
       age   name
    0    1  Alice
    
    >>> sdf.to_koalas()
       age   name
    0    1  Alice
    ```
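    
    For reference, pandas_api also accepts an index_col parameter, as the doc changes in the diff below show. A minimal sketch (the 'index' column and the sdf2 name are illustrative, not taken from this patch):
    
    ```py
    >>> sdf2 = spark.createDataFrame([{'index': 0, 'name': 'Alice', 'age': 1}])
    >>> sdf2.pandas_api(index_col='index')
           age   name
    index
    0        1  Alice
    ```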
    
    ### How was this patch tested?
    Existing tests.
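    
    For illustration, the new deprecation path can also be observed with the standard warnings module (a sketch for readers, not a test added by this patch):
    
    ```py
    import warnings
    
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        sdf.to_pandas_on_spark()  # now delegates to pandas_api and emits FutureWarning
    assert any(issubclass(w.category, FutureWarning) for w in caught)
    ```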
    
    Closes #34608 from xinrong-databricks/conversion.
    
    Authored-by: Xinrong Meng <xi...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 .../docs/source/getting_started/quickstart_ps.ipynb  |  4 ++--
 .../source/migration_guide/koalas_to_pyspark.rst     |  5 ++++-
 python/docs/source/reference/pyspark.sql.rst         |  2 +-
 .../user_guide/pandas_on_spark/pandas_pyspark.rst    |  4 ++--
 .../docs/source/user_guide/pandas_on_spark/types.rst |  2 +-
 python/pyspark/pandas/spark/accessors.py             | 10 +++++-----
 python/pyspark/sql/dataframe.py                      | 20 +++++++++++++-------
 python/pyspark/sql/tests/test_dataframe.py           |  6 +++---
 8 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/python/docs/source/getting_started/quickstart_ps.ipynb b/python/docs/source/getting_started/quickstart_ps.ipynb
index 74d6724..87796ae 100644
--- a/python/docs/source/getting_started/quickstart_ps.ipynb
+++ b/python/docs/source/getting_started/quickstart_ps.ipynb
@@ -539,7 +539,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "psdf = sdf.to_pandas_on_spark()"
+    "psdf = sdf.pandas_api()"
    ]
   },
   {
@@ -14486,4 +14486,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 1
-}
+}
\ No newline at end of file
diff --git a/python/docs/source/migration_guide/koalas_to_pyspark.rst b/python/docs/source/migration_guide/koalas_to_pyspark.rst
index 9102d1d..24e2d95 100644
--- a/python/docs/source/migration_guide/koalas_to_pyspark.rst
+++ b/python/docs/source/migration_guide/koalas_to_pyspark.rst
@@ -30,7 +30,10 @@ Migrating from Koalas to pandas API on Spark
 * ``DataFrame.koalas`` in Koalas DataFrame was renamed to ``DataFrame.pandas_on_spark`` in pandas-on-Spark DataFrame. ``DataFrame.koalas`` was kept for compatibility reason but deprecated as of Spark 3.2.
   ``DataFrame.koalas`` will be removed in the future releases.
 
-* Monkey-patched ``DataFrame.to_koalas`` in PySpark DataFrame was renamed to ``DataFrame.to_pandas_on_spark`` in PySpark DataFrame. ``DataFrame.to_koalas`` was kept for compatibility reason but deprecated as of Spark 3.2.
+* Monkey-patched ``DataFrame.to_koalas`` in PySpark DataFrame was renamed to ``DataFrame.pandas_api`` in PySpark DataFrame. ``DataFrame.to_koalas`` was kept for compatibility reason.
   ``DataFrame.to_koalas`` will be removed in the future releases.
 
+* Monkey-patched ``DataFrame.to_pandas_on_spark`` in PySpark DataFrame was renamed to ``DataFrame.pandas_api`` in PySpark DataFrame. ``DataFrame.to_pandas_on_spark`` was kept for compatibility reason but deprecated as of Spark 3.3.
+  ``DataFrame.to_pandas_on_spark`` will be removed in the future releases.
+
 * ``databricks.koalas.__version__`` was removed. ``pyspark.__version__`` should be used instead.
diff --git a/python/docs/source/reference/pyspark.sql.rst b/python/docs/source/reference/pyspark.sql.rst
index 5b77da5..818814c 100644
--- a/python/docs/source/reference/pyspark.sql.rst
+++ b/python/docs/source/reference/pyspark.sql.rst
@@ -223,7 +223,7 @@ DataFrame APIs
     DataFrame.write
     DataFrame.writeStream
     DataFrame.writeTo
-    DataFrame.to_pandas_on_spark
+    DataFrame.pandas_api
     DataFrameNaFunctions.drop
     DataFrameNaFunctions.fill
     DataFrameNaFunctions.replace
diff --git a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
index f4fc0da..04d6617 100644
--- a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
@@ -107,7 +107,7 @@ Spark DataFrame can be a pandas-on-Spark DataFrame easily as below:
 
 .. code-block:: python
 
-   >>> sdf.to_pandas_on_spark()
+   >>> sdf.pandas_api()
       id
    0   6
    1   7
@@ -127,7 +127,7 @@ to use as an index when possible.
    >>> # Call Spark APIs
    ... sdf = sdf.filter("id > 5")
    >>> # Uses the explicit index to avoid to create default index.
-   ... sdf.to_pandas_on_spark(index_col='index')
+   ... sdf.pandas_api(index_col='index')
           id
    index
    6       6
diff --git a/python/docs/source/user_guide/pandas_on_spark/types.rst b/python/docs/source/user_guide/pandas_on_spark/types.rst
index 831967a..8e04efc 100644
--- a/python/docs/source/user_guide/pandas_on_spark/types.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/types.rst
@@ -44,7 +44,7 @@ The example below shows how data types are casted from PySpark DataFrame to pand
     DataFrame[tinyint: tinyint, decimal: decimal(10,0), float: float, double: double, integer: int, long: bigint, short: smallint, timestamp: timestamp, string: string, boolean: boolean, date: date]
 
     # 3. Convert PySpark DataFrame to pandas-on-Spark DataFrame
-    >>> psdf = sdf.to_pandas_on_spark()
+    >>> psdf = sdf.pandas_api()
 
     # 4. Check the pandas-on-Spark data types
     >>> psdf.dtypes
diff --git a/python/pyspark/pandas/spark/accessors.py b/python/pyspark/pandas/spark/accessors.py
index 0e91f4e..e0d4639 100644
--- a/python/pyspark/pandas/spark/accessors.py
+++ b/python/pyspark/pandas/spark/accessors.py
@@ -396,7 +396,7 @@ class SparkFrameMethods(object):
         See Also
         --------
         DataFrame.to_spark
-        DataFrame.to_pandas_on_spark
+        DataFrame.pandas_api
         DataFrame.spark.frame
 
         Examples
@@ -440,7 +440,7 @@ class SparkFrameMethods(object):
 
         >>> spark_df = df.to_spark(index_col="index")
         >>> spark_df = spark_df.filter("a == 2")
-        >>> spark_df.to_pandas_on_spark(index_col="index")  # doctest: +NORMALIZE_WHITESPACE
+        >>> spark_df.pandas_api(index_col="index")  # doctest: +NORMALIZE_WHITESPACE
                a  b  c
         index
         1      2  5  8
@@ -460,7 +460,7 @@ class SparkFrameMethods(object):
 
         Likewise, can be converted to back to pandas-on-Spark DataFrame.
 
-        >>> new_spark_df.to_pandas_on_spark(
+        >>> new_spark_df.pandas_api(
         ...     index_col=["index_1", "index_2"])  # doctest: +NORMALIZE_WHITESPACE
                          b  c
         index_1 index_2
@@ -893,7 +893,7 @@ class SparkFrameMethods(object):
             expensive in general.
 
         .. note:: it will lose column labels. This is a synonym of
-            ``func(psdf.to_spark(index_col)).to_pandas_on_spark(index_col)``.
+            ``func(psdf.to_spark(index_col)).pandas_api(index_col)``.
 
         Parameters
         ----------
@@ -941,7 +941,7 @@ class SparkFrameMethods(object):
                 "The output of the function [%s] should be of a "
                 "pyspark.sql.DataFrame; however, got [%s]." % (func, type(output))
             )
-        return output.to_pandas_on_spark(index_col)
+        return output.pandas_api(index_col)
 
     def repartition(self, num_partitions: int) -> "ps.DataFrame":
         """
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ac1cbf9..337cad5 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3198,9 +3198,19 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         """
         return DataFrameWriterV2(self, table)
 
+    # Keep to_pandas_on_spark for backward compatibility for now.
     def to_pandas_on_spark(
         self, index_col: Optional[Union[str, List[str]]] = None
     ) -> "PandasOnSparkDataFrame":
+        warnings.warn(
+            "DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.",
+            FutureWarning,
+        )
+        return self.pandas_api(index_col)
+
+    def pandas_api(
+        self, index_col: Optional[Union[str, List[str]]] = None
+    ) -> "PandasOnSparkDataFrame":
         """
         Converts the existing DataFrame into a pandas-on-Spark DataFrame.
 
@@ -3230,7 +3240,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         |   c|   3|
         +----+----+
 
-        >>> df.to_pandas_on_spark()  # doctest: +SKIP
+        >>> df.pandas_api()  # doctest: +SKIP
           Col1  Col2
         0    a     1
         1    b     2
@@ -3238,7 +3248,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 
         We can specify the index columns.
 
-        >>> df.to_pandas_on_spark(index_col="Col1"): # doctest: +SKIP
+        >>> df.pandas_api(index_col="Col1")  # doctest: +SKIP
               Col2
         Col1
         a        1
@@ -3261,11 +3271,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
     def to_koalas(
         self, index_col: Optional[Union[str, List[str]]] = None
     ) -> "PandasOnSparkDataFrame":
-        warnings.warn(
-            "DataFrame.to_koalas is deprecated. Use DataFrame.to_pandas_on_spark instead.",
-            FutureWarning,
-        )
-        return self.to_pandas_on_spark(index_col)
+        return self.pandas_api(index_col)
 
 
 def _to_scala_map(sc: SparkContext, jm: Dict) -> JavaObject:
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index 75301ed..3cafd2c 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -1106,13 +1106,13 @@ class DataFrameTests(ReusedSQLTestCase):
         not have_pandas or not have_pyarrow,
         cast(str, pandas_requirement_message or pyarrow_requirement_message),
     )
-    def test_to_pandas_on_spark(self):
+    def test_pandas_api(self):
         import pandas as pd
         from pandas.testing import assert_frame_equal
 
         sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
-        psdf_from_sdf = sdf.to_pandas_on_spark()
-        psdf_from_sdf_with_index = sdf.to_pandas_on_spark(index_col="Col1")
+        psdf_from_sdf = sdf.pandas_api()
+        psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1")
         pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]})
         pdf_with_index = pdf.set_index("Col1")
 
