Posted to commits@spark.apache.org by gu...@apache.org on 2021/11/22 23:37:04 UTC
[spark] branch master updated: [SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new bc7d55f [SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
bc7d55f is described below
commit bc7d55fc1046a55df61fdb380629699e9959fcc6
Author: Xinrong Meng <xi...@databricks.com>
AuthorDate: Tue Nov 23 08:35:28 2021 +0900
[SPARK-37337][PYTHON] Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
### What changes were proposed in this pull request?
The PR is proposed to:
- Undeprecate (Spark)DataFrame.to_koalas
- Deprecate (Spark)DataFrame.to_pandas_on_spark and introduce (Spark)DataFrame.pandas_api instead.
### Why are the changes needed?
Currently, (Spark)DataFrame.to_pandas_on_spark is long, hard to remember, and inconvenient to call.
Renaming it to (Spark)DataFrame.pandas_api improves the user experience and makes the API more developer-friendly.
### Does this PR introduce _any_ user-facing change?
Yes.
(Spark)DataFrame.pandas_api is introduced.
(Spark)DataFrame.to_pandas_on_spark is deprecated.
(Spark)DataFrame.to_koalas is undeprecated.
For example:
```py
>>> sdf = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
>>> sdf.pandas_api()
   age   name
0    1  Alice
>>> sdf.to_pandas_on_spark()
/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py:3207: FutureWarning: DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.
FutureWarning,
   age   name
0    1  Alice
>>> sdf.to_koalas()
   age   name
0    1  Alice
```
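The `index_col` parameter is carried over to the new name as well (see the docs and docstring changes below). A minimal sketch of the round trip with an explicit index, assuming the same `sdf` as above; the output layout is illustrative rather than captured from a run:
```py
>>> # Sketch: reuses `sdf` from the example above.
>>> psdf = sdf.pandas_api(index_col="name")   # use an existing column as the index
>>> psdf
       age
name
Alice    1
>>> psdf.to_spark(index_col="name").show()    # convert back, keeping the index column
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
```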
### How was this patch tested?
Existing tests.
Closes #34608 from xinrong-databricks/conversion.
Authored-by: Xinrong Meng <xi...@databricks.com>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
.../docs/source/getting_started/quickstart_ps.ipynb | 4 ++--
.../source/migration_guide/koalas_to_pyspark.rst | 5 ++++-
python/docs/source/reference/pyspark.sql.rst | 2 +-
.../user_guide/pandas_on_spark/pandas_pyspark.rst | 4 ++--
.../docs/source/user_guide/pandas_on_spark/types.rst | 2 +-
python/pyspark/pandas/spark/accessors.py | 10 +++++-----
python/pyspark/sql/dataframe.py | 20 +++++++++++++-------
python/pyspark/sql/tests/test_dataframe.py | 6 +++---
8 files changed, 31 insertions(+), 22 deletions(-)
diff --git a/python/docs/source/getting_started/quickstart_ps.ipynb b/python/docs/source/getting_started/quickstart_ps.ipynb
index 74d6724..87796ae 100644
--- a/python/docs/source/getting_started/quickstart_ps.ipynb
+++ b/python/docs/source/getting_started/quickstart_ps.ipynb
@@ -539,7 +539,7 @@
"metadata": {},
"outputs": [],
"source": [
- "psdf = sdf.to_pandas_on_spark()"
+ "psdf = sdf.pandas_api()"
]
},
{
@@ -14486,4 +14486,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
-}
+}
\ No newline at end of file
diff --git a/python/docs/source/migration_guide/koalas_to_pyspark.rst b/python/docs/source/migration_guide/koalas_to_pyspark.rst
index 9102d1d..24e2d95 100644
--- a/python/docs/source/migration_guide/koalas_to_pyspark.rst
+++ b/python/docs/source/migration_guide/koalas_to_pyspark.rst
@@ -30,7 +30,10 @@ Migrating from Koalas to pandas API on Spark
* ``DataFrame.koalas`` in Koalas DataFrame was renamed to ``DataFrame.pandas_on_spark`` in pandas-on-Spark DataFrame. ``DataFrame.koalas`` was kept for compatibility reason but deprecated as of Spark 3.2.
``DataFrame.koalas`` will be removed in the future releases.
-* Monkey-patched ``DataFrame.to_koalas`` in PySpark DataFrame was renamed to ``DataFrame.to_pandas_on_spark`` in PySpark DataFrame. ``DataFrame.to_koalas`` was kept for compatibility reason but deprecated as of Spark 3.2.
+* Monkey-patched ``DataFrame.to_koalas`` in PySpark DataFrame was renamed to ``DataFrame.pandas_api`` in PySpark DataFrame. ``DataFrame.to_koalas`` was kept for compatibility reason.
``DataFrame.to_koalas`` will be removed in the future releases.
+* Monkey-patched ``DataFrame.to_pandas_on_spark`` in PySpark DataFrame was renamed to ``DataFrame.pandas_api`` in PySpark DataFrame. ``DataFrame.to_pandas_on_spark`` was kept for compatibility reason but deprecated as of Spark 3.3.
+ ``DataFrame.to_pandas_on_spark`` will be removed in the future releases.
+
* ``databricks.koalas.__version__`` was removed. ``pyspark.__version__`` should be used instead.
diff --git a/python/docs/source/reference/pyspark.sql.rst b/python/docs/source/reference/pyspark.sql.rst
index 5b77da5..818814c 100644
--- a/python/docs/source/reference/pyspark.sql.rst
+++ b/python/docs/source/reference/pyspark.sql.rst
@@ -223,7 +223,7 @@ DataFrame APIs
DataFrame.write
DataFrame.writeStream
DataFrame.writeTo
- DataFrame.to_pandas_on_spark
+ DataFrame.pandas_api
DataFrameNaFunctions.drop
DataFrameNaFunctions.fill
DataFrameNaFunctions.replace
diff --git a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
index f4fc0da..04d6617 100644
--- a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst
@@ -107,7 +107,7 @@ Spark DataFrame can be a pandas-on-Spark DataFrame easily as below:
.. code-block:: python
- >>> sdf.to_pandas_on_spark()
+ >>> sdf.pandas_api()
id
0 6
1 7
@@ -127,7 +127,7 @@ to use as an index when possible.
>>> # Call Spark APIs
... sdf = sdf.filter("id > 5")
>>> # Uses the explicit index to avoid to create default index.
- ... sdf.to_pandas_on_spark(index_col='index')
+ ... sdf.pandas_api(index_col='index')
id
index
6 6
diff --git a/python/docs/source/user_guide/pandas_on_spark/types.rst b/python/docs/source/user_guide/pandas_on_spark/types.rst
index 831967a..8e04efc 100644
--- a/python/docs/source/user_guide/pandas_on_spark/types.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/types.rst
@@ -44,7 +44,7 @@ The example below shows how data types are casted from PySpark DataFrame to pand
DataFrame[tinyint: tinyint, decimal: decimal(10,0), float: float, double: double, integer: int, long: bigint, short: smallint, timestamp: timestamp, string: string, boolean: boolean, date: date]
# 3. Convert PySpark DataFrame to pandas-on-Spark DataFrame
- >>> psdf = sdf.to_pandas_on_spark()
+ >>> psdf = sdf.pandas_api()
# 4. Check the pandas-on-Spark data types
>>> psdf.dtypes
diff --git a/python/pyspark/pandas/spark/accessors.py b/python/pyspark/pandas/spark/accessors.py
index 0e91f4e..e0d4639 100644
--- a/python/pyspark/pandas/spark/accessors.py
+++ b/python/pyspark/pandas/spark/accessors.py
@@ -396,7 +396,7 @@ class SparkFrameMethods(object):
See Also
--------
DataFrame.to_spark
- DataFrame.to_pandas_on_spark
+ DataFrame.pandas_api
DataFrame.spark.frame
Examples
@@ -440,7 +440,7 @@ class SparkFrameMethods(object):
>>> spark_df = df.to_spark(index_col="index")
>>> spark_df = spark_df.filter("a == 2")
- >>> spark_df.to_pandas_on_spark(index_col="index") # doctest: +NORMALIZE_WHITESPACE
+ >>> spark_df.pandas_api(index_col="index") # doctest: +NORMALIZE_WHITESPACE
a b c
index
1 2 5 8
@@ -460,7 +460,7 @@ class SparkFrameMethods(object):
Likewise, can be converted to back to pandas-on-Spark DataFrame.
- >>> new_spark_df.to_pandas_on_spark(
+ >>> new_spark_df.pandas_api(
... index_col=["index_1", "index_2"]) # doctest: +NORMALIZE_WHITESPACE
b c
index_1 index_2
@@ -893,7 +893,7 @@ class SparkFrameMethods(object):
expensive in general.
.. note:: it will lose column labels. This is a synonym of
- ``func(psdf.to_spark(index_col)).to_pandas_on_spark(index_col)``.
+ ``func(psdf.to_spark(index_col)).pandas_api(index_col)``.
Parameters
----------
@@ -941,7 +941,7 @@ class SparkFrameMethods(object):
"The output of the function [%s] should be of a "
"pyspark.sql.DataFrame; however, got [%s]." % (func, type(output))
)
- return output.to_pandas_on_spark(index_col)
+ return output.pandas_api(index_col)
def repartition(self, num_partitions: int) -> "ps.DataFrame":
"""
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ac1cbf9..337cad5 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3198,9 +3198,19 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
"""
return DataFrameWriterV2(self, table)
+ # Keep to_pandas_on_spark for backward compatibility for now.
def to_pandas_on_spark(
self, index_col: Optional[Union[str, List[str]]] = None
) -> "PandasOnSparkDataFrame":
+ warnings.warn(
+ "DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.",
+ FutureWarning,
+ )
+ return self.pandas_api(index_col)
+
+ def pandas_api(
+ self, index_col: Optional[Union[str, List[str]]] = None
+ ) -> "PandasOnSparkDataFrame":
"""
Converts the existing DataFrame into a pandas-on-Spark DataFrame.
@@ -3230,7 +3240,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
| c| 3|
+----+----+
- >>> df.to_pandas_on_spark() # doctest: +SKIP
+ >>> df.pandas_api() # doctest: +SKIP
Col1 Col2
0 a 1
1 b 2
@@ -3238,7 +3248,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
We can specify the index columns.
- >>> df.to_pandas_on_spark(index_col="Col1"): # doctest: +SKIP
+ >>> df.pandas_api(index_col="Col1"): # doctest: +SKIP
Col2
Col1
a 1
@@ -3261,11 +3271,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
def to_koalas(
self, index_col: Optional[Union[str, List[str]]] = None
) -> "PandasOnSparkDataFrame":
- warnings.warn(
- "DataFrame.to_koalas is deprecated. Use DataFrame.to_pandas_on_spark instead.",
- FutureWarning,
- )
- return self.to_pandas_on_spark(index_col)
+ return self.pandas_api(index_col)
def _to_scala_map(sc: SparkContext, jm: Dict) -> JavaObject:
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index 75301ed..3cafd2c 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -1106,13 +1106,13 @@ class DataFrameTests(ReusedSQLTestCase):
not have_pandas or not have_pyarrow,
cast(str, pandas_requirement_message or pyarrow_requirement_message),
)
- def test_to_pandas_on_spark(self):
+ def test_pandas_api(self):
import pandas as pd
from pandas.testing import assert_frame_equal
sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
- psdf_from_sdf = sdf.to_pandas_on_spark()
- psdf_from_sdf_with_index = sdf.to_pandas_on_spark(index_col="Col1")
+ psdf_from_sdf = sdf.pandas_api()
+ psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1")
pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]})
pdf_with_index = pdf.set_index("Col1")