Posted to commits@spark.apache.org by gu...@apache.org on 2021/10/07 09:36:32 UTC

[spark] branch master updated: [SPARK-36713][PYTHON][DOCS] Document new syntax of type hints with index (pandas-on-Spark)

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 2e9b698  [SPARK-36713][PYTHON][DOCS] Document new syntax of type hints with index (pandas-on-Spark)
2e9b698 is described below

commit 2e9b698d3100f18595dadbf35abb06502c3d6123
Author: Hyukjin Kwon <gu...@apache.org>
AuthorDate: Thu Oct 7 18:35:20 2021 +0900

    [SPARK-36713][PYTHON][DOCS] Document new syntax of type hints with index (pandas-on-Spark)
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to document the new syntax of type hints with index. Self-contained.
    
    ### Why are the changes needed?
    
    To guide users about the new ways of typing to avoid creating default index.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it adds new sections in the pandas-on-Spark documentation.
    
    ### How was this patch tested?
    
    Manually built the docs and verified the output HTMLs. Also manually ran the example codes.
    
    ![Screen Shot 2021-10-07 at 2 19 41 PM](https://user-images.githubusercontent.com/6477701/136324614-a9eafaa9-79b6-42fb-be65-ac43e12017b7.png)
    
    ![Screen Shot 2021-10-07 at 2 19 38 PM](https://user-images.githubusercontent.com/6477701/136324609-8da68d45-259e-441d-9226-b97fe7b0d63f.png)
    
    Closes #34210 from HyukjinKwon/SPARK-36713.
    
    Authored-by: Hyukjin Kwon <gu...@apache.org>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 .../user_guide/pandas_on_spark/typehints.rst       | 123 ++++++++++++++++++++-
 1 file changed, 117 insertions(+), 6 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/typehints.rst b/python/docs/source/user_guide/pandas_on_spark/typehints.rst
index 2b8628e..72519fc 100644
--- a/python/docs/source/user_guide/pandas_on_spark/typehints.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/typehints.rst
@@ -60,10 +60,10 @@ it as a Spark schema. As an example, you can specify the return type hint as bel
     >>> df = ps.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
     >>> df.groupby('A').apply(pandas_div)
 
-The function ``pandas_div`` actually takes and outputs a pandas DataFrame instead of pandas-on-Spark :class:`DataFrame`.
-However, pandas API on Spark has to force to set the mismatched type hints.
+Notice that the function ``pandas_div`` actually takes and returns a pandas DataFrame instead of
+pandas-on-Spark :class:`DataFrame`. So, technically, the correct type hints are the pandas types.
 
-From pandas-on-Spark 1.0 with Python 3.7+, now you can specify the type hints by using pandas instances.
+With Python 3.7+, you can specify the type hints by using pandas instances as follows:
 
 .. code-block:: python
 
@@ -91,7 +91,7 @@ plans to move gradually towards using pandas instances only as the stability bec
 Type Hinting with Names
 -----------------------
 
-In pandas-on-Spark 1.0, the new style of type hinting was introduced to overcome the limitations in the existing type
+This approach is to overcome the limitations in the existing type
 hinting especially for DataFrame. When you use a DataFrame as the return type hint, for example,
 ``DataFrame[int, int]``, there is no way to specify the names of each Series. In the old way, pandas API on Spark just generates
 the column names as ``c#`` and this easily leads users to lose or forgot the Series mappings. See the example below:
@@ -139,7 +139,8 @@ programmatically generate the return type and schema.
 
 .. code-block:: python
 
-    >>> def transform(pdf) -> pd.DataFrame[zip(pdf.columns, pdf.dtypes)]:
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         zip(sample.columns, sample.dtypes)]:
     ...    return pdf + 1
     ...
     >>> psdf.pandas_on_spark.apply_batch(transform)
@@ -148,7 +149,117 @@ Likewise, ``dtype`` instances from pandas DataFrame can be used alone and let pa
 
 .. code-block:: python
 
-    >>> def transform(pdf) -> pd.DataFrame[pdf.dtypes]:
+    >>> def transform(pdf) -> pd.DataFrame[sample.dtypes]:
     ...     return pdf + 1
     ...
     >>> psdf.pandas_on_spark.apply_batch(transform)
+
+
+Type Hinting with Index
+-----------------------
+
+When you omit index types in the type hints, pandas API on Spark attaches the default index (``compute.default_index_type``),
+and the index column and its information from the original data are lost. Creating the default index can require an
+expensive computation such as a shuffle, so it is best to specify the index type as well.
+
+
+Index
+~~~~~
+
+With the pandas DataFrames below:
+
+.. code-block:: python
+
+    >>> pdf = pd.DataFrame({'id': range(5)})
+    >>> sample = pdf.copy()
+    >>> sample["a"] = sample.id + 1
+
+The ways below are allowed for a regular index:
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[int, [int, int]]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         sample.index.dtype, sample.dtypes]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         ("idxA", int), [("id", int), ("a", int)]]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         (sample.index.name, sample.index.dtype),
+    ...         zip(sample.columns, sample.dtypes)]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+
+MultiIndex
+~~~~~~~~~~
+
+With the pandas DataFrames below:
+
+    >>> midx = pd.MultiIndex.from_arrays(
+    ...     [(1, 1, 2), (1.5, 4.5, 7.5)],
+    ...     names=("int", "float"))
+    >>> pdf = pd.DataFrame(range(3), index=midx, columns=["id"])
+    >>> sample = pdf.copy()
+    >>> sample["a"] = sample.id + 1
+
+The ways below are allowed for multi-index:
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[[int, float], [int, int]]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         sample.index.dtypes, sample.dtypes]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         [("int", int), ("float", float)],
+    ...         [("id", int), ("a", int)]]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
+
+.. code-block:: python
+
+    >>> def transform(pdf) -> pd.DataFrame[
+    ...         zip(sample.index.names, sample.index.dtypes),
+    ...         zip(sample.columns, sample.dtypes)]:
+    ...     pdf["a"] = pdf.id + 1
+    ...     return pdf
+    ...
+    >>> ps.from_pandas(pdf).pandas_on_spark.apply_batch(transform)
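
The new sections derive the index and column type hints from a ``sample`` pandas DataFrame, so the hints are only correct if the function really returns the same names and dtypes as ``sample``. Below is a minimal sketch of how that could be checked locally with plain pandas before running on Spark. It assumes only that pandas is installed; ``schema_of`` is a hypothetical helper written for this illustration and is not part of pandas or pandas API on Spark, and ``transform`` here is an untyped stand-in for the documented function.

```python
import pandas as pd


def schema_of(df):
    """Hypothetical helper: (name, dtype) pairs for index levels and columns."""
    index = [
        (name, str(df.index.get_level_values(i).dtype))
        for i, name in enumerate(df.index.names)
    ]
    columns = [(name, str(dtype)) for name, dtype in zip(df.columns, df.dtypes)]
    return index, columns


def transform(pdf):
    # Plain-pandas stand-in for the documented function (no Spark involved).
    pdf = pdf.copy()
    pdf["a"] = pdf.id + 1
    return pdf


# Regular index: same frames as in the "Index" section.
pdf = pd.DataFrame({"id": range(5)})
sample = pdf.copy()
sample["a"] = sample.id + 1
regular_ok = schema_of(transform(pdf)) == schema_of(sample)

# MultiIndex: same frames as in the "MultiIndex" section.
midx = pd.MultiIndex.from_arrays(
    [(1, 1, 2), (1.5, 4.5, 7.5)], names=("int", "float")
)
mpdf = pd.DataFrame(range(3), index=midx, columns=["id"])
msample = mpdf.copy()
msample["a"] = msample.id + 1
multi_ok = schema_of(transform(mpdf)) == schema_of(msample)

# A column name that differs only in case no longer matches the sample.
wrong = pdf.copy()
wrong["A"] = wrong.id + 1
case_mismatch = schema_of(wrong) != schema_of(sample)

print(regular_ok, multi_ok, case_mismatch)
```

A check like this only compares names and dtypes on the pandas side. It does not exercise how pandas API on Spark maps those dtypes to a Spark schema.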
