Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/09/04 04:43:49 UTC

[GitHub] [spark] itholic opened a new pull request, #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

itholic opened a new pull request, #42793:
URL: https://github.com/apache/spark/pull/42793

   
   ### What changes were proposed in this pull request?
   
   This PR proposes to support pandas 2.1.0 for PySpark. See [What's new in 2.1.0](https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html) for more detail.
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   We should follow the latest version of pandas.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No.
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   The existing CI should pass with Pandas 2.1.0.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0
URL: https://github.com/apache/spark/pull/42793




[GitHub] [spark] dongjoon-hyun commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1724614064

   Thank you, @itholic and all!




[GitHub] [spark] itholic commented on a diff in pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1319286522


##########
python/pyspark/pandas/tests/frame/test_reshaping.py:
##########
@@ -291,7 +291,8 @@ def test_stack(self):
         psdf_multi_level_cols2 = ps.from_pandas(pdf_multi_level_cols2)
 
         self.assert_eq(
-            psdf_multi_level_cols2.stack().sort_index(), pdf_multi_level_cols2.stack().sort_index()
+            psdf_multi_level_cols2.stack().sort_index()[["weight", "height"]],
+            pdf_multi_level_cols2.stack().sort_index()[["weight", "height"]],

Review Comment:
   This is just for handling the column order:
   
   **DataFrame**
   ```python
   >>> pdf
       weight height
           kg      m
   cat    1.0    2.0
   dog    3.0    4.0
   ```
   
   **DataFrame.stack() in Pandas 1.5.3**
   ```python
   >>> pdf.stack()
           weight  height
   cat kg     1.0     NaN
       m      NaN     2.0
   dog kg     3.0     NaN
       m      NaN     4.0
   ```
   
   **DataFrame.stack() in Pandas 2.1.0**
   ```python
   >>> pdf.stack()
           weight  height
   cat kg     1.0     NaN
       m      NaN     2.0
   dog kg     3.0     NaN
       m      NaN     4.0
   ```
   
   I think this may be a minor bug in Pandas, so I reported it to the Pandas community to make sure.
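   For illustration, the order-normalizing trick used in the test can be reproduced with plain pandas (a minimal sketch with made-up frames, not the actual test fixture):

```python
import pandas as pd

# Two frames with the same data but different column order:
df1 = pd.DataFrame({"height": [2.0, 4.0], "weight": [1.0, 3.0]})
df2 = pd.DataFrame({"weight": [1.0, 3.0], "height": [2.0, 4.0]})

# A frame-level comparison is sensitive to column order...
print(list(df1.columns) == list(df2.columns))  # False

# ...but selecting the columns in a fixed order makes the
# comparison order-independent, which is what the test change does:
print(df1[["weight", "height"]].equals(df2[["weight", "height"]]))  # True
```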





[GitHub] [spark] itholic commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1720382320

   @zhengruifeng AFAIK, there is no separate policy for the minimum version. We may change the minimum version of a particular package if an older version no longer works properly with Spark, or if the community for that package no longer maintains a particular older version, etc.




[GitHub] [spark] itholic commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1328008105


##########
python/docs/source/migration_guide/pyspark_upgrade.rst:
##########
@@ -42,6 +42,8 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
 * In Spark 4.0, ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark, use ``show_counts`` instead.
 * In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.

Review Comment:
   Good idea. Related information has been added to the top of the migration guide. Thanks!





[GitHub] [spark] ueshin commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1327777150


##########
python/pyspark/pandas/frame.py:
##########
@@ -1321,11 +1323,76 @@ def applymap(self, func: Callable[[Any], Any]) -> "DataFrame":
         0   1.000000   4.494400
         1  11.262736  20.857489
         """
+        warnings.warn(
+            "DataFrame.applymap has been deprecated. Use DataFrame.map instead", FutureWarning
+        )
 
         # TODO: We can implement shortcut theoretically since it creates new DataFrame
         #  anyway and we don't have to worry about operations on different DataFrames.
         return self._apply_series_op(lambda psser: psser.apply(func))
 
+    def map(self, func: Callable[[Any], Any]) -> "DataFrame":
+        """
+        Apply a function to a Dataframe elementwise.
+
+        This method applies a function that accepts and returns a scalar
+        to every element of a DataFrame.
+
+        .. versionadded:: 4.0.0
+            DataFrame.applymap was deprecated and renamed to DataFrame.map.
+
+        .. note:: this API executes the function once to infer the type which is
+             potentially expensive, for instance, when the dataset is created after
+             aggregations or sorting.
+
+             To avoid this, specify return type in ``func``, for instance, as below:
+
+             >>> def square(x) -> np.int32:
+             ...     return x ** 2
+
+             pandas-on-Spark uses return type hints and does not try to infer the type.
+
+        Parameters
+        ----------
+        func : callable
+            Python function returns a single value from a single value.
+
+        Returns
+        -------
+        DataFrame
+            Transformed DataFrame.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([[1, 2.12], [3.356, 4.567]])
+        >>> df
+               0      1
+        0  1.000  2.120
+        1  3.356  4.567
+
+        >>> def str_len(x) -> int:
+        ...     return len(str(x))
+        >>> df.map(str_len)
+           0  1
+        0  3  4
+        1  5  5
+
+        >>> def power(x) -> float:
+        ...     return x ** 2
+        >>> df.map(power)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+
+        You can omit type hints and let pandas-on-Spark infer its type.
+
+        >>> df.map(lambda x: x ** 2)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+        """
+        return self.applymap(func=func)

Review Comment:
   This call will show a deprecation warning from `applymap`?
   
   I guess we should call `return self._apply_series_op(lambda psser: psser.apply(func))` here and `applymap` should call `map` instead?
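   The delegation order being suggested can be sketched with a toy class (hypothetical names, not the actual PySpark implementation): the new `map` holds the implementation, and the deprecated `applymap` warns and forwards to it, so calling `map` never triggers the warning.

```python
import warnings


class ToyFrame:
    def __init__(self, data):
        self.data = data

    def map(self, func):
        # The real implementation lives under the new name;
        # no deprecation warning is emitted here.
        return ToyFrame([func(x) for x in self.data])

    def applymap(self, func):
        # The deprecated alias warns, then delegates to map().
        warnings.warn(
            "applymap has been deprecated. Use map instead", FutureWarning
        )
        return self.map(func)
```

   With this ordering, `ToyFrame([1, 2]).map(lambda x: x * 2)` is silent, while `applymap` raises the `FutureWarning` exactly once.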





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1327998224


##########
python/docs/source/migration_guide/pyspark_upgrade.rst:
##########
@@ -42,6 +42,8 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
 * In Spark 4.0, ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark, use ``show_counts`` instead.
 * In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.

Review Comment:
   Can we add a line here telling users to have pandas version 2.1.0 installed for Spark 4.0?
   The only way now to find which pandas version to install is to check the Dockerfile in dev/infra.
   
   https://github.com/jupyter/docker-stacks/blob/52a999a554fe42951e017f7be132d808695a1261/images/pyspark-notebook/Dockerfile#L69





[GitHub] [spark] dongjoon-hyun commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1720583572

   Could you resolve the conflict, @itholic ?




[GitHub] [spark] itholic commented on pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1718463590

   Seems like the CI failure is not related to my changes. Let me just try retriggering the CI.




[GitHub] [spark] itholic commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1724692463

   Thanks all!




[GitHub] [spark] HyukjinKwon commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1720385772

   Let's probably upgrade them since we're going ahead with the 4.0.0 major version bump.




[GitHub] [spark] itholic commented on a diff in pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1319275507


##########
python/pyspark/pandas/typedef/typehints.py:
##########
@@ -487,23 +487,23 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
     ...     pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
-    [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
+    [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)]

Review Comment:
   The `dtype` of the categories is now added to `__repr__`: https://github.com/pandas-dev/pandas/issues/52179.
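   For illustration, the new repr can be observed directly (the `categories_dtype` part appears only under pandas >= 2.1.0; on earlier versions it is absent):

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=[3, 4, 5], ordered=False)
# Under pandas >= 2.1.0 this also shows the categories' dtype:
# CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)
print(repr(dtype))

# The underlying dtype of the categories is the same either way:
print(dtype.categories.dtype)  # int64
```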



##########
python/pyspark/pandas/frame.py:
##########
@@ -10530,12 +10530,12 @@ def stack(self) -> DataFrameOrSeries:
                 kg      m
         cat    1.0    2.0
         dog    3.0    4.0
-        >>> df_multi_level_cols2.stack().sort_index()  # doctest: +SKIP
-                height  weight
-        cat kg     NaN     1.0
-            m      2.0     NaN
-        dog kg     NaN     3.0
-            m      4.0     NaN
+        >>> df_multi_level_cols2.stack().sort_index()

Review Comment:
   Bug fixed in Pandas: https://github.com/pandas-dev/pandas/issues/53786.



##########
python/pyspark/pandas/groupby.py:
##########
@@ -311,7 +311,14 @@ def aggregate(
                 i for i, gkey in enumerate(self._groupkeys) if gkey._psdf is not self._psdf
             )
             if len(should_drop_index) > 0:
-                psdf = psdf.reset_index(level=should_drop_index, drop=True)
+                drop = not any(
+                    [
+                        isinstance(func_or_funcs[gkey.name], list)
+                        for gkey in self._groupkeys
+                        if gkey.name in func_or_funcs
+                    ]
+                )
+                psdf = psdf.reset_index(level=should_drop_index, drop=drop)

Review Comment:
   Bug fixed in Pandas: https://github.com/pandas-dev/pandas/issues/52849.



##########
python/pyspark/pandas/tests/test_stats.py:
##########
@@ -273,7 +268,18 @@ def test_skew_kurt_numerical_stability(self):
         self.assert_eq(psdf.kurt(), pdf.kurt(), almost=True)
 
     def test_dataframe_corr(self):
-        pdf = makeMissingDataframe(0.3, 42)
+        pdf = pd.DataFrame(
+            index=[
+                "".join(
+                    np.random.choice(
+                        list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"), 10
+                    )
+                )
+                for _ in range(30)
+            ],
+            columns=list("ABCD"),
+            dtype="float64",
+        )

Review Comment:
   The pandas testing util `makeMissingDataframe` has been removed.
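   For context, a sketch of what the removed helper roughly provided (assumed behavior: a random float frame with a given fraction of NaNs and a fixed seed, mirroring `makeMissingDataframe(0.3, 42)`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = rng.standard_normal((30, 4))
# Blank out roughly 30% of the cells, as the removed helper did:
values[rng.random((30, 4)) < 0.3] = np.nan
pdf = pd.DataFrame(values, columns=list("ABCD"))
```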



##########
python/pyspark/pandas/tests/frame/test_reshaping.py:
##########
@@ -291,7 +291,8 @@ def test_stack(self):
         psdf_multi_level_cols2 = ps.from_pandas(pdf_multi_level_cols2)
 
         self.assert_eq(
-            psdf_multi_level_cols2.stack().sort_index(), pdf_multi_level_cols2.stack().sort_index()
+            psdf_multi_level_cols2.stack().sort_index()[["weight", "height"]],
+            pdf_multi_level_cols2.stack().sort_index()[["weight", "height"]],

Review Comment:
   This is just for handling the column order:
   
   **DataFrame**
   ```python
   >>> pdf
       weight height
           kg      m
   cat    1.0    2.0
   dog    3.0    4.0
   ```
   
   **DataFrame.stack() in Pandas 1.5.3**
   ```python
   >>> pdf.stack()
           weight  height
   cat kg     1.0     NaN
       m      NaN     2.0
   dog kg     3.0     NaN
       m      NaN     4.0
   ```
   
   **DataFrame.stack() in Pandas 2.1.0**
   ```python
   >>> pdf.stack()
           weight  height
   cat kg     1.0     NaN
       m      NaN     2.0
   dog kg     3.0     NaN
       m      NaN     4.0
   ```
   
   I think this may be a minor bug in Pandas, so I reported it to the Pandas community to make sure.





[GitHub] [spark] itholic commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1718465891

   I believe this PR is ready for review. The current CI failure is not related to my change.
   
   cc @ueshin @HyukjinKwon @zhengruifeng @xinrong-meng 




[GitHub] [spark] zhengruifeng commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1719206074

   not related to this PR itself, what is the policy to upgrade the minimum version of dependencies listed [here](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies) ?
   
   




[GitHub] [spark] itholic commented on pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1716970680

   Many tests are failing due to the PyArrow upgrade in CI.
   
   https://github.com/apache/spark/pull/42897 is fixing this issue, so let me rebase the PR after the fix is merged.




[GitHub] [spark] itholic commented on pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1722660266

   CI link: https://github.com/itholic/spark/actions/runs/6216894150




[GitHub] [spark] itholic commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1327908926


##########
python/pyspark/pandas/frame.py:
##########
@@ -1321,11 +1323,76 @@ def applymap(self, func: Callable[[Any], Any]) -> "DataFrame":
         0   1.000000   4.494400
         1  11.262736  20.857489
         """
+        warnings.warn(
+            "DataFrame.applymap has been deprecated. Use DataFrame.map instead", FutureWarning
+        )
 
         # TODO: We can implement shortcut theoretically since it creates new DataFrame
         #  anyway and we don't have to worry about operations on different DataFrames.
         return self._apply_series_op(lambda psser: psser.apply(func))
 
+    def map(self, func: Callable[[Any], Any]) -> "DataFrame":
+        """
+        Apply a function to a Dataframe elementwise.
+
+        This method applies a function that accepts and returns a scalar
+        to every element of a DataFrame.
+
+        .. versionadded:: 4.0.0
+            DataFrame.applymap was deprecated and renamed to DataFrame.map.
+
+        .. note:: this API executes the function once to infer the type which is
+             potentially expensive, for instance, when the dataset is created after
+             aggregations or sorting.
+
+             To avoid this, specify return type in ``func``, for instance, as below:
+
+             >>> def square(x) -> np.int32:
+             ...     return x ** 2
+
+             pandas-on-Spark uses return type hints and does not try to infer the type.
+
+        Parameters
+        ----------
+        func : callable
+            Python function returns a single value from a single value.
+
+        Returns
+        -------
+        DataFrame
+            Transformed DataFrame.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([[1, 2.12], [3.356, 4.567]])
+        >>> df
+               0      1
+        0  1.000  2.120
+        1  3.356  4.567
+
+        >>> def str_len(x) -> int:
+        ...     return len(str(x))
+        >>> df.map(str_len)
+           0  1
+        0  3  4
+        1  5  5
+
+        >>> def power(x) -> float:
+        ...     return x ** 2
+        >>> df.map(power)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+
+        You can omit type hints and let pandas-on-Spark infer its type.
+
+        >>> df.map(lambda x: x ** 2)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+        """
+        return self.applymap(func=func)

Review Comment:
   Oh, yeah we shouldn't call `applymap` here.
   
   Just applied the suggestion. Thanks!





[GitHub] [spark] itholic commented on pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42793:
URL: https://github.com/apache/spark/pull/42793#issuecomment-1704604204

   Since many features are [deprecated in Pandas 2.1.0](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#deprecations), let me investigate whether there are any corresponding features in the Pandas API on Spark while we're here.




[GitHub] [spark] itholic commented on a diff in pull request #42793: [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1319299936


##########
python/pyspark/pandas/frame.py:
##########
@@ -10530,12 +10530,12 @@ def stack(self) -> DataFrameOrSeries:
                 kg      m
         cat    1.0    2.0
         dog    3.0    4.0
-        >>> df_multi_level_cols2.stack().sort_index()  # doctest: +SKIP
-                height  weight
-        cat kg     NaN     1.0
-            m      2.0     NaN
-        dog kg     NaN     3.0
-            m      4.0     NaN
+        >>> df_multi_level_cols2.stack().sort_index()

Review Comment:
   The column ordering bug is fixed in Pandas: https://github.com/pandas-dev/pandas/issues/53786.


