Posted to commits@spark.apache.org by gu...@apache.org on 2022/11/01 06:23:42 UTC

[spark] branch master updated: [SPARK-40827][PS][TESTS] Re-enable the DataFrame.corrwith test after fixing in future pandas

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new e24d22e1c9a [SPARK-40827][PS][TESTS] Re-enable the DataFrame.corrwith test after fixing in future pandas
e24d22e1c9a is described below

commit e24d22e1c9afa8d2190d2ca44a16deae58e0fee8
Author: itholic <ha...@databricks.com>
AuthorDate: Tue Nov 1 15:23:27 2022 +0900

    [SPARK-40827][PS][TESTS] Re-enable the DataFrame.corrwith test after fixing in future pandas
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to switch the manual tests for `DataFrame.corrwith` back to the regular approach whenever the pandas version is not 1.5.0.
    
    ### Why are the changes needed?
    
    pandas 1.5.0 introduced a regression (https://github.com/pandas-dev/pandas/issues/48826), which now appears to be resolved.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The re-enabled tests should pass CI.
    
    Closes #38455 from itholic/SPARK-40827.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/tests/test_dataframe.py          |  6 ++++--
 python/pyspark/pandas/tests/test_ops_on_diff_frames.py | 10 ++++++----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index b5466b467d8..4e80c680b6e 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -6091,10 +6091,12 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
     def _test_corrwith(self, psdf, psobj):
         pdf = psdf._to_pandas()
         pobj = psobj._to_pandas()
-        # Regression in pandas 1.5.0 when other is Series and method is "pearson" or "spearman"
+        # There was a regression in pandas 1.5.0
+        # when other is Series and method is "pearson" or "spearman", and fixed in pandas 1.5.1
+        # Therefore, we only test the pandas 1.5.0 in different way.
         # See https://github.com/pandas-dev/pandas/issues/48826 for the reported issue,
         # and https://github.com/pandas-dev/pandas/pull/46174 for the initial PR that causes.
-        if LooseVersion(pd.__version__) >= LooseVersion("1.5.0") and isinstance(pobj, pd.Series):
+        if LooseVersion(pd.__version__) == LooseVersion("1.5.0") and isinstance(pobj, pd.Series):
             methods = ["kendall"]
         else:
             methods = ["pearson", "spearman", "kendall"]
diff --git a/python/pyspark/pandas/tests/test_ops_on_diff_frames.py b/python/pyspark/pandas/tests/test_ops_on_diff_frames.py
index ce1ffb34765..71c393dcf34 100644
--- a/python/pyspark/pandas/tests/test_ops_on_diff_frames.py
+++ b/python/pyspark/pandas/tests/test_ops_on_diff_frames.py
@@ -1866,12 +1866,13 @@ class OpsOnDiffFramesEnabledTest(PandasOnSparkTestCase, SQLTestUtils):
         self._test_corrwith((df1 + 1), df2.B)
         self._test_corrwith((df1 + 1), (df2.B + 2))
 
-        # Regression in pandas 1.5.0
+        # There was a regression in pandas 1.5.0, and fixed in pandas 1.5.1.
+        # Therefore, we only test the pandas 1.5.0 in different way.
         # See https://github.com/pandas-dev/pandas/issues/49141 for the reported issue,
         # and https://github.com/pandas-dev/pandas/pull/46174 for the initial PR that causes.
         df_bool = ps.DataFrame({"A": [True, True, False, False], "B": [True, False, False, True]})
         ser_bool = ps.Series([True, True, False, True])
-        if LooseVersion(pd.__version__) >= LooseVersion("1.5.0"):
+        if LooseVersion(pd.__version__) == LooseVersion("1.5.0"):
             expected = ps.Series([0.5773502691896257, 0.5773502691896257], index=["B", "A"])
             self.assert_eq(df_bool.corrwith(ser_bool), expected, almost=True)
         else:
@@ -1883,10 +1884,11 @@ class OpsOnDiffFramesEnabledTest(PandasOnSparkTestCase, SQLTestUtils):
         self._test_corrwith(self.psdf3, self.psdf4)
 
         self._test_corrwith(self.psdf1, self.psdf1.a)
-        # Regression in pandas 1.5.0
+        # There was a regression in pandas 1.5.0, and fixed in pandas 1.5.1.
+        # Therefore, we only test the pandas 1.5.0 in different way.
         # See https://github.com/pandas-dev/pandas/issues/49141 for the reported issue,
         # and https://github.com/pandas-dev/pandas/pull/46174 for the initial PR that causes.
-        if LooseVersion(pd.__version__) >= LooseVersion("1.5.0"):
+        if LooseVersion(pd.__version__) == LooseVersion("1.5.0"):
             expected = ps.Series([-0.08827348295047496, 0.4413674147523748], index=["b", "a"])
             self.assert_eq(self.psdf1.corrwith(self.psdf2.b), expected, almost=True)
         else:
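
The version gate changed by this commit can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not code from the patch: `release_tuple` and `corrwith_methods` are hypothetical helpers, and the pandas version is passed in as a string instead of being read from `pd.__version__`, so the snippet runs without pandas or `LooseVersion`.

```python
from typing import List, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string.

    Hypothetical helper: parsing stops at the first component that is not
    purely numeric, so "1.5.0rc0" yields (1, 5) rather than (1, 5, 0).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def corrwith_methods(pandas_version: str, other_is_series: bool) -> List[str]:
    """Pick the correlation methods to test for a given pandas version.

    Mirrors the gate in `_test_corrwith` after this commit: only pandas
    1.5.0 itself (an equality check, not ">=") with a Series as `other`
    is restricted to "kendall".
    """
    if release_tuple(pandas_version) == (1, 5, 0) and other_is_series:
        return ["kendall"]
    return ["pearson", "spearman", "kendall"]


if __name__ == "__main__":
    for version in ("1.4.4", "1.5.0", "1.5.1"):
        print(version, corrwith_methods(version, other_is_series=True))
```

With an equality check, only pandas 1.5.0 takes the restricted path; earlier releases and 1.5.1 or later exercise all three methods again.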

