Posted to issues@spark.apache.org by "Willi Raschkowski (Jira)" <ji...@apache.org> on 2021/11/26 14:14:00 UTC

[jira] [Comment Edited] (SPARK-37465) PySpark tests failing on Pandas 0.23

    [ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584 ] 

Willi Raschkowski edited comment on SPARK-37465 at 11/26/21, 2:13 PM:
----------------------------------------------------------------------

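I also noticed that {{CategoricalOpsTest}} fails on pandas 0.25.3 (the latest 0.x release) and passes on 1.x. Six comparison tests fail ({{test_eq}}, {{test_ge}}, {{test_gt}}, {{test_le}}, {{test_lt}}, {{test_ne}}); the {{test_eq}} traceback shows a mismatch in the Series {{name}} attribute: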
{code:java}
$ conda list | grep pandas
pandas                    0.25.3           py36he6710b0_0
$ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest'
...
Running tests...
----------------------------------------------------------------------
/home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2.
  FutureWarning
  test_abs (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (2.353s)
  test_add (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (1.382s)
  test_and (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.265s)
ok (6.569s)                                                                     alOpsTest) ... 
  test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.514s)
  test_floordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.910s)
  test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.143s)
  test_ge (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.795s)
  test_gt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.891s)
  test_invert (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s)
  test_isnull (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.097s)
  test_le (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.863s)
  test_lt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.844s)
  test_mod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.897s)
  test_mul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.860s)
  test_ne (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.405s)
  test_neg (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s)
  test_or (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.160s)
  test_pow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.821s)
  test_radd (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.081s)
  test_rand (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.100s)
  test_rfloordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.083s)
  test_rmod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.050s)
  test_rmul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s)
  test_ror (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.095s)
  test_rpow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s)
  test_rsub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s)
  test_rtruediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s)
  test_sub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.818s)
  test_truediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.832s)

======================================================================
FAIL [1.611s]: test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual
    **kwargs
  File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 1248, in assert_series_equal
    assert_attr_equal('name', left, right, obj=obj)
  File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 941, in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
AssertionError: Series are different

Attribute "name" are different
[left]:  that_numeric_cat
[right]: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py", line 268, in test_eq
    psdf["this_numeric_cat"] == psdf["that_numeric_cat"],
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 223, in assert_eq
    self.assertPandasEqual(lobj, robj, check_exact=check_exact)
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 130, in assertPandasEqual
    raise AssertionError(msg) from e
AssertionError: Series are different

Attribute "name" are different
[left]:  that_numeric_cat
[right]: None

Left:
Name: that_numeric_cat, dtype: bool
bool

Right:
dtype: bool
bool

...
{code}
Upgrading pandas to 1.0.0 makes the tests pass:
{code:java}
$ conda list | grep pandas
pandas                    1.0.0            py36h0573a6f_0  
$ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest'
Running PySpark tests. Output is in /home/circleci/project/python/unit-tests.log
Will test against the following Python executables: ['python3.6']
Will test the following Python tests: ['pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest']
python3.6 python_implementation is CPython
python3.6 version is: Python 3.6.12 :: Anaconda, Inc.
Starting test(python3.6): pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest
Finished test(python3.6): pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest (34s)
Tests passed in 34 seconds
{code}
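
The {{name}} mismatch reported in the {{test_eq}} traceback can be checked with pandas alone, outside the Spark test suite. The following is a minimal sketch, not the actual test: the helper name {{compare_named_categoricals}} and the sample values are made up for illustration, and only the column names are taken from the traceback. It compares two categorical Series that have different names and reports the name of the result.

```python
import pandas as pd


def compare_named_categoricals():
    """Compare two differently named categorical Series (hypothetical helper)."""
    # Both sides must share the same categories to be comparable.
    dtype = pd.CategoricalDtype(categories=[1, 2, 3])
    this = pd.Series([1, 2, 3], name="this_numeric_cat").astype(dtype)
    that = pd.Series([3, 2, 1], name="that_numeric_cat").astype(dtype)
    return this == that


result = compare_named_categoricals()
# The traceback suggests that one side of the comparison kept the name
# "that_numeric_cat" under pandas 0.25.3; on pandas 1.x the result of
# comparing Series with different names is expected to be unnamed.
print(pd.__version__, repr(result.name))
```

Running this under both pandas versions should show whether the difference comes from pandas itself rather than from pandas-on-Spark.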



> PySpark tests failing on Pandas 0.23
> ------------------------------------
>
>                 Key: SPARK-37465
>                 URL: https://issues.apache.org/jira/browse/SPARK-37465
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Willi Raschkowski
>            Priority: Major
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(GitHub)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (GitHub)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] in Pandas.
> {code:java}
> $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv'
> ...
> ======================================================================
> ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv
>     self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int))
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper
>     result = safe_na_op(lvalues, rvalues)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op
>     return na_op(lvalues, rvalues)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op
>     result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros
>     signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy                     1.16.6           py36h0a8e133_3  
> numpy-base                1.16.6           py36h41b4c56_3  
> pandas                    0.23.4           py36h04863e7_0  
> pyarrow                   1.0.1           py36h6200943_36_cpu    conda-forge
> python                    3.6.12               hcff3b4d_2    anaconda
> python-dateutil           2.8.1                      py_0    anaconda
> python_abi                3.6                     1_cp36m    conda-forg
> {code}
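
The last frame of the {{test_floordiv}} traceback above is a call to {{np.sign}} on boolean data. A minimal sketch of that step in isolation, assuming only NumPy: the helper {{sign_supports}} is hypothetical, and the integer cast is there for contrast only; it is not meant to describe the change made in the linked pandas pull request.

```python
import numpy as np


def sign_supports(values):
    """Return True if np.sign accepts `values`, False on TypeError (hypothetical helper)."""
    try:
        np.sign(values)
    except TypeError:
        return False
    return True


bools = np.array([True, False, True])

# Per the traceback above, NumPy 1.16.6 raises TypeError for bool input.
print("bool input accepted:", sign_supports(bools))
# Integer input has a matching ufunc loop, so this call succeeds.
print("int input accepted: ", sign_supports(bools.astype(int)))
```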


