Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2024/01/10 23:58:26 UTC

[PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

itholic opened a new pull request, #44675:
URL: https://github.com/apache/spark/pull/44675

   ### What changes were proposed in this pull request?
   
   This PR proposes to remove the pandas dependency from `pyspark.testing`.
   
   ### Why are the changes needed?
   
   `pyspark.testing.assertDataFrameEqual` and `pyspark.testing.assertSchemaEqual` should not depend on pandas, but currently they do:
   
   ```python
   >>> from pyspark.testing import assertDataFrameEqual
   AttributeError: module 'pandas' has no attribute '__version__'
   ```
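
One general way to avoid this kind of import-time failure is to defer an optional import until the code path that needs it actually runs. The sketch below is illustrative only and is not the code in this PR; `compare_rows` and `load_optional` are hypothetical names.

```python
import importlib
from typing import Any, List


def compare_rows(actual: List[Any], expected: List[Any]) -> None:
    """Pure-Python check: needs no optional dependency at import or call time."""
    if actual != expected:
        raise AssertionError(f"rows differ: {actual!r} != {expected!r}")


def load_optional(name: str) -> Any:
    """Import an optional module only when a caller actually asks for it."""
    # Deferred import: a missing package fails here, not when this file is imported.
    return importlib.import_module(name)
```

With this shape, importing the module that defines `compare_rows` succeeds even when the optional package is absent; only callers of `load_optional` can see an `ImportError`.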
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but importing `pyspark.testing.assertDataFrameEqual` and `pyspark.testing.assertSchemaEqual` without pandas installed no longer raises an exception.
   
   
   ### How was this patch tested?
   
   Manually tested.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448444255


##########
python/pyspark/testing/psutils.py:
##########
@@ -0,0 +1,157 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import warnings
+from typing import Union, TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from pyspark.pandas.frame import DataFrame
+    from pyspark.pandas.indexes import Index
+    from pyspark.pandas.series import Series
+    import pandas as pd
+
+from pyspark.errors import PySparkAssertionError
+
+
+__all__ = ["assertPandasOnSparkEqual"]
+
+
+def assertPandasOnSparkEqual(
+    actual: Union["DataFrame", "Series", "Index"],
+    expected: Union["DataFrame", "pd.DataFrame", "Series", "pd.Series", "Index", "pd.Index"],
+    checkExact: bool = True,
+    almost: bool = False,
+    rtol: float = 1e-5,
+    atol: float = 1e-8,
+    checkRowOrder: bool = True,
+):
+    r"""
+    A util function to assert equality between actual (pandas-on-Spark object) and expected
+    (pandas-on-Spark or pandas object).
+
+    .. versionadded:: 3.5.0
+
+    .. deprecated:: 3.5.1
+        `assertPandasOnSparkEqual` will be removed in Spark 4.0.0.

Review Comment:
   Actually let's just remove this





Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448175750


##########
python/pyspark/testing/__init__.py:
##########
@@ -16,6 +16,11 @@
 #
 from pyspark.testing.utils import assertDataFrameEqual, assertSchemaEqual
 
-from pyspark.testing.pandasutils import assertPandasOnSparkEqual
+__all__ = ["assertDataFrameEqual", "assertSchemaEqual"]
 
-__all__ = ["assertDataFrameEqual", "assertSchemaEqual", "assertPandasOnSparkEqual"]
+try:
+    from pyspark.testing.pandasutils import assertPandasOnSparkEqual

Review Comment:
   Moved `assertPandasOnSparkEqual` from `pyspark.testing.pandasutils` to `pyspark.testing.psutils`. This removes the pandas dependency from the existing testing utils while keeping `assertPandasOnSparkEqual` available as an API, unchanged.
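
The `psutils.py` hunks quoted in this thread place their pandas imports under `typing.TYPE_CHECKING`. As a minimal sketch of why that keeps annotations without a runtime dependency (the helper `row_count` is hypothetical and not part of the PR):

```python
from typing import TYPE_CHECKING, List, Union

if TYPE_CHECKING:
    # Evaluated by static type checkers only; never executed at runtime.
    import pandas as pd


def row_count(data: Union["pd.DataFrame", List[dict]]) -> int:
    """Hypothetical helper: the string annotation avoids importing pandas."""
    return len(data)
```

Because `"pd.DataFrame"` is a string forward reference, defining and calling `row_count` does not require pandas to be installed.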





Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448231765


##########
python/pyspark/testing/psutils.py:
##########
@@ -0,0 +1,190 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import warnings
+from typing import Union
+
+tabulate_requirement_message = None
+try:
+    from tabulate import tabulate
+except ImportError as e:
+    # If tabulate requirement is not satisfied, skip related tests.
+    tabulate_requirement_message = str(e)
+have_tabulate = tabulate_requirement_message is None
+
+matplotlib_requirement_message = None
+try:
+    import matplotlib
+except ImportError as e:
+    # If matplotlib requirement is not satisfied, skip related tests.
+    matplotlib_requirement_message = str(e)
+have_matplotlib = matplotlib_requirement_message is None
+
+plotly_requirement_message = None
+try:
+    import plotly
+except ImportError as e:
+    # If plotly requirement is not satisfied, skip related tests.
+    plotly_requirement_message = str(e)
+have_plotly = plotly_requirement_message is None

Review Comment:
   Why do we need these libraries?





Re: [PR] [SPARK-46665][PYTHON] Remove `assertPandasOnSparkEqual` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #44675:
URL: https://github.com/apache/spark/pull/44675#issuecomment-1888784124

   Merged to master for Apache Spark 4.0.0.




Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448532925


##########
python/pyspark/testing/psutils.py:
##########
@@ -0,0 +1,157 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import warnings
+from typing import Union, TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from pyspark.pandas.frame import DataFrame
+    from pyspark.pandas.indexes import Index
+    from pyspark.pandas.series import Series
+    import pandas as pd
+
+from pyspark.errors import PySparkAssertionError
+
+
+__all__ = ["assertPandasOnSparkEqual"]
+
+
+def assertPandasOnSparkEqual(
+    actual: Union["DataFrame", "Series", "Index"],
+    expected: Union["DataFrame", "pd.DataFrame", "Series", "pd.Series", "Index", "pd.Index"],
+    checkExact: bool = True,
+    almost: bool = False,
+    rtol: float = 1e-5,
+    atol: float = 1e-8,
+    checkRowOrder: bool = True,
+):
+    r"""
+    A util function to assert equality between actual (pandas-on-Spark object) and expected
+    (pandas-on-Spark or pandas object).
+
+    .. versionadded:: 3.5.0
+
+    .. deprecated:: 3.5.1
+        `assertPandasOnSparkEqual` will be removed in Spark 4.0.0.

Review Comment:
   Removed & updated the PR description.





Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44675:
URL: https://github.com/apache/spark/pull/44675#issuecomment-1886745880

   Let's fix the PR title and description accordingly




Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448120281


##########
python/pyspark/testing/__init__.py:
##########
@@ -16,6 +16,11 @@
 #
 from pyspark.testing.utils import assertDataFrameEqual, assertSchemaEqual
 
-from pyspark.testing.pandasutils import assertPandasOnSparkEqual
+__all__ = ["assertDataFrameEqual", "assertSchemaEqual"]
 
-__all__ = ["assertDataFrameEqual", "assertSchemaEqual", "assertPandasOnSparkEqual"]
+try:
+    from pyspark.testing.pandasutils import assertPandasOnSparkEqual

Review Comment:
   I think you should still keep the API, and throw an exception within this API.
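
A rough sketch of what "keep the API, and throw an exception within this API" could look like in general. This is an assumption about the shape of the suggestion, not the code that was committed; `require_module` and `assert_equal_with_backend` are hypothetical names.

```python
import importlib
from typing import Any


def require_module(name: str, feature: str) -> Any:
    """Import `name`, or raise an ImportError that names the feature needing it."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(f"{feature} requires the optional package {name!r}") from exc


def assert_equal_with_backend(actual: Any, expected: Any, backend: str = "pandas") -> None:
    """Hypothetical API that stays importable; the dependency check runs per call."""
    require_module(backend, "assert_equal_with_backend")
    if actual != expected:
        raise AssertionError(f"{actual!r} != {expected!r}")
```

The function is always importable, so `__all__` does not have to change depending on what is installed; a missing dependency surfaces as a clear error only when the function is called.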





Re: [PR] [SPARK-46665][PYTHON] Remove Pandas dependency for `pyspark.testing` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44675:
URL: https://github.com/apache/spark/pull/44675#discussion_r1448305337


##########
python/pyspark/testing/psutils.py:
##########
@@ -0,0 +1,190 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import warnings
+from typing import Union
+
+tabulate_requirement_message = None
+try:
+    from tabulate import tabulate
+except ImportError as e:
+    # If tabulate requirement is not satisfied, skip related tests.
+    tabulate_requirement_message = str(e)
+have_tabulate = tabulate_requirement_message is None
+
+matplotlib_requirement_message = None
+try:
+    import matplotlib
+except ImportError as e:
+    # If matplotlib requirement is not satisfied, skip related tests.
+    matplotlib_requirement_message = str(e)
+have_matplotlib = matplotlib_requirement_message is None
+
+plotly_requirement_message = None
+try:
+    import plotly
+except ImportError as e:
+    # If plotly requirement is not satisfied, skip related tests.
+    plotly_requirement_message = str(e)
+have_plotly = plotly_requirement_message is None

Review Comment:
   On second thought, it's better to clean these up. Let me remove them. Thanks!
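
For context, flags such as `have_tabulate` in the hunk above are the kind of guard usually consumed by test-skip decorators, as the hunk's own comments suggest. A minimal sketch of that usage, assuming a hypothetical package name that is not installed:

```python
import unittest

# Same guard shape as in the hunk: record why the import failed, if it did.
optional_requirement_message = None
try:
    import no_such_optional_package_xyz  # hypothetical package name, assumed absent
except ImportError as e:
    optional_requirement_message = str(e)
have_optional = optional_requirement_message is None


class OptionalFeatureTest(unittest.TestCase):
    @unittest.skipIf(not have_optional, optional_requirement_message)
    def test_feature(self) -> None:
        self.assertTrue(have_optional)


def run_suite() -> unittest.TestResult:
    """Run the test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalFeatureTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

A module that defines no such tests has no use for these flags, which is consistent with the decision here to drop them.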





Re: [PR] [SPARK-46665][PYTHON] Remove `assertPandasOnSparkEqual` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #44675: [SPARK-46665][PYTHON] Remove `assertPandasOnSparkEqual`
URL: https://github.com/apache/spark/pull/44675

