Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/03/22 19:05:10 UTC

[GitHub] [spark] itholic opened a new pull request, #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

itholic opened a new pull request, #40525:
URL: https://github.com/apache/spark/pull/40525

   ### What changes were proposed in this pull request?
   
   This PR proposes basic support for the pandas API on Spark with Spark Connect. It includes the minimal changes needed for the core functionality of the pandas API to work over Spark Connect, and sets up a testing environment in `pyspark/pandas/tests/connect` that reuses all existing pandas-API-on-Spark test bases to exercise the pandas API on Spark in a remote Spark session.
   
   ### Why are the changes needed?
   
   Supporting the pandas API on Spark in Spark Connect significantly improves usability for existing PySpark and pandas users.
   
   ### Does this PR introduce _any_ user-facing change?
   
   It is designed so that existing code written for regular Spark sessions works without user-facing changes. However, since some features of the existing pandas API on Spark are not yet fully supported in Spark Connect, functionality may be limited in places.
   
   ### How was this patch tested?
   
   A test bed has been set up that replicates all existing pandas-API-on-Spark tests against Spark Connect, ensuring the existing test suite can be reproduced in a remote session. The current results for all tests are as follows:
   
   | Test file                                           | Test total | Test passed | Pass rate |
   | --------------------------------------------------- | ---------- | ----------- | --------- |
   | test_parity_dataframe.py                            | 105        | 85          | 80.95%   |
   | test_parity_dataframe_slow.py                       | 66         | 48          | 72.73%   |
   | test_parity_dataframe_conversion.py                 | 11         | 11          | 100.00%  |
   | test_parity_dataframe_spark_io.py                   | 8          | 7           | 87.50%   |
   | test_parity_ops_on_diff_frames.py                   | 75         | 75          | 100.00%  |
   | test_parity_series.py                               | 131        | 104         | 79.39%   |
   | test_parity_series_datetime.py                      | 41         | 34          | 82.93%   |
   | test_parity_categorical.py                          | 29         | 22          | 75.86%   |
   | test_parity_config.py                               | 7          | 7           | 100.00%  |
   | test_parity_csv.py                                  | 18         | 18          | 100.00%  |
   | test_parity_default_index.py                        | 4          | 1           | 25.00%   |
   | test_parity_ewm.py                                  | 3          | 1           | 33.33%   |
   | test_parity_expanding.py                            | 22         | 2           | 9.09%    |
   | test_parity_extension.py                            | 7          | 7           | 100.00%  |
   | test_parity_frame_spark.py                          | 6          | 2           | 33.33%   |
   | test_parity_generic_functions.py                    | 4          | 1           | 25.00%   |
   | test_parity_groupby.py                              | 49         | 36          | 73.47%   |
   | test_parity_groupby_slow.py                         | 205        | 147         | 71.71%   |
   | test_parity_indexing.py                             | 3          | 3           | 100.00%  |
   | test_parity_indexops_spark.py                       | 3          | 3           | 100.00%  |
   | test_parity_internal.py                             | 1          | 0           | 0.00%    |
   | test_parity_namespace.py                            | 29         | 26          | 89.66%   |
   | test_parity_numpy_compat.py                         | 6          | 4           | 66.67%   |
   | test_parity_ops_on_diff_frames_groupby.py           | 22         | 13          | 59.09%   |
   | test_parity_ops_on_diff_frames_groupby_expanding.py | 7          | 0           | 0.00%    |
   | test_parity_ops_on_diff_frames_groupby_rolling.py   | 7          | 0           | 0.00%    |
   | test_parity_ops_on_diff_frames_slow.py              | 22         | 15          | 68.18%   |
   | test_parity_repr.py                                 | 5          | 5           | 100.00%  |
   | test_parity_resample.py                             | 5          | 3           | 60.00%   |
   | test_parity_reshape.py                              | 10         | 8           | 80.00%   |
   | test_parity_rolling.py                              | 21         | 1           | 4.76%    |
   | test_parity_scalars.py                              | 1          | 1           | 100.00%  |
   | test_parity_series_conversion.py                    | 2          | 2           | 100.00%  |
   | test_parity_series_string.py                        | 56         | 55          | 98.21%   |
   | test_parity_spark_functions.py                      | 1          | 1           | 100.00%  |
   | test_parity_sql.py                                  | 7          | 4           | 57.14%   |
   | test_parity_stats.py                                | 15         | 7           | 46.67%   |
   | test_parity_typedef.py                              | 10         | 10          | 100.00%  |
   | test_parity_utils.py                                | 5          | 5           | 100.00%  |
   | test_parity_window.py                               | 2          | 2           | 100.00%  |
   | test_parity_frame_plot.py                           | 7          | 5           | 71.43%   |
   | plot/test_parity_frame_plot_matplotlib.py           | 13         | 11          | 84.62%   |
   | plot/test_parity_frame_plot_plotly.py               | 12         | 9           | 75.00%   |
   | plot/test_parity_series_plot.py                     | 3          | 3           | 100.00%  |
   | plot/test_parity_series_plot_matplotlib.py          | 14         | 8           | 57.14%   |
   | plot/test_parity_series_plot_plotly.py              | 9          | 7           | 77.78%   |
   | indexes/test_parity_base.py                         | 144        | 75          | 52.08%   |
   | indexes/test_parity_category.py                     | 16         | 7           | 43.75%   |
   | indexes/test_parity_datetime.py                     | 13         | 11          | 84.62%   |
   | indexes/test_parity_timedelta.py                    | 2          | 1           | 50.00%   |
   | data_type_ops/test_parity_base.py                   | 2          | 2           | 100.00%  |
   | data_type_ops/test_parity_binary_ops.py             | 30         | 25          | 83.33%   |
   | data_type_ops/test_parity_boolean_ops.py            | 31         | 26          | 83.87%   |
   | data_type_ops/test_parity_categorical_ops.py        | 30         | 23          | 76.67%   |
   | data_type_ops/test_parity_complex_ops.py            | 30         | 30          | 100.00%  |
   | data_type_ops/test_parity_date_ops.py               | 30         | 25          | 83.33%   |
   | Total                                               | 1417       | 1044        | 73.68%   |




[GitHub] [spark] bjornjorgensen commented on a diff in pull request #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1148591182


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -24,7 +24,7 @@
 import pandas as pd
 from pandas.api.types import CategoricalDtype
 
-from pyspark.sql import functions as F, Column
+from pyspark.sql import functions as F, Column as PySparkColumn

Review Comment:
   Does this work?
   ```python
   from pyspark.sql import functions as F
   from pyspark.sql.column import Column as PySparkColumn
   ```
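
   (It should; the `pyspark.sql` package re-exports the class defined in `pyspark.sql.column`, so both paths name the same object. A quick check, assuming a standard PySpark install:)
   ```python
   from pyspark.sql import Column as FromPackage
   from pyspark.sql.column import Column as FromModule

   # Both imports bind the very same class object.
   assert FromPackage is FromModule
   ```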





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155911972


##########
python/pyspark/pandas/tests/connect/test_parity_categorical.py:
##########
@@ -0,0 +1,82 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+import pandas as pd
+from pyspark import pandas as ps
+from pyspark.pandas.tests.test_categorical import CategoricalTestsMixin
+from pyspark.testing.connectutils import ReusedConnectTestCase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
+
+
+class CategoricalParityTests(
+    CategoricalTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
+):
+    @property
+    def pdf(self):
+        return pd.DataFrame(

Review Comment:
   can't we inherit this?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155891236


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -196,7 +215,9 @@ def xor(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         elif _is_valid_for_logical_operator(right):
             right_is_boolean = _is_boolean_type(right)
 
-            def xor_func(left: Column, right: Any) -> Column:
+            Column = ConnectColumn if is_remote() else PySparkColumn
+
+            def xor_func(left: GenericColumn, right: Any) -> GenericColumn:
                 if not isinstance(right, Column):

Review Comment:
   ```suggestion
                   if not isinstance(right, (PySparkColumn, GenericColumn)):
   ```
   and remove `Column = ConnectColumn if is_remote() else PySparkColumn`
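
   A self-contained sketch of the shape this suggestion points at (hedged: `ConnectColumn` exists only when the Spark Connect extras such as `grpcio` are installed, and the function body here is illustrative rather than the PR's exact logic):
   ```python
   from typing import Any, Tuple, Type

   from pyspark.sql import functions as F
   from pyspark.sql.column import Column as PySparkColumn

   try:
       from pyspark.sql.connect.column import Column as ConnectColumn
       _column_types: Tuple[Type, ...] = (PySparkColumn, ConnectColumn)
   except ImportError:  # plain installs without the Connect client
       _column_types = (PySparkColumn,)

   def xor_func(left: Any, right: Any) -> Any:
       # Lift plain Python values into a literal column, as in the quoted
       # diff; no runtime rebinding of `Column` is needed.
       if not isinstance(right, _column_types):
           right = F.lit(right)
       return left.bitwiseXOR(right)
   ```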



##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -219,22 +240,23 @@ def pretty_name(self) -> str:
     def mul(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if isinstance(right, IndexOpsMixin) and isinstance(right.spark.data_type, StringType):
-            return column_op(SF.repeat)(right, left)
+            return column_op(cast(Callable[..., GenericColumn], SF.repeat))(right, left)
 
         if not is_valid_operand_for_numeric_arithmetic(right):
             raise TypeError("Multiplication can not be applied to given types.")
 
         right = transform_boolean_operand_to_numeric(right, spark_type=left.spark.data_type)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__mul__)(left, right)

Review Comment:
   ditto





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159211030


##########
python/pyspark/pandas/spark/accessors.py:
##########
@@ -64,7 +68,7 @@ def column(self) -> Column:
         """
         return self._data._internal.spark_column_for(self._data._column_label)
 
-    def transform(self, func: Callable[[Column], Column]) -> IndexOpsLike:
+    def transform(self, func: Callable[[Column], Union[Column, ConnectColumn]]) -> IndexOpsLike:

Review Comment:
   ditto, let me address it in a subsequent PR, since it brings hundreds more mypy failures. I plan to create several follow-up tickets after completing this initial PR, so I'd like to focus on setting up the test base here and encourage community involvement for the remaining work.





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159196999


##########
python/pyspark/pandas/internal.py:
##########
@@ -25,7 +25,13 @@
 import pandas as pd
 from pandas.api.types import CategoricalDtype  # noqa: F401
 from pyspark._globals import _NoValue, _NoValueType
-from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame, Window
+from pyspark.sql import (
+    functions as F,
+    Column,
+    DataFrame as LegacyDataFrame,
+    Window,
+    SparkSession as PySparkSession,
+)

Review Comment:
   Let me consolidate them into `PySparkXxx`





[GitHub] [spark] itholic commented on pull request #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1480119115

   The remaining task at hand is to address numerous mypy annotation issues. If you have any good ideas for resolving the linter failures, please feel free to let me know at any time :-)




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155884514


##########
python/pyspark/pandas/data_type_ops/boolean_ops.py:
##########
@@ -39,6 +39,9 @@
 from pyspark.sql.column import Column
 from pyspark.sql.types import BooleanType, StringType
 

Review Comment:
   no newline





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156657940


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -470,14 +474,16 @@ def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         else:
             from pyspark.pandas.base import column_op
 
-            return column_op(Column.__eq__)(left, right)
+            Column = ConnectColumn if is_remote() else PySparkColumn
+            return column_op(cast(Callable[..., GenericColumn], Column.__eq__))(left, right)

Review Comment:
   Sounds good. Then it seems we don't even need `cast` here. Let me apply the change across all the files.







[GitHub] [spark] itholic commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501265948

   > We should probably think about adding grpcio as a hard dependency for whole PySpark project, but definitely not alone for pandas API on Spark
   
   I think this point is particularly crucial even though all of your opinions make sense to me.
   To summarize, I will make a fix following the suggestion in https://github.com/apache/spark/pull/40525#issuecomment-1501097772 so that the pandas API on Spark can work without `grpcio` for now. Then we can discuss adding the `grpcio` dependency to the entire PySpark project if needed in the future.
   Thank you for all the feedback!




[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501048286

   For example, you can't just do `from pyspark.sql.connect.column import Column as ConnectColumn` in `pyspark/pandas/_typing.py`. Even if you don't use Spark Connect, the import will check the dependency. Can we avoid this?
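
   One common way to sidestep the import-time check (a sketch of the general pattern, not necessarily what the PR settled on): gate the Connect import behind `typing.TYPE_CHECKING`, so mypy still sees the name but a plain install never evaluates the import.
   ```python
   from typing import TYPE_CHECKING, Union

   from pyspark.sql.column import Column as PySparkColumn

   if TYPE_CHECKING:
       # Evaluated only by static type checkers, never at runtime, so users
       # without grpcio installed are unaffected.
       from pyspark.sql.connect.column import Column as ConnectColumn

   # The forward reference keeps the alias usable in annotations either way.
   GenericColumn = Union[PySparkColumn, "ConnectColumn"]
   ```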




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156690881


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps
+from pyspark.pandas.tests.data_type_ops.test_string_ops import StringOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class StringOpsParityTests(
+    StringOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @property
+    def psdf(self):
+        return ps.from_pandas(self.pdf)

Review Comment:
   Because the existing `StringOpsTestsMixin` inherits `OpsTestBase`, which gets `psdf` from `ComparisonTestBase`, but `StringOpsParityTests` doesn't inherit `ComparisonTestBase`, because that base uses the legacy Spark session.
   So we should define `psdf` separately here.
   Otherwise it complains `AttributeError: 'StringOpsParityTests' object has no attribute 'psdf'`.
   
   This is basically the same reason as https://github.com/apache/spark/pull/40525#discussion_r1156688698.





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154322879


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_binary_ops.py:
##########
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.pandas.tests.data_type_ops.test_binary_ops import BinaryOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class BinaryOpsParityTests(
+    BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @unittest.skip("Fails in Spark Connect, should enable.")

Review Comment:
   Good points. I'm compiling a list of the failures for each test.
   Let me do it in follow-ups to avoid adding any further complexity to this PR. 😂





[GitHub] [spark] HyukjinKwon closed pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect
URL: https://github.com/apache/spark/pull/40525




[GitHub] [spark] zhengruifeng commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154257731


##########
dev/sparktestsupport/modules.py:
##########
@@ -693,6 +693,56 @@ def __hash__(self):
         "pyspark.pandas.tests.test_typedef",
         "pyspark.pandas.tests.test_utils",
         "pyspark.pandas.tests.test_window",
+        # unittests - Spark Connect parity tests

Review Comment:
   I think we should move the tests into `pyspark-connect`, and then make `pyspark-connect` also depend on `pyspark_core`



##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_binary_ops.py:
##########
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.pandas.tests.data_type_ops.test_binary_ops import BinaryOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class BinaryOpsParityTests(
+    BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @unittest.skip("Fails in Spark Connect, should enable.")

Review Comment:
   I think we'd better document the failure reasons:
   
   - some need to be enabled
   - some can't be, possibly due to the private method `_jvm` in the test
   - etc.
   
   but this can be done in follow-ups
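
   A sketch of what that documentation could look like in the parity tests (the reason strings and test names here are hypothetical examples, not the actual triage):
   ```python
   import unittest


   class BinaryOpsParityTests(unittest.TestCase):
       @unittest.skip("Fails in Spark Connect; should be enabled once the missing API lands.")
       def test_enableable_case(self):
           ...

       @unittest.skip("Cannot run on Spark Connect: the test reaches into the private `_jvm`.")
       def test_jvm_dependent_case(self):
           ...
   ```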





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155898172


##########
python/pyspark/pandas/namespace.py:
##########
@@ -94,6 +94,8 @@
 from pyspark.pandas.indexes import Index, DatetimeIndex, TimedeltaIndex
 from pyspark.pandas.indexes.multi import MultiIndex
 

Review Comment:
   no newline



##########
python/pyspark/pandas/numpy_compat.py:
##########
@@ -23,6 +23,9 @@
 
 from pyspark.pandas.base import IndexOpsMixin
 

Review Comment:
   no newline





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155438763


##########
python/pyspark/pandas/utils.py:
##########
@@ -792,7 +801,9 @@ def validate_mode(mode: str) -> str:
 
 
 @overload
-def verify_temp_column_name(df: SparkDataFrame, column_name_or_label: str) -> str:

Review Comment:
   Sounds reasonable to me. Let me update the naming. Thanks!





[GitHub] [spark] grundprinzip commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501127314

   Actually, even today pandas-on-Spark users have to install specific dependencies that are not part of the default PySpark installation.
   
   See https://github.com/apache/spark/blob/master/python/setup.py
   
   Now, the quickest fix is to simply add grpcio to the pandas-on-Spark dependency list.
   
   Given that we're forcing pandas users into specific Java versions etc., adding grpcio does not make it worse, in my opinion.
   
   The user surface remains the same.




[GitHub] [spark] grundprinzip commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501140136

   IIUC, only grpcio is needed. The rest is localized to the Spark Connect client.
   
   




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1161394032


##########
dev/sparktestsupport/modules.py:
##########
@@ -778,6 +778,68 @@ def __hash__(self):
         # ml unittests
         "pyspark.ml.tests.connect.test_connect_function",
         "pyspark.ml.tests.connect.test_parity_torch_distributor",
+        # pandas-on-Spark unittests

Review Comment:
   Let me address within follow-up PR. Thanks!





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159212361


##########
python/pyspark/pandas/data_type_ops/boolean_ops.py:
##########
@@ -238,8 +241,8 @@ def __and__(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             return right.__and__(left)
         else:
 
-            def and_func(left: Column, right: Any) -> Column:
-                if not isinstance(right, Column):
+            def and_func(left: GenericColumn, right: Any) -> GenericColumn:
+                if not isinstance(right, (Column, ConnectColumn)):

Review Comment:
   Nice.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155905112


##########
python/pyspark/pandas/tests/connect/data_type_ops/testing_utils.py:
##########
@@ -0,0 +1,226 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import decimal
+from distutils.version import LooseVersion
+
+import numpy as np
+import pandas as pd
+
+import pyspark.pandas as ps
+from pyspark.pandas.typedef import extension_dtypes
+
+from pyspark.pandas.typedef.typehints import (
+    extension_dtypes_available,
+    extension_float_dtypes_available,
+    extension_object_dtypes_available,
+)
+
+if extension_dtypes_available:
+    from pandas import Int8Dtype, Int16Dtype, Int32Dtype, Int64Dtype
+
+if extension_float_dtypes_available:
+    from pandas import Float32Dtype, Float64Dtype
+
+if extension_object_dtypes_available:
+    from pandas import BooleanDtype, StringDtype
+
+
+class OpsTestBase:

Review Comment:
   why do we need to copy this?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155893321


##########
python/pyspark/pandas/frame.py:
##########
@@ -5490,6 +5516,7 @@ def _assign(self, kwargs: Any) -> "DataFrame":
             if callable(v):
                 kwargs[k] = v(self)
 
+        Column = ConnectColumn if is_remote() else PySparkColumn

Review Comment:
   remove and fix `isinstance(v, Column)`





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155893722


##########
python/pyspark/pandas/frame.py:
##########
@@ -7443,12 +7472,12 @@ def _prepare_sort_by_scols(self, by: Union[Name, List[Name]]) -> List[Column]:
                     "The column %s is not unique. For a multi-index, the label must be a tuple "
                     "with elements corresponding to each level." % name_like_string(colname)
                 )
-            new_by.append(ser.spark.column)
+            new_by.append(cast(GenericColumn, ser.spark.column))
         return new_by
 
     def _sort(
         self,
-        by: List[Column],
+        by: Sequence[GenericColumn],

Review Comment:
   Why `Sequence`?
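
   (Likely variance, reading between the lines rather than anything stated in the thread: `List` is invariant under mypy, so a `List[PySparkColumn]` is rejected where `List[GenericColumn]` is expected, while the read-only `Sequence` is covariant and accepts it.)
   ```python
   from typing import List, Sequence, Union

   class Base: ...
   class Sub(Base): ...

   def takes_list(xs: List[Union[Base, Sub]]) -> None: ...
   def takes_seq(xs: Sequence[Union[Base, Sub]]) -> None: ...

   subs: List[Sub] = [Sub()]
   takes_list(subs)  # mypy error: List is invariant
   takes_seq(subs)   # accepted: Sequence is covariant
   ```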





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155892126


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -461,11 +494,13 @@ def rpow(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not isinstance(right, numbers.Number):
             raise TypeError("Exponentiation can not be applied to given types.")
 
-        def rpow_func(left: Column, right: Any) -> Column:
+        Column = ConnectColumn if is_remote() else PySparkColumn
+
+        def rpow_func(left: GenericColumn, right: Any) -> GenericColumn:
             return (
-                F.when(left.isNull(), np.nan)
+                F.when(left.isNull(), np.nan)  # type: ignore
                 .when(F.lit(right == 1), right)
-                .otherwise(Column.__rpow__(left, right))
+                .otherwise(Column.__rpow__(left, right))  # type: ignore

Review Comment:
   ditto `left.__rpow__`
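
   Spelled out, the suggestion would read roughly like this (a sketch; `F` and `np` as imported in the quoted file, surrounding typing elided):
   ```python
   import numpy as np
   from pyspark.sql import functions as F

   def rpow_func(left, right):
       # Calling the dunder on the instance works for either column flavor,
       # so the class-level `Column.__rpow__` lookup (and the runtime
       # rebinding of `Column`) becomes unnecessary.
       return (
           F.when(left.isNull(), np.nan)
           .when(F.lit(right == 1), right)
           .otherwise(left.__rpow__(right))
       )
   ```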





[GitHub] [spark] itholic commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501047078

   Thank you for the feedback, @bjornjorgensen !
   
   IMHO, it seems more reasonable to add `grpcio` as a dependency for the pandas API on Spark instead of reverting all of this change (Oh, it seems you already opened https://github.com/apache/spark/pull/40716 for this? 😄)
   
   The purpose of Spark Connect is to allow users to use their existing PySpark projects without any code changes through a remote client. Therefore, if a user is using the `pyspark.pandas` module in their existing code, it should work the same way through the remote client as well.
   
   I think we should support as much of PySpark's functionality as possible, including the pandas API on Spark, since at this point nobody can be sure whether existing PySpark users, and not only existing pandas users, will use the pandas API on Spark through Spark Connect.
   
   Alternatively, we might be able to create a completely separate package path for the pandas API on Spark for Spark Connect. This would allow the existing pandas API on Spark to be used without installing `grpcio`, but it would be much more overhead than simply changing the policy to add one package as an additional installation.
   
   WDYT? also cc @HyukjinKwon @grundprinzip @ueshin @zhengruifeng FYI




[GitHub] [spark] grundprinzip commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501126555

   That's only partially true, they still have to install PySpark. If they pip install PySpark[connect] the dependencies are properly resolved.




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156685617


##########
python/pyspark/pandas/utils.py:
##########
@@ -802,7 +820,8 @@ def verify_temp_column_name(df: "DataFrame", column_name_or_label: Name) -> Labe
 
 
 def verify_temp_column_name(
-    df: Union["DataFrame", SparkDataFrame], column_name_or_label: Union[str, Name]
+    df: Union["DataFrame", LegacyDataFrame, ConnectDataFrame],

Review Comment:
   Let me consolidate it into `LegacyDataFrame`.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155900721


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py:
##########
@@ -0,0 +1,67 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps

Review Comment:
   ```suggestion
   from pyspark import pandas as pd
   ```





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156771777


##########
python/pyspark/pandas/frame.py:
##########
@@ -7462,19 +7490,24 @@ def _sort(
         if na_position not in ("first", "last"):
             raise ValueError("invalid na_position: '{}'".format(na_position))
 
-        # Mapper: Get a spark column function for (ascending, na_position) combination
+        Column = ConnectColumn if is_remote() else LegacyColumn
+        # Mapper: Get a spark column function for (ascending, na_position) combination
         mapper = {
-            (True, "first"): Column.asc_nulls_first,
-            (True, "last"): Column.asc_nulls_last,
-            (False, "first"): Column.desc_nulls_first,
-            (False, "last"): Column.desc_nulls_last,

Review Comment:
   Here too. I believe you can just replace `Column.desc_nulls_last` with `lambda x: x.desc_nulls_last()`.
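
   Spelled out, the mapper would look something like this (a sketch of the suggestion; the lambdas call the bound methods, so the mapped values stay plain column-to-column callables):
   ```python
   mapper = {
       (True, "first"): lambda x: x.asc_nulls_first(),
       (True, "last"): lambda x: x.asc_nulls_last(),
       (False, "first"): lambda x: x.desc_nulls_first(),
       (False, "last"): lambda x: x.desc_nulls_last(),
   }
   ```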





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156772788


##########
dev/sparktestsupport/modules.py:
##########
@@ -732,7 +732,7 @@ def __hash__(self):
 
 pyspark_connect = Module(
     name="pyspark-connect",
-    dependencies=[pyspark_sql, pyspark_ml, connect],
+    dependencies=[pyspark_core, pyspark_sql, pyspark_ml, connect],

Review Comment:
   I don't think you need core here?





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155437211


##########
dev/sparktestsupport/modules.py:
##########
@@ -777,6 +777,68 @@ def __hash__(self):
         "pyspark.ml.connect.functions",
         # ml unittests
         "pyspark.ml.tests.connect.test_connect_function",
+        # pandas-on-Spark unittests

Review Comment:
   Yes, that will be added in the subsequent task in SPARK-43003 right after this PR.
   
   For now, I want to focus on validating the basic functionalities through unit tests and linters in the initial PR.





[GitHub] [spark] ueshin commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1158882901


##########
python/pyspark/pandas/_typing.py:
##########
@@ -21,6 +21,12 @@
 import numpy as np
 from pandas.api.extensions import ExtensionDtype
 
+from pyspark.sql.column import Column as LegacyColumn

Review Comment:
   Actually I prefer `PySparkXxx` to `LegacyXxx`.





[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1500814281

   Merged to master.




[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501262560

   So my proposal is to keep it consistent for now (that does not require `grpcio` when we don't use Spark Connect), and separately discuss whether we should switch them to a hard required dependency.




[GitHub] [spark] grundprinzip commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501118977

   We need to be careful about changing the imports too early or too late. The `is_remote` functionality works by checking an environment variable; if it's not set during the import, setting it later yields undesired consequences.
   
   Is it so hard to add the grpc dependency when using the pandas API?
   
   What are we achieving with this? gRPC is a stable protocol and not a random library. It's available on all platforms.
   
   What's the benefit of trying this pure approach?
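   
   For context, a rough sketch of that environment-variable check (an approximation of the helper in `pyspark.sql.utils`, from memory rather than a verbatim copy):
   
   ```python
   import os
   
   def is_remote() -> bool:
       # True only when the remote URL was exported before pyspark was
       # imported, i.e. the process started in Spark Connect mode.
       return "SPARK_REMOTE" in os.environ
   ```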




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154324225


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_binary_ops.py:
##########
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.pandas.tests.data_type_ops.test_binary_ops import BinaryOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class BinaryOpsParityTests(
+    BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @unittest.skip("Fails in Spark Connect, should enable.")

Review Comment:
   Just created a ticket so we don't forget: SPARK-42996



##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_binary_ops.py:
##########
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.pandas.tests.data_type_ops.test_binary_ops import BinaryOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class BinaryOpsParityTests(
+    BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @unittest.skip("Fails in Spark Connect, should enable.")

Review Comment:
   Created a ticket so we don't forget: SPARK-42996





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155903701


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps
+from pyspark.pandas.tests.data_type_ops.test_string_ops import StringOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class StringOpsParityTests(
+    StringOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @property
+    def psdf(self):
+        return ps.from_pandas(self.pdf)

Review Comment:
   Why do we need to override this here though?



##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_timedelta_ops.py:
##########
@@ -0,0 +1,67 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps

Review Comment:
   not used





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155888734


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -78,6 +82,7 @@ def add(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             raise TypeError("Addition can not be applied to given types.")
 
         right = transform_boolean_operand_to_numeric(right, spark_type=left.spark.data_type)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__add__)(left, right)

Review Comment:
   ditto



##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -86,14 +91,15 @@ def sub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             raise TypeError("Subtraction can not be applied to given types.")
 
         right = transform_boolean_operand_to_numeric(right, spark_type=left.spark.data_type)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__sub__)(left, right)

Review Comment:
   ditto





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156657940


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -470,14 +474,16 @@ def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         else:
             from pyspark.pandas.base import column_op
 
-            return column_op(Column.__eq__)(left, right)
+            Column = ConnectColumn if is_remote() else PySparkColumn
+            return column_op(cast(Callable[..., GenericColumn], Column.__eq__))(left, right)

Review Comment:
   Sounds good. Then we don't even need `cast`. Let me apply this change across the whole PR.



##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -470,14 +474,16 @@ def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         else:
             from pyspark.pandas.base import column_op
 
-            return column_op(Column.__eq__)(left, right)
+            Column = ConnectColumn if is_remote() else PySparkColumn
+            return column_op(cast(Callable[..., GenericColumn], Column.__eq__))(left, right)

Review Comment:
   Sounds good. Then it seems like we don't even need `cast` here. Let me apply this change across the whole PR.
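   
   A minimal sketch of the lambda form under discussion (hedged: one possible way to write the `__eq__` dispatch without a module-level Connect import or the `cast`):
   
   ```python
   # Comparing the operands directly lets Python dispatch to whichever
   # Column type is in play, legacy or Connect:
   return column_op(lambda l, r: l == r)(left, right)
   ```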





[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1495403113

   LGTM if tests pass




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159212183


##########
python/pyspark/pandas/internal.py:
##########
@@ -616,8 +628,7 @@ def __init__(
         >>> internal.column_label_names
         [('column_labels_a',), ('column_labels_b',)]
         """
-
-        assert isinstance(spark_frame, SparkDataFrame)
+        assert isinstance(spark_frame, (LegacyDataFrame, ConnectDataFrame))

Review Comment:
   Nice workaround!





[GitHub] [spark] itholic commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1489771881

   CI passed. cc @HyukjinKwon @ueshin @xinrong-meng @zhengruifeng PTAL when you find some time.
   
   I summarized the key changes in the PR description for review.




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154325557


##########
dev/sparktestsupport/modules.py:
##########
@@ -693,6 +693,56 @@ def __hash__(self):
         "pyspark.pandas.tests.test_typedef",
         "pyspark.pandas.tests.test_utils",
         "pyspark.pandas.tests.test_window",
+        # unittests - Spark Connect parity tests

Review Comment:
   Updated. Thanks! :-)





[GitHub] [spark] bjornjorgensen commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1500961292

   @itholic Thank you, great work :) 
   
   After this PR 
   `from pyspark import pandas as ps `
   
   ModuleNotFoundError                       Traceback (most recent call last)
   File /opt/spark/python/pyspark/sql/connect/utils.py:45, in require_minimum_grpc_version()
        44 try:
   ---> 45     import grpc
        46 except ImportError as error:
   
   ModuleNotFoundError: No module named 'grpc'
   
   The above exception was the direct cause of the following exception:
   
   ImportError                               Traceback (most recent call last)
   Cell In[1], line 11
         9 import pyarrow
        10 from pyspark import SparkConf, SparkContext
   ---> 11 from pyspark import pandas as ps
        12 from pyspark.sql import SparkSession
        13 from pyspark.sql.functions import col, concat, concat_ws, expr, lit, trim
   
   File /opt/spark/python/pyspark/pandas/__init__.py:59
        50     warnings.warn(
        51         "'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to "
        52         "set this environment variable to '1' in both driver and executor sides if you use "
      (...)
        55         "already launched."
        56     )
        57     os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
   ---> 59 from pyspark.pandas.frame import DataFrame
        60 from pyspark.pandas.indexes.base import Index
        61 from pyspark.pandas.indexes.category import CategoricalIndex
   
   File /opt/spark/python/pyspark/pandas/frame.py:88
        85 from pyspark.sql.window import Window
        87 from pyspark import pandas as ps  # For running doctests and reference resolution in PyCharm.
   ---> 88 from pyspark.pandas._typing import (
        89     Axis,
        90     DataFrameOrSeries,
        91     Dtype,
        92     Label,
        93     Name,
        94     Scalar,
        95     T,
        96     GenericColumn,
        97 )
        98 from pyspark.pandas.accessors import PandasOnSparkFrameMethods
        99 from pyspark.pandas.config import option_context, get_option
   
   File /opt/spark/python/pyspark/pandas/_typing.py:25
        22 from pandas.api.extensions import ExtensionDtype
        24 from pyspark.sql.column import Column as PySparkColumn
   ---> 25 from pyspark.sql.connect.column import Column as ConnectColumn
        26 from pyspark.sql.dataframe import DataFrame as PySparkDataFrame
        27 from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
   
   File /opt/spark/python/pyspark/sql/connect/column.py:19
         1 #
         2 # Licensed to the Apache Software Foundation (ASF) under one or more
         3 # contributor license agreements.  See the NOTICE file distributed with
      (...)
        15 # limitations under the License.
        16 #
        17 from pyspark.sql.connect.utils import check_dependencies
   ---> 19 check_dependencies(__name__)
        21 import datetime
        22 import decimal
   
   File /opt/spark/python/pyspark/sql/connect/utils.py:35, in check_dependencies(mod_name)
        33 require_minimum_pandas_version()
        34 require_minimum_pyarrow_version()
   ---> 35 require_minimum_grpc_version()
   
   File /opt/spark/python/pyspark/sql/connect/utils.py:47, in require_minimum_grpc_version()
        45     import grpc
        46 except ImportError as error:
   ---> 47     raise ImportError(
        48         "grpc >= %s must be installed; however, " "it was not found." % minimum_grpc_version
        49     ) from error
        50 if LooseVersion(grpc.__version__) < LooseVersion(minimum_grpc_version):
        51     raise ImportError(
        52         "gRPC >= %s must be installed; however, "
        53         "your version was %s." % (minimum_grpc_version, grpc.__version__)
        54     )
   
   ImportError: grpc >= 1.48.1 must be installed; however, it was not found.        
   
   `pip install grpc`
   
   Collecting grpc
     Downloading grpc-1.0.0.tar.gz (5.2 kB)
     Preparing metadata (setup.py) ... error
     error: subprocess-exited-with-error
     
     × python setup.py egg_info did not run successfully.
     │ exit code: 1
     ╰─> [6 lines of output]
         Traceback (most recent call last):
           File "<string>", line 2, in <module>
           File "<pip-setuptools-caller>", line 34, in <module>
           File "/tmp/pip-install-vp4d8s4c/grpc_c0f1992ad8f7456b8ac09ecbaeb81750/setup.py", line 33, in <module>
             raise RuntimeError(HINT)
         RuntimeError: Please install the official package with: pip install grpcio
         [end of output]
     
     note: This error originates from a subprocess, and is likely not a problem with pip.
   error: metadata-generation-failed
   
   × Encountered error while generating package metadata.
   ╰─> See above for output.
   
   note: This is an issue with the package mentioned above, not pip.
   hint: See above for details.
   Note: you may need to restart the kernel to use updated packages.
   
   After 
   `pip install grpcio`  
   then it works. 
   
   
   I don't think that every pandas user who tries the pandas API on Spark will use Spark Connect. So can we change this back?




[GitHub] [spark] zhengruifeng commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1161386499


##########
dev/sparktestsupport/modules.py:
##########
@@ -778,6 +778,68 @@ def __hash__(self):
         # ml unittests
         "pyspark.ml.tests.connect.test_connect_function",
         "pyspark.ml.tests.connect.test_parity_torch_distributor",
+        # pandas-on-Spark unittests

Review Comment:
   `pyspark-connect` takes 2~3 hours after adding the PS-related unittests; I guess we can add new modules `pyspark-connect-pandas` and `pyspark-connect-pandas-slow` @itholic
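   
   A hypothetical sketch of what that split could look like in `dev/sparktestsupport/modules.py` (the module name, dependency list, and test goals below are illustrative, not the final ones):
   
   ```python
   pyspark_connect_pandas = Module(
       name="pyspark-connect-pandas",
       dependencies=[pyspark_connect, pyspark_pandas],
       source_file_regexes=["python/pyspark/pandas/tests/connect/"],
       python_test_goals=[
           # pandas-on-Spark parity tests moved out of pyspark-connect
           "pyspark.pandas.tests.connect.test_parity_dataframe",
           "pyspark.pandas.tests.connect.test_parity_series",
       ],
   )
   ```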





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155905487


##########
python/pyspark/pandas/tests/connect/indexes/test_parity_base.py:
##########
@@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+import pandas as pd
+from pyspark import pandas as ps
+from pyspark.pandas.tests.indexes.test_base import IndexesTestsMixin
+from pyspark.testing.connectutils import ReusedConnectTestCase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
+
+
+class IndexesParityTests(
+    IndexesTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
+):
+    @property
+    def pdf(self):

Review Comment:
   why can't we reuse from the parent?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155889925


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -119,29 +127,36 @@ def radd(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not isinstance(right, numbers.Number):
             raise TypeError("Addition can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__radd__)(left, right)
 
     def rsub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Subtraction can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__rsub__)(left, right)
 
     def rmul(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Multiplication can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__rmul__)(left, right)
 
     def rpow(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Exponentiation can not be applied to given types.")
 
-        def rpow_func(left: Column, right: Any) -> Column:
-            return F.when(F.lit(right == 1), right).otherwise(Column.__rpow__(left, right))
+        Column = ConnectColumn if is_remote() else PySparkColumn
+
+        def rpow_func(left: GenericColumn, right: Any) -> GenericColumn:
+            return F.when(F.lit(right == 1), right).otherwise(
+                Column.__rpow__(left, right)  # type: ignore

Review Comment:
   ditto. `left.__rpow__`
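   
   A minimal sketch of the suggested form (hedged: it assumes the same imports and scope as the hunk above, i.e. `F`, `Any`, and `GenericColumn`):
   
   ```python
   def rpow_func(left: GenericColumn, right: Any) -> GenericColumn:
       # Calling __rpow__ on the operand itself avoids choosing a Column
       # class up front, so no is_remote() branching is needed here:
       return F.when(F.lit(right == 1), right).otherwise(left.__rpow__(right))
   ```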



##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -167,18 +182,22 @@ def abs(self, operand: IndexOpsLike) -> IndexOpsLike:
 
     def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__lt__)(left, right)

Review Comment:
   ditto for this and the next three





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155884044


##########
python/pyspark/pandas/_typing.py:
##########
@@ -21,6 +21,12 @@
 import numpy as np
 from pandas.api.extensions import ExtensionDtype
 
+from pyspark.sql.column import Column as LegacyColumn

Review Comment:
   Can you use the same naming? You call some `LegacyColumn` and others `PySparkColumn`.





[GitHub] [spark] itholic commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1495297078

   Applied the comments. Thanks @HyukjinKwon and @zhengruifeng  for the review.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155892588


##########
python/pyspark/pandas/frame.py:
##########
@@ -140,6 +149,11 @@
 )
 from pyspark.pandas.plot import PandasOnSparkPlotAccessor
 

Review Comment:
   no newline





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156684161


##########
python/pyspark/pandas/tests/connect/indexes/test_parity_base.py:
##########
@@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+import pandas as pd
+from pyspark import pandas as ps
+from pyspark.pandas.tests.indexes.test_base import IndexesTestsMixin
+from pyspark.testing.connectutils import ReusedConnectTestCase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
+
+
+class IndexesParityTests(
+    IndexesTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
+):
+    @property
+    def pdf(self):

Review Comment:
   My mistake, we can reuse it. Let me update these after double-checking the other tests as well.





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154324225


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_binary_ops.py:
##########
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.pandas.tests.data_type_ops.test_binary_ops import BinaryOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class BinaryOpsParityTests(
+    BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @unittest.skip("Fails in Spark Connect, should enable.")

Review Comment:
   Just created a ticket so we don't forget: SPARK-42996





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155439646


##########
python/pyspark/pandas/utils.py:
##########
@@ -927,14 +939,20 @@ def spark_column_equals(left: Column, right: Column) -> bool:
     >>> spark_column_equals(sdf1["x"] + 1, sdf2["x"] + 1)
     False
     """
-    return left._jc.equals(right._jc)
+    if isinstance(left, Column):
+        return left._jc.equals(right._jc)
+    elif isinstance(left, SparkConnectColumn):
+        return repr(left) == repr(right)
 

Review Comment:
   Yes, let me add an exception.
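   
   A minimal sketch of the guarded version (the error type and message wording here are illustrative, not the final code):
   
   ```python
   if isinstance(left, Column):
       return left._jc.equals(right._jc)
   elif isinstance(left, SparkConnectColumn):
       return repr(left) == repr(right)
   else:
       # Fail loudly on anything that is neither a legacy nor a Connect column.
       raise TypeError(f"Unsupported column type: {type(left).__name__}")
   ```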





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155904039


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_udt_ops.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps

Review Comment:
   not used





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155413090


##########
python/pyspark/pandas/utils.py:
##########
@@ -466,7 +471,11 @@ def is_testing() -> bool:
 
 
 def default_session() -> SparkSession:
-    spark = SparkSession.getActiveSession()
+    if not is_remote():
+        spark = SparkSession.getActiveSession()
+    else:
+        # TODO: Implement `getActiveSession` for SparkConnect.
+        spark = None

Review Comment:
   there is a workaround for now
   
   ```suggestion
           from pyspark.sql.connect.session import _active_spark_session
   
           spark = _active_spark_session  # type: ignore[assignment]
   ```



##########
python/pyspark/pandas/utils.py:
##########
@@ -792,7 +801,9 @@ def validate_mode(mode: str) -> str:
 
 
 @overload
-def verify_temp_column_name(df: SparkDataFrame, column_name_or_label: str) -> str:

Review Comment:
   In the above type hint definition: `SparkDataFrame = Union[PySparkDataFrame, SparkConnectDataFrame]`
   
   Can we unify the naming in some way? It is a bit confusing.



##########
dev/sparktestsupport/modules.py:
##########
@@ -777,6 +777,68 @@ def __hash__(self):
         "pyspark.ml.connect.functions",
         # ml unittests
         "pyspark.ml.tests.connect.test_connect_function",
+        # pandas-on-Spark unittests

Review Comment:
   Just to confirm, the doctests are not reused for now?



##########
python/pyspark/pandas/utils.py:
##########
@@ -927,14 +939,20 @@ def spark_column_equals(left: Column, right: Column) -> bool:
     >>> spark_column_equals(sdf1["x"] + 1, sdf2["x"] + 1)
     False
     """
-    return left._jc.equals(right._jc)
+    if isinstance(left, Column):
+        return left._jc.equals(right._jc)
+    elif isinstance(left, SparkConnectColumn):
+        return repr(left) == repr(right)
 

Review Comment:
   Should we raise an exception here if it's not a Column or a SparkConnectColumn?



##########
python/pyspark/pandas/utils.py:
##########
@@ -792,7 +801,9 @@ def validate_mode(mode: str) -> str:
 
 
 @overload
-def verify_temp_column_name(df: SparkDataFrame, column_name_or_label: str) -> str:

Review Comment:
   I personally prefer something like:
   
   ```
   GenericDataFrame = Union[LegacyDataFrame, ConnectDataFrame]
   GenericColumn = Union[LegacyColumn, ConnectColumn]
   ...
   ```





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156659799


##########
python/pyspark/pandas/_typing.py:
##########
@@ -21,6 +21,12 @@
 import numpy as np
 from pandas.api.extensions import ExtensionDtype
 
+from pyspark.sql.column import Column as LegacyColumn

Review Comment:
   Sure. Let me consolidate them with `LegacyColumn`.





[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501047901

   The problem is that you don't use Spark Connect but it complains that it needs `grpcio`




[GitHub] [spark] itholic commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501097772

   Got it. Then I think we need to modify the current code to import the `ConnectColumn` only when `is_remote()` is `True` across all code paths. For example:
   
   **Before**
   ```python
   from pyspark.sql.column import Column as PySparkColumn
   from pyspark.sql.connect.column import Column as ConnectColumn
   from pyspark.sql.utils import is_remote
   
   def example():
       Column = ConnectColumn if is_remote() else PySparkColumn
       return Column.__eq__
   ```
   
   **After**
   ```python
   from pyspark.sql.column import Column as PySparkColumn
   from pyspark.sql.utils import is_remote
   
   def example():
       if is_remote():
           from pyspark.sql.connect.column import Column as ConnectColumn
           Column = ConnectColumn
       else:
           Column = PySparkColumn
       return Column.__eq__
   ```
   
   Also, we should remove `GenericColumn` and `GenericDataFrame`.
   
   Do you happen to have a better solution in mind? As of now, I haven't come up with a better way than this.
   
   Thanks!




[GitHub] [spark] bjornjorgensen commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501121526

   One of the key things with the pandas API on Spark is that users only have to change `pd` to `ps` in the import. But now they also have to install gRPC.
   
   Does this PR introduce any user-facing change?
   Yes, every pandas API on Spark user needs to install gRPC.
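   
   For reference, the advertised migration path is just the import swap (a minimal example, assuming a plain pip-installed pyspark):
   
   ```python
   # import pandas as pd
   import pyspark.pandas as ps
   
   df = ps.DataFrame({"a": [1, 2, 3]})  # same call shape as pandas
   ```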




[GitHub] [spark] bjornjorgensen commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501132941

   The modules that the pandas API on Spark needs are pandas, pyarrow, and numpy, which every pandas user already has installed.
   And now they have to add connect, grpcio, grpcio-status, and googleapis-common-protos.
   
   Why does the pandas API on Spark need such high coupling to the connect module?




[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155437440


##########
python/pyspark/pandas/utils.py:
##########
@@ -466,7 +471,11 @@ def is_testing() -> bool:
 
 
 def default_session() -> SparkSession:
-    spark = SparkSession.getActiveSession()
+    if not is_remote():
+        spark = SparkSession.getActiveSession()
+    else:
+        # TODO: Implement `getActiveSession` for SparkConnect.
+        spark = None

Review Comment:
   Cool. Thanks!





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156690881


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps
+from pyspark.pandas.tests.data_type_ops.test_string_ops import StringOpsTestsMixin
+from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class StringOpsParityTests(
+    StringOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
+):
+    @property
+    def psdf(self):
+        return ps.from_pandas(self.pdf)

Review Comment:
   Because the existing `StringOpsTestsMixin` inherits `OpsTestBase`, which gets `psdf` from `ComparisonTestBase`, but `StringOpsParityTests` doesn't inherit `ComparisonTestBase` because that uses the legacy Spark session.
   So, we should define `psdf` separately here.
   Otherwise it complains `AttributeError: 'StringOpsParityTests' object has no attribute 'psdf'`.





[GitHub] [spark] HyukjinKwon commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501261686

   > Is it so hard to add the dependency for grpc when using the pandas API?
   
   It's not super hard. But it's a bit odd to add this to the pandas API on Spark alone. We should probably think about adding grpcio as a hard dependency for the whole PySpark project, but definitely not for the pandas API on Spark alone.
   
   > What are we achieving with this? GRPC is a stable protocol and not a random library. It's available throughout all platforms.
   > What's the benefit of trying this pure approach?
   
   So, for the current status, we're trying to add only the dependencies that each module needs, so users won't need to install unnecessary dependencies. In addition, adding the dependency breaks existing applications when they migrate from 3.4 to 3.5. It matters when PySpark is installed without `pip` (which is actually the official release channel of Apache Spark).
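   A minimal sketch of the lazy-guard pattern this points at, assuming a hypothetical helper name (`require_grpcio` is not PySpark's actual API): raise only when Spark Connect is actually used, so plain pandas-API-on-Spark users never pay for the dependency.
   ```python
   def require_grpcio() -> None:
       # Hypothetical guard, for illustration: import grpc lazily so the
       # ImportError surfaces only on the Spark Connect code path.
       try:
           import grpc  # noqa: F401
       except ImportError as e:
           raise ImportError(
               "Spark Connect requires grpcio; install it with `pip install grpcio`."
           ) from e
   ```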
   
   




[GitHub] [spark] bjornjorgensen commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1501106942

   "The problem is that you don't use Spark Connect but it complains that it needs grpcio"
   Yes, this is correct :) 
   https://github.com/apache/spark/pull/40716 was for a typo so users of connect will know how to install the package.
   
   The is_remote() can probably work. 
   Perhaps we can look at how pyspark did solve this?   




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155889467


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -119,29 +127,36 @@ def radd(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not isinstance(right, numbers.Number):
             raise TypeError("Addition can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__radd__)(left, right)

Review Comment:
   ditto



##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -119,29 +127,36 @@ def radd(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not isinstance(right, numbers.Number):
             raise TypeError("Addition can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__radd__)(left, right)
 
     def rsub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Subtraction can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__rsub__)(left, right)

Review Comment:
   ditto



##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -119,29 +127,36 @@ def radd(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not isinstance(right, numbers.Number):
             raise TypeError("Addition can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__radd__)(left, right)
 
     def rsub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Subtraction can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__rsub__)(left, right)
 
     def rmul(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
         if not isinstance(right, numbers.Number):
             raise TypeError("Multiplication can not be applied to given types.")
         right = transform_boolean_operand_to_numeric(right)
+        Column: Type[GenericColumn] = ConnectColumn if is_remote() else PySparkColumn
         return column_op(Column.__rmul__)(left, right)
 

Review Comment:
   ditto
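   
   The repeated "ditto" points at the same rewrite suggested elsewhere in this review: dispatch through the operand's own dunder instead of selecting a Column class up front. A hedged sketch for the `__radd__` case, reusing `column_op`, `GenericColumn`, and the validation from the diff above:
   ```python
   def radd(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
       ...  # type checks and transform_boolean_operand_to_numeric as above
       # The lambda defers the lookup to call time, so __radd__ resolves
       # against whichever Column implementation `l` actually is.
       return column_op(
           cast(Callable[..., GenericColumn], lambda l, r: l.__radd__(r))
       )(left, right)
   ```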





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155883257


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -470,14 +474,16 @@ def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         else:
             from pyspark.pandas.base import column_op
 
-            return column_op(Column.__eq__)(left, right)
+            Column = ConnectColumn if is_remote() else PySparkColumn
+            return column_op(cast(Callable[..., GenericColumn], Column.__eq__))(left, right)
 
     def ne(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         from pyspark.pandas.base import column_op
 
         _sanitize_list_like(right)
 
-        return column_op(Column.__ne__)(left, right)
+        Column = ConnectColumn if is_remote() else PySparkColumn
+        return column_op(cast(Callable[..., GenericColumn], Column.__ne__))(left, right)

Review Comment:
   ditto





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155914955


##########
python/pyspark/pandas/utils.py:
##########
@@ -802,7 +820,8 @@ def verify_temp_column_name(df: "DataFrame", column_name_or_label: Name) -> Labe
 
 
 def verify_temp_column_name(
-    df: Union["DataFrame", SparkDataFrame], column_name_or_label: Union[str, Name]
+    df: Union["DataFrame", LegacyDataFrame, ConnectDataFrame],

Review Comment:
   Use the same name. LegacyDataFrame vs PySparkDataFrame





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1156688698


##########
python/pyspark/pandas/tests/connect/data_type_ops/testing_utils.py:
##########
@@ -0,0 +1,226 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import decimal
+from distutils.version import LooseVersion
+
+import numpy as np
+import pandas as pd
+
+import pyspark.pandas as ps
+from pyspark.pandas.typedef import extension_dtypes
+
+from pyspark.pandas.typedef.typehints import (
+    extension_dtypes_available,
+    extension_float_dtypes_available,
+    extension_object_dtypes_available,
+)
+
+if extension_dtypes_available:
+    from pandas import Int8Dtype, Int16Dtype, Int32Dtype, Int64Dtype
+
+if extension_float_dtypes_available:
+    from pandas import Float32Dtype, Float64Dtype
+
+if extension_object_dtypes_available:
+    from pandas import BooleanDtype, StringDtype
+
+
+class OpsTestBase:

Review Comment:
   Because the existing `OpsTestBase` under `python/pyspark/tests/data_type_ops/testing_utils.py` inherits `ComparisonTestBase`, which uses the legacy Spark session instead of the Spark Connect session.
   
   So we need a separate `OpsTestBase` for the Connect tests, to be combined with `ReusedConnectTestCase`.
   
   e.g.
   ```
   # test_parity_binary_ops.py
   class BinaryOpsParityTests(
       BinaryOpsTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase
   ):
       ...
   ```





[GitHub] [spark] ueshin commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1158858408


##########
python/pyspark/pandas/data_type_ops/boolean_ops.py:
##########
@@ -238,8 +241,8 @@ def __and__(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             return right.__and__(left)
         else:
 
-            def and_func(left: Column, right: Any) -> Column:
-                if not isinstance(right, Column):
+            def and_func(left: GenericColumn, right: Any) -> GenericColumn:
+                if not isinstance(right, (Column, ConnectColumn)):

Review Comment:
   I guess `isinstance(right, GenericColumn.__args__)` works?
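   
   For the mechanism: a `typing.Union` alias exposes its member types as a tuple via `__args__`, and `isinstance` accepts a tuple of types. A self-contained illustration with plain types:
   ```python
   from typing import Union

   IntOrStr = Union[int, str]

   # Union.__args__ is the tuple (int, str), so it feeds isinstance directly.
   assert isinstance(3, IntOrStr.__args__)
   assert isinstance("a", IntOrStr.__args__)
   assert not isinstance(3.0, IntOrStr.__args__)
   ```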



##########
python/pyspark/pandas/internal.py:
##########
@@ -962,9 +1025,9 @@ def field_for(self, label: Label) -> InternalField:
             raise KeyError(name_like_string(label))
 
     @property
-    def spark_frame(self) -> SparkDataFrame:
+    def spark_frame(self) -> LegacyDataFrame:

Review Comment:
   `GenericDataFrame`?



##########
python/pyspark/pandas/internal.py:
##########
@@ -1179,7 +1242,7 @@ def resolved_copy(self) -> "InternalFrame":
 
     def with_new_sdf(
         self,
-        spark_frame: SparkDataFrame,
+        spark_frame: LegacyDataFrame,

Review Comment:
   ditto?



##########
python/pyspark/pandas/internal.py:
##########
@@ -25,7 +25,13 @@
 import pandas as pd
 from pandas.api.types import CategoricalDtype  # noqa: F401
 from pyspark._globals import _NoValue, _NoValueType
-from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame, Window
+from pyspark.sql import (
+    functions as F,
+    Column,
+    DataFrame as LegacyDataFrame,
+    Window,
+    SparkSession as PySparkSession,
+)

Review Comment:
   Should use only either `LegacyXxx` or `PySparkXxx`?



##########
python/pyspark/pandas/internal.py:
##########
@@ -616,8 +628,7 @@ def __init__(
         >>> internal.column_label_names
         [('column_labels_a',), ('column_labels_b',)]
         """
-
-        assert isinstance(spark_frame, SparkDataFrame)
+        assert isinstance(spark_frame, (LegacyDataFrame, ConnectDataFrame))

Review Comment:
   `isinstance(spark_frame, GenericDataFrame.__args__)` should work?



##########
python/pyspark/pandas/internal.py:
##########
@@ -1037,7 +1100,7 @@ def data_fields(self) -> List[InternalField]:
         return self._data_fields
 
     @lazy_property
-    def to_internal_spark_frame(self) -> SparkDataFrame:
+    def to_internal_spark_frame(self) -> LegacyDataFrame:

Review Comment:
   ditto?



##########
python/pyspark/pandas/spark/accessors.py:
##########
@@ -64,7 +68,7 @@ def column(self) -> Column:
         """
         return self._data._internal.spark_column_for(self._data._column_label)
 
-    def transform(self, func: Callable[[Column], Column]) -> IndexOpsLike:
+    def transform(self, func: Callable[[Column], Union[Column, ConnectColumn]]) -> IndexOpsLike:

Review Comment:
   `Callable[[GenericColumn], GenericColumn]`?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155902736


##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py:
##########
@@ -0,0 +1,67 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps

Review Comment:
   not used?



##########
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_num_ops.py:
##########
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark import pandas as ps

Review Comment:
   not used





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1154321016


##########
dev/sparktestsupport/modules.py:
##########
@@ -693,6 +693,56 @@ def __hash__(self):
         "pyspark.pandas.tests.test_typedef",
         "pyspark.pandas.tests.test_utils",
         "pyspark.pandas.tests.test_window",
+        # unittests - Spark Connect parity tests

Review Comment:
   That pretty much makes sense to me.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155889262


##########
python/pyspark/pandas/data_type_ops/num_ops.py:
##########
@@ -104,11 +110,13 @@ def pow(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         if not is_valid_operand_for_numeric_arithmetic(right):
             raise TypeError("Exponentiation can not be applied to given types.")
 
-        def pow_func(left: Column, right: Any) -> Column:
+        Column = ConnectColumn if is_remote() else PySparkColumn
+
+        def pow_func(left: GenericColumn, right: Any) -> GenericColumn:
             return (
-                F.when(left == 1, left)
+                F.when(left == 1, left)  # type: ignore
                 .when(F.lit(right) == 0, 1)
-                .otherwise(Column.__pow__(left, right))
+                .otherwise(Column.__pow__(left, right))  # type: ignore

Review Comment:
   ditto. `left.__pow__`
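   
   Sketched out, the suggested rewrite drops the up-front class selection entirely (names as in the diff above):
   ```python
   def pow_func(left: GenericColumn, right: Any) -> GenericColumn:
       # Calling __pow__ on `left` itself dispatches to whichever Column
       # implementation is in play, so no Column class needs to be picked.
       return (
           F.when(left == 1, left)
           .when(F.lit(right) == 0, 1)
           .otherwise(left.__pow__(right))
       )
   ```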





[GitHub] [spark] xinrong-meng commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1158832902


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1628,11 +1629,29 @@ def checkpoint(self, *args: Any, **kwargs: Any) -> None:
     def localCheckpoint(self, *args: Any, **kwargs: Any) -> None:
         raise NotImplementedError("localCheckpoint() is not implemented.")
 
-    def to_pandas_on_spark(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("to_pandas_on_spark() is not implemented.")
-
-    def pandas_api(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("pandas_api() is not implemented.")
+    def to_pandas_on_spark(
+        self, index_col: Optional[Union[str, List[str]]] = None
+    ) -> "PandasOnSparkDataFrame":
+        warnings.warn(
+            "DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.",
+            FutureWarning,
+        )
+        return self.pandas_api(index_col)
+
+    def pandas_api(

Review Comment:
   nit: Shall we add a docstring for `pandas_api`?
   
   How about
   ```
   pandas_api.__doc__ = PySparkDataFrame.pandas_api.__doc__
   ```





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159192785


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1628,11 +1629,29 @@ def checkpoint(self, *args: Any, **kwargs: Any) -> None:
     def localCheckpoint(self, *args: Any, **kwargs: Any) -> None:
         raise NotImplementedError("localCheckpoint() is not implemented.")
 
-    def to_pandas_on_spark(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("to_pandas_on_spark() is not implemented.")
-
-    def pandas_api(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("pandas_api() is not implemented.")
+    def to_pandas_on_spark(
+        self, index_col: Optional[Union[str, List[str]]] = None
+    ) -> "PandasOnSparkDataFrame":
+        warnings.warn(
+            "DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.",
+            FutureWarning,
+        )
+        return self.pandas_api(index_col)
+
+    def pandas_api(

Review Comment:
   Nice catch. Will address the comments.



##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1628,11 +1629,29 @@ def checkpoint(self, *args: Any, **kwargs: Any) -> None:
     def localCheckpoint(self, *args: Any, **kwargs: Any) -> None:
         raise NotImplementedError("localCheckpoint() is not implemented.")
 
-    def to_pandas_on_spark(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("to_pandas_on_spark() is not implemented.")
-
-    def pandas_api(self, *args: Any, **kwargs: Any) -> None:
-        raise NotImplementedError("pandas_api() is not implemented.")
+    def to_pandas_on_spark(
+        self, index_col: Optional[Union[str, List[str]]] = None
+    ) -> "PandasOnSparkDataFrame":
+        warnings.warn(
+            "DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.",
+            FutureWarning,
+        )
+        return self.pandas_api(index_col)
+
+    def pandas_api(

Review Comment:
   Nice catch. Will address the comment





[GitHub] [spark] itholic commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1159207166


##########
python/pyspark/pandas/internal.py:
##########
@@ -962,9 +1025,9 @@ def field_for(self, label: Label) -> InternalField:
             raise KeyError(name_like_string(label))
 
     @property
-    def spark_frame(self) -> SparkDataFrame:
+    def spark_frame(self) -> LegacyDataFrame:

Review Comment:
   Maybe I can address the remaining mypy checks, including this one, in a follow-up, since updating this brings hundreds of additional mypy errors?
   Btw, there are too many mypy errors to handle all the typing at once in this PR, so I created a separate ticket to address the remaining typing in subsequent tasks: SPARK-43045





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155882908


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -470,14 +474,16 @@ def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         else:
             from pyspark.pandas.base import column_op
 
-            return column_op(Column.__eq__)(left, right)
+            Column = ConnectColumn if is_remote() else PySparkColumn
+            return column_op(cast(Callable[..., GenericColumn], Column.__eq__))(left, right)

Review Comment:
   Can you do
   ```python
   return column_op(cast(Callable[..., GenericColumn], lambda x, y: x.__eq__(y)))(left, right)
   ```





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155880102


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -53,6 +53,10 @@
     spark_type_to_pandas_dtype,
 )
 

Review Comment:
   I think it shouldn't have a newline here





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1148591182


##########
python/pyspark/pandas/data_type_ops/base.py:
##########
@@ -24,7 +24,7 @@
 import pandas as pd
 from pandas.api.types import CategoricalDtype
 
-from pyspark.sql import functions as F, Column
+from pyspark.sql import functions as F, Column as PySparkColumn

Review Comment:
   Does this work?
   ```python
   from pyspark.sql import functions as F
   from pyspark.sql.column import Column as PySparkColumn
   ```


