Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/04/03 01:31:38 UTC

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

zhengruifeng commented on code in PR #40525:
URL: https://github.com/apache/spark/pull/40525#discussion_r1155413090


##########
python/pyspark/pandas/utils.py:
##########
@@ -466,7 +471,11 @@ def is_testing() -> bool:
 
 
 def default_session() -> SparkSession:
-    spark = SparkSession.getActiveSession()
+    if not is_remote():
+        spark = SparkSession.getActiveSession()
+    else:
+        # TODO: Implement `getActiveSession` for SparkConnect.
+        spark = None

Review Comment:
   There is a workaround for now:
   
   ```suggestion
           from pyspark.sql.connect.session import _active_spark_session
   
           spark = _active_spark_session  # type: ignore[assignment]
   ```



##########
python/pyspark/pandas/utils.py:
##########
@@ -792,7 +801,9 @@ def validate_mode(mode: str) -> str:
 
 
 @overload
-def verify_temp_column_name(df: SparkDataFrame, column_name_or_label: str) -> str:

Review Comment:
   In the type hint definitions above: `SparkDataFrame = Union[PySparkDataFrame, SparkConnectDataFrame]`
   
   Can we unify the naming in some way? It is a bit confusing.



##########
dev/sparktestsupport/modules.py:
##########
@@ -777,6 +777,68 @@ def __hash__(self):
         "pyspark.ml.connect.functions",
         # ml unittests
         "pyspark.ml.tests.connect.test_connect_function",
+        # pandas-on-Spark unittests

Review Comment:
   Just to confirm: the doctests are not reused for now?



##########
python/pyspark/pandas/utils.py:
##########
@@ -927,14 +939,20 @@ def spark_column_equals(left: Column, right: Column) -> bool:
     >>> spark_column_equals(sdf1["x"] + 1, sdf2["x"] + 1)
     False
     """
-    return left._jc.equals(right._jc)
+    if isinstance(left, Column):
+        return left._jc.equals(right._jc)
+    elif isinstance(left, SparkConnectColumn):
+        return repr(left) == repr(right)
 

Review Comment:
   Should we raise an exception here if it's neither a Column nor a SparkConnectColumn?
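   A minimal sketch of the suggested fallback, using placeholder classes instead of the real pyspark `Column`/`SparkConnectColumn` types (the class names and the `repr`-based comparison are illustrative assumptions, not the PR's final code):
   
   ```python
   from typing import Any
   
   
   # Hypothetical stand-ins for pyspark's legacy Column and Spark Connect Column.
   class LegacyColumn:
       pass
   
   
   class ConnectColumn:
       def __init__(self, name: str) -> None:
           self.name = name
   
       def __repr__(self) -> str:
           return f"ConnectColumn({self.name})"
   
   
   def column_equals(left: Any, right: Any) -> bool:
       if isinstance(left, LegacyColumn):
           # Placeholder for the JVM-side `left._jc.equals(right._jc)` check.
           return left is right
       elif isinstance(left, ConnectColumn):
           # Spark Connect columns have no JVM handle; compare their plans via repr.
           return repr(left) == repr(right)
       else:
           # The point of the review comment: fail loudly on unexpected types.
           raise TypeError(f"Unexpected column type: {type(left).__name__}")
   ```
   
   Without the final `raise`, an unexpected type would silently fall through and return `None`, which callers treating the result as a `bool` would misread as `False`.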



##########
python/pyspark/pandas/utils.py:
##########
@@ -792,7 +801,9 @@ def validate_mode(mode: str) -> str:
 
 
 @overload
-def verify_temp_column_name(df: SparkDataFrame, column_name_or_label: str) -> str:

Review Comment:
   I personally prefer something like:
   
   ```
   GenericDataFrame = Union[LegacyDataFrame, ConnectDataFrame]
   GenericColumn = Union[LegacyColumn, ConnectColumn]
   ...
   ```
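   For illustration, here is how such unified aliases could look and be used in runtime checks; the `Legacy*`/`Connect*` classes below are placeholders standing in for the real pyspark types, not the actual implementation:
   
   ```python
   from typing import Union
   
   
   # Hypothetical stand-ins for the classic and Spark Connect DataFrame classes.
   class LegacyDataFrame:
       pass
   
   
   class ConnectDataFrame:
       pass
   
   
   # One alias covers both implementations in type hints.
   GenericDataFrame = Union[LegacyDataFrame, ConnectDataFrame]
   
   
   def describe(df: GenericDataFrame) -> str:
       # isinstance takes a tuple of the union's members at runtime.
       if isinstance(df, (LegacyDataFrame, ConnectDataFrame)):
           return type(df).__name__
       raise TypeError("expected a DataFrame")
   ```
   
   A single `Generic*` prefix keeps call sites readable and avoids mixing naming schemes like `SparkDataFrame` vs `SparkConnectDataFrame` across the module.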



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
